**SICK: An Efficient Approach to JSON-Like Data Handling**
SICK is an approach to handling JSON-like structures, together with a set of libraries that implement it. It offers several advantages over traditional JSON processing:
- Store JSON-like data in an efficient, indexed binary form.
- Avoid reading and parsing entire JSON files, accessing only the data you need just in time.
- Store multiple JSON-like structures within a single deduplicating storage.
- Implement perfect streaming parsers for JSON-like data.
- Efficiently stream updates for JSON-like data.
---
### The Challenges with JSON Parsing
JSON has a Type-2 (context-free) grammar, so parsing it properly requires a pushdown automaton. This makes it impossible to implement efficient streaming parsers for JSON. For example, in a huge hierarchy of deeply nested JSON objects, you cannot finish parsing the top-level object until you have processed the entire file.
JSON is commonly used for storing and transferring large amounts of data, and the volume of these transfers tends to grow over time. Consider a typical JSON configuration file for a large enterprise product. Because almost all JSON parsers are non-streaming, you must perform extensive work every time you deserialize a large JSON file. This includes reading it from disk, parsing it into an abstract syntax tree (AST) in memory, and mapping the raw JSON tree to object instances.
Even if you use token streams and know your object type ahead of time, you still have to deal with a Type-2 grammar. The whole process is inefficient and can cause unnecessary delays, CPU spikes, and high memory consumption.
---
### Flattening and Deduplicating JSON Structures with SICK
To illustrate SICK's approach, consider a small JSON example: the array `[{"some key": "some value"}, {"some key": "some value"}, {"some value": "some key"}]`. We can build a table listing every unique value in the JSON, with its type, its index, and whether it is a root entry:
| Type | Index | Value | Is Root |
|---|---|---|---|
| string | 0 | "some key" | No |
| string | 1 | "some value" | No |
| object | 0 | [string: 0, string: 1] | No |
| object | 1 | [string: 1, string: 0] | No |
| array | 0 | [object: 0, object: 0, object: 1] | Yes (file.json) |
We have flattened and deduplicated the initial JSON structure into a table. This representation enables many possibilities, such as streaming the table. Although the particular encoding shown here is inefficient, it is streamable and can incorporate removal messages to support arbitrary updates.
An interesting property is that a stream containing no removal entries can be safely reordered. This encoding also reduces the need to accumulate the full data, although receivers may still need to buffer entries until they can be fully processed.
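The flattening and deduplication step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the algorithm used by the SICK libraries: the `flatten` function, the `table` and `index` dictionaries, and the type names are all hypothetical.

```python
import json

def flatten(value, table, index):
    """Return a (type, index) reference for `value`, adding new rows to `table`."""
    if isinstance(value, dict):
        kind = "object"
        # Each object entry becomes a (key reference, value reference) pair.
        content = tuple(
            (flatten(k, table, index), flatten(v, table, index))
            for k, v in value.items()
        )
    elif isinstance(value, list):
        kind = "array"
        content = tuple(flatten(v, table, index) for v in value)
    elif isinstance(value, str):
        kind, content = "string", value
    else:
        # Other JSON scalars (numbers, booleans, null) are stored under their Python type name.
        kind, content = type(value).__name__, value

    key = (kind, content)
    if key not in index:  # deduplication: identical values share a single row
        rows = table.setdefault(kind, [])
        index[key] = (kind, len(rows))
        rows.append(content)
    return index[key]

doc = json.loads(
    '[{"some key": "some value"}, {"some key": "some value"}, {"some value": "some key"}]'
)
table, index = {}, {}
root = flatten(doc, table, index)  # reference to the top-level array
```

Here `table["string"]` ends up with two rows and `table["object"]` with two rows, even though the input contains three objects, and `root` refers to the single array row.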
---
### Data Structures in SICK Encoding
In our table, the only complex data structures in the “Value” column are lists and pairs of (type, index), which we call **references**. A reference can be efficiently represented as a pair of integers, thus having a fixed byte length.
- A list of references can be represented as an integer indicating the list length, followed by all the references in binary form.
- A list of fixed-size scalar values can be represented in the same way.
- A list of variable-size values (e.g., a list of strings) can be encoded as the element count, followed by the offset of each element, followed by the concatenated binary data.
For example, the list `["a", "bb", "ccc"]` might be encoded as:
```
3 0 1 3 a bb ccc
```
(without the spaces in the actual representation).
This encoding is indexed and can be reused to store any list of variable-length data. As a result, SICK's binary structure is indexed: once you know the index of an element, you can access it immediately.
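To make the indexed layout concrete, here is a minimal sketch of a variable-length list stored as an element count, a table of offsets, and the concatenated data. The 4-byte big-endian integers and the function names `encode_strings` and `read_string` are assumptions made for this illustration; the actual SICK wire format may differ.

```python
import struct

def encode_strings(items):
    """Encode strings as: count, one offset per element, concatenated UTF-8 data."""
    blobs = [s.encode("utf-8") for s in items]
    offsets, position = [], 0
    for blob in blobs:
        offsets.append(position)
        position += len(blob)
    # Assumption: count and offsets are 4-byte big-endian unsigned integers.
    header = struct.pack(f">{len(blobs) + 1}I", len(blobs), *offsets)
    return header + b"".join(blobs)

def read_string(buffer, i):
    """Read element `i` directly, without decoding the other elements."""
    (count,) = struct.unpack_from(">I", buffer, 0)
    if not 0 <= i < count:
        raise IndexError(i)
    data_start = 4 + 4 * count  # the data begins right after the count and offsets
    (start,) = struct.unpack_from(">I", buffer, 4 + 4 * i)
    if i + 1 < count:
        (end,) = struct.unpack_from(">I", buffer, 4 + 4 * (i + 1))
    else:
        end = len(buffer) - data_start  # the last element runs to the end of the data
    return buffer[data_start + start:data_start + end].decode("utf-8")

encoded = encode_strings(["a", "bb", "ccc"])
second = read_string(encoded, 1)
```

Reading element `i` touches only the count, two offsets, and the bytes of that element, which is what makes the structure usable without a full parse.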
---
### Key Features of SICK Encoding
SICK encoding follows the compositional style of JSON (primitive types plus lists and dictionaries), but with added power:
1. **Circular References**
The table may store circular references, something that JSON cannot represent natively. For example:
| Type | Index | Value | Is Root |
|---|---|---|---|
| object | 0 | [string: 0, object: 1] | No |
| object | 1 | [string: 1, object: 0] | No |
Circular references are useful for graph-shaped data, for example a child node that refers back to its parent.
2. **Multiple JSON Files with Deduplication**
You can store multiple JSON files within a single table and achieve full deduplication of their content. To manage this, a separate attribute (like *is root*) tracks the name of the root entry representing each JSON file.
In practice, it is more convenient to introduce a dedicated **root** type, where each root value consists of a reference to its name and a reference to the encoded JSON value:
| Type | Index | Value |
|---|---|---|
| string | 0 | "some key" |
| string | 1 | "some value" |
| string | 2 | "file.json" |
| object | 0 | [string: 0, string: 1] |
| object | 1 | [string: 1, string: 0] |
| array | 0 | [object: 0, object: 0, object: 1] |
| root | 0 | [string: 2, array: 0] |
3. **Native Support for Custom Scalar Types**
SICK allows encoding custom scalar data types (such as timestamps) natively by introducing new type tags.
4. **Polymorphic Types**
Polymorphic types can be stored by introducing new type tags or type references.
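The root-entry layout from item 2 can be explored with a small in-memory model. The sketch below assumes the root's name string is `file.json` and uses hypothetical helpers (`lookup`, `find_root`, `get_field`); it is not the API of any SICK library. It shows how a reader can find a document by name and follow only the references it needs.

```python
# Hypothetical in-memory form of the table: type name -> list of values.
TABLE = {
    "string": ["some key", "some value", "file.json"],
    "object": [
        [(("string", 0), ("string", 1))],  # object 0: one (key ref, value ref) entry
        [(("string", 1), ("string", 0))],  # object 1
    ],
    "array": [[("object", 0), ("object", 0), ("object", 1)]],
    "root": [(("string", 2), ("array", 0))],  # (name ref, value ref)
}

def lookup(ref):
    """Resolve a (type, index) reference one level deep."""
    kind, index = ref
    return TABLE[kind][index]

def find_root(name):
    """Return the value reference of the root entry with the given name."""
    for name_ref, value_ref in TABLE["root"]:
        if lookup(name_ref) == name:
            return value_ref
    raise KeyError(name)

def get_field(object_ref, key):
    """Return the value reference stored under `key` in an object."""
    for key_ref, value_ref in lookup(object_ref):
        if lookup(key_ref) == key:
            return value_ref
    raise KeyError(key)

array_ref = find_root("file.json")
third = lookup(array_ref)[2]                     # reference to the third array element
value = lookup(get_field(third, "some value"))   # only this path is resolved
```

Because references are resolved one level at a time, the same access pattern also works on tables that contain circular references: nothing is expanded until it is requested.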
---
### Available Implementations
Currently, SICK is implemented in:
- **C#**: Binary storage with encoder and decoder.
- **Scala**: Binary storage with encoder and decoder backed by Circe (a JSON library).
- **JavaScript**: Basic implementation backed by Scala.
At present, none of the implementations support streaming capabilities out of the box, though this feature may be added in the future. Contributions to add streaming support are welcome.
---
### SICK Type Markers
Each value type in SICK is represented by a one-byte unsigned integer type marker. The markers, their names, comments, lengths, and language mappings are as follows:
| Marker | Name | Comment | Length (bytes) | C# Mapping | Scala Mapping |
|---|---|---|---|---|---|
| 0 | TNul | Equivalent to JSON `null` | 4 | Stored in the marker | Stored in the marker |
| 1 | TBit | Boolean | 4 | Stored in the marker | Stored in the marker |
| 2 | TByte | Byte (unsigned in C#, signed in Scala) | 4 | byte | Byte (signed) |
| 3 | TShort | Signed 16-bit integer | 4 | short | Short |
| 4 | TInt | Signed 32-bit integer | 4 | int | Int |
| 5 | TLng | Signed 64-bit integer | 8 | long | Long |
| 6 | TBigInt| Variable length, prefixed length | Variable | BigInteger | BigInt |
| 7 | TDbl | 64-bit double | 8 | double | Double |
| 8 | TFlt | 32-bit float | 4 | float | Float |
| 9 | TBigDec| Variable length; scale/precision/signum/unscaled | Variable | Decimal (custom) | BigDecimal |
| 10 | TStr | UTF-8 String, variable length | Variable | string | String |
| 11 | TArr | List of array entries | Variable | List (references) | List (references) |
| 12 | TObj | List of object entries | Variable | Dictionary (references) | Map (references) |
| 15 | TRoot | Root entry: index of the name string + reference | 9 (4 + 5) | Custom | Custom |
Note: Array entries are represented solely by references.
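As an illustration of how the markers could be used, the sketch below models them as a Python enum and packs a reference as a one-byte marker followed by a four-byte value. The five-byte reference is inferred from the `9 (4 + 5)` length given for `TRoot`; the big-endian byte order and the helper names `pack_ref`, `unpack_ref`, and `pack_root` are assumptions, not part of the documented format.

```python
import struct
from enum import IntEnum

class Marker(IntEnum):
    """Type markers as listed in the table above."""
    TNul = 0
    TBit = 1
    TByte = 2
    TShort = 3
    TInt = 4
    TLng = 5
    TBigInt = 6
    TDbl = 7
    TFlt = 8
    TBigDec = 9
    TStr = 10
    TArr = 11
    TObj = 12
    TRoot = 15

def pack_ref(marker, value):
    """Pack a reference: 1-byte marker + 4-byte value (byte order is an assumption)."""
    return struct.pack(">BI", marker, value)

def unpack_ref(data):
    """Inverse of pack_ref; returns (Marker, value)."""
    marker, value = struct.unpack(">BI", data)
    return Marker(marker), value

def pack_root(name_index, ref):
    """Pack a root entry: 4-byte index of the name string + 5-byte reference."""
    return struct.pack(">I", name_index) + pack_ref(*ref)

array_ref = pack_ref(Marker.TArr, 0)
root_entry = pack_root(2, (Marker.TArr, 0))
```

With this layout a reference always occupies 5 bytes and a root entry 9 bytes, so lists of references can be indexed by simple multiplication.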
---
### Current Limitations
The current SICK implementation imposes some limitations:
- Maximum object size: 65,534 keys.
- The order of object keys is not preserved.
- Maximum number of array elements: 2^32.
- Maximum number of unique values of the same type: 2^32.
These limitations stem from the sizes of the offset pointers and counts stored at the binary level. They may be lifted in future versions by increasing the pointer sizes, but the current limits are generally sufficient for real-world applications.
---
### Summary
SICK provides a powerful and efficient method to store, process, and stream JSON-like data. By moving from traditional JSON’s text-based, unindexed format to an indexed binary structure, SICK enables:
- Efficient random access to data without parsing entire files.
- Support for circular references and deduplicated storage of multiple JSON documents.
- Encoding of custom scalar and polymorphic types.
- Potential streaming support for updates and parsing.
Open-source implementations are available in C#, Scala, and JavaScript, and the community is invited to contribute and to extend streaming capabilities in future releases.
---
*If you are interested in contributing or implementing streaming parsers for SICK, your collaboration is highly encouraged.*
https://github.com/7mind/sick