病态:为类JSON数据结构设计的索引去重二进制存储。
Sick: Indexed deduplicated binary storage for JSON-like data structures

原始链接: https://github.com/7mind/sick

## SICK:一种用于JSON类似数据的有效二进制格式 SICK是一个系统,用于使用高效、索引的二进制格式(EBA – 高效二进制聚合)存储和处理JSON类似数据。与需要完整文件读取和复杂AST创建的传统JSON解析不同,SICK允许即时访问特定数据、在多个结构中进行去重,以及潜在的流式更新。 核心思想是将JSON结构扁平化为唯一值表,并通过索引进行引用。这使得流式编码和解码成为可能,克服了JSON作为Type-2语法语言的固有局限性。虽然编码过程比标准JSON更复杂,但它避免了解析大型、嵌套JSON文件的性能瓶颈。 SICK扩展了JSON的功能,支持循环引用、多文件去重和自定义数据类型。目前C#和Scala中存在实现,还有一个基本的JavaScript版本,并且已被用于具有大量用户基础的生产应用程序中。未来的开发重点是添加完整的流式传输功能,并通过社区贡献鼓励更广泛的应用。

## Sick:用于JSON类似数据的有效二进制存储 “Sick”是一个新的库,提供索引、去重后的二进制存储,用于JSON类似的数据结构,旨在实现高效的解析,尤其是在处理大型文件时。虽然它被描述为适用于JSON,但其核心优势在于其独特的二进制表示形式,专为速度和紧凑性而设计。 一个关键的限制是对象最多支持65,534个键,这是由于使用了2字节指针造成的。开发者讨论了潜在的解决方案,例如增加指针大小,但承认这会带来效率方面的权衡。 许多评论者分享了在实际应用中需要超过此限制的经验,例如存储Redis数据或本地化映射。 该库优先考虑小型对象的紧凑表示和快速访问,假设较大的结构可以外部分块。它不一定关注通用的JSON流式传输,而是关注在二进制格式*内*高效地访问数据。虽然一些人认为JSON的无序特性使得该项目名称不恰当,但创建者强调这是一种JSON数据的特定*表示形式*。 该项目正在积极开发中,创建者欢迎贡献,并会根据内部需求或客户请求考虑语言绑定。SQLite等替代方案已被考虑过,但由于目标用例的尺寸和性能问题,被认为不太合适。
相关文章

原文

Build Latest Release Maven Central Latest version

SICK is an approach to handle JSON-like structures and various libraries implementing it.

SICK allows you to achieve the following:

  1. Store JSON-like data in efficient indexed binary form
  2. Avoid reading and parsing whole JSON files and access only the data you need just in time
  3. Store multiple JSON-like structures in one deduplicating storage
  4. Implement perfect streaming parsers for JSON-like data
  5. Efficiently stream updates for JSON-like data

The tradeoff for these benefits is somehow more complicated and less efficient encoder.

JSON has a Type-2 grammar and requires a pushdown automaton to parse it. So, it's not possible to implement efficient streaming parser for JSON. Just imagine a huge hierarchy of nested JSON objects: you won't be able to finish parsing the top-level object until you process the whole file.

JSON is frequently used to store and transfer large amounts of data and these transfers tend to grow over time. Just imagine a typical JSON config file for a large enterprise product.

The non-streaming nature of almost all the JSON parsers requires a lot of work to be done every time you need to deserialize a huge chunk of JSON data: you need to read it from disk, parse it in memory into an AST representation, and, usually, map raw JSON tree to object instances. Even if you use token streams and know the type of your object ahead of time you still have to deal with the Type-2 grammar.

This may be very inefficient and causes unnecessary delays, pauses, CPU activity and memory consumption spikes.

Let's assume that we have a small JSON:

[
    {"some key": "some value"},
    {"some key": "some value"},
    {"some value": "some key"},
]

Let's build a table for every unique value in our JSON :

Type index Value Is Root
string 0 "some key" No
string 1 "some value" No
object 0 [string:0, string:1] No
object 1 [string:1, string:0] No
array 0 [object:0, object:0, object:1] Yes (file.json)

We just built a flattened and deduplicated version of our initial JSON structure.

Such representation allows us to do many different things, for example we may stream our table:

string:0 = "some key"
string:1 = "some value"

object:0.size = 2
object:0[string:0] = string:1
object:1[string:1] = string:0

array:0.size = 2
array:0[0] = object:0
array:0[1] = object:1

string:2 = "file.json"

root:0=array:0,string:2

This particular encoding is inefficient but it's streamable and, moreover, we can add removal message into it thus supporting arbitrary updates:

array:0[0] = object:1
array:0[1] = remove

There is an interesting observation: when a stream does not contain removal entries it can be safely reordered.

Also this representation eliminates many cases where full accumulation is required. Obviously, not all of them, the receiver still may need to accumulate the entries in a buffer until it can sort them out.

Binary format: EBA (Efficient Binary Aggregate)

We may note that the only complex data structures in our "Value" column are lists and (type, index) pairs. Let's call such pairs "references".

A reference can be represented as a pair of integers, so it would have a fixed byte length.

A list of references can be represented as an integer storing list length followed by all the references in their binary form. Let's note that such binary structure is indexed, once we know the index of an element we want to access we can do it immediately.

A list of any fixed-size scalar values can be represented the same way.

A list of variable-size values (e.g. a list of strings) can be represented the following way:

  {strings count}{list of string offsets}{all the strings concatenated}

So, ["a", "bb", "ccc"] would become something like 3 0 2 3 a b bb ccc without spaces.

An important fact is that this encoding is indexed too and it can be reused to store any lists of variable-length data.

TODO: explain the overall EBA structure format, including tables, etc

Additional capabilities over JSON

SICK encoding follows compositional principles of JSON (a set primitive types plus lists and dictionaries), though it is more powerful: it has "reference" type and allows you to encode custom types.

(1) It's easy to note that our table may store circular references, something JSON can't do natively:

Type index Value Is Root
object 0 [string:0, object:1] No
object 1 [string:1, object:0] No

This may be convenient in some complex cases.

(2) Also we may note, that we may happily store multiple json files in one table and have full deduplication over their content. We just need to introduce a separate attribute (is root) storing either nothing or the name of our "root entry" (JSON file).

In real implementation it's more convenient to just create a separate "root" type, the value of a root type should always be a reference to its name and a reference to the actual JSON value we encoded:

Type index Value
string 0 "some key"
string 1 "some value"
string 2 "some value"
object 0 [string:0, string,1]
object 1 [string:1, string:0]
array 0 [object:0, object:0, object:1]
root 0 [string:2, array:0]

(3) We may encode custom scalar data types (e.g. timestamps) natively just by introducing new type tags.

(4) We may even store polymorphic types by introducing new type tags or even new type references.

Currently we provide C# and Scala implementations of SICK indexed binary JSON storage. Currently the code in this repository has no streaming capabilities. That may change in the future. It's not a hard problem to add streaming support, your contributions are welcome.

Language Binary Storage Encoder Binary Storage Decoder Stream Encoder Stream Decoder Encoder AST Decoder AST
Scala Yes Yes No No Circe Circe
C# Yes Yes No No JSON.Net Custom

Also we provide basic JavaScript implementation backed by Scala.JS.

A type marker is represented as a single-byte unsigned integer. The possible values are:

Marker Name Comment Value Length (bytes) C# mapping Scala Mapping
0 TNul Equivalent to null in JSON 4, stored in the marker
1 TBit Boolean 4, stored in the marker
2 TByte Byte, 4, stored in the marker byte (unsigned) Byte (signed)
3 TShort Signed 16-bit integer 4, stored in the marker
4 TInt Signed 32-bit integer 4
5 TLng Signed 64-bit integer 8
6 TBigInt Variable, prefixed
7 TDbl 8
8 TFlt 4
9 TBigDec Variable, prefixed Custom: scale/precision/signum/unscaled quadruple in C#
10 TStr UTF-8 String Variable, prefixed
11 TArr List of array entries Variable, prefixed
12 TObj List of object entries Variable, prefixed
15 TRoot Index of the name string (4 bytes) + reference (4+1=5 bytes) 9

TODO

TODO

Array entries are just references.

TODO

Object entry skip list and KHash

TODO

TODO

Current implementation has the following limitations:

  1. Maximum object size: 65534 keys
  2. The order of object keys is not preserved
  3. Maximum amount of array elements: 2^32
  4. Maximum amount of unique values of the same type: 2^32

These limitations may be lifted by using more bytes to store offset pointers and counts on binary level. Though it's hard to imagine a real application which would need that.

  1. SICK is battle-tested and covered by "good enough" test suite which includes cross-implementation correctness tests (C#<->Scala).
  2. SICK powers several proprietary applications running on mobile devices and in the browser, some of which have large userbases (hundreds of thousands DAU).
  3. No known open source users as of Oct/2025.
  4. More implementations for various platforms are needed, 3rd party implementations are very welcome.
联系我们 contact @ memedata.com