(comments)

Original link: https://news.ycombinator.com/item?id=38150833

Overall, using JSON in distributed systems seems to pose unique challenges due to factors such as resource constraints, performance requirements, and the limitations imposed by web clients. To overcome these, strategies can be adopted such as minimizing allocation costs, avoiding dynamic memory allocation, and amortizing parsing cost through streaming parsing. In addition, technologies such as Flatbuffers can offer higher throughput and fewer allocations, providing significant improvements in this regard. Nevertheless, despite the risk of producing garbled or incomplete JSON, maintaining correct JSON syntax remains essential for reliable parsing and accurate data transmission. Finally, a forgiving JSON parser/lexer can be a valuable tool for handling partially processed JSON strings.


Original text
Building a high performance JSON parser (cheney.net)
437 points by davecheney 18 hours ago | 151 comments

Looks pretty good! Even though I've written far too many JSON parsers already in my career, it's really nice to have a reference for how to think about making a reasonable, fast JSON parser, going through each step individually.

That said, I will say one thing: you don't really need an explicit tokenizer for JSON. You can get rid of the concept of tokens and integrate parsing and tokenization entirely. This is what I usually do since it makes everything simpler. It's a lot harder to do with something like the rest of ECMAScript, because there you wind up needing look-ahead (sometimes arbitrarily large look-ahead... consider arrow functions: the parameter list is mostly a subset of the grammar of a parenthesized expression. Comma is an operator, and for default values, equals is an operator. It isn't until the => does or does not appear that you know for sure!)
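To make that concrete, here's a minimal sketch in Go of a tokenizer-less JSON parser. Strings, numbers and objects are elided (and comma handling is deliberately lenient), but they all hang off the same single-byte dispatch:

    package main

    import (
        "errors"
        "fmt"
    )

    // parser walks the byte slice directly; there is no token type and no
    // separate tokenizing pass, just one dispatch per value.
    type parser struct {
        buf []byte
        pos int
    }

    var errSyntax = errors.New("syntax error")

    func (p *parser) skipSpace() {
        for p.pos < len(p.buf) {
            switch p.buf[p.pos] {
            case ' ', '\t', '\r', '\n':
                p.pos++
            default:
                return
            }
        }
    }

    // lit matches a fixed literal such as "true" and yields its value.
    func (p *parser) lit(s string, v any) (any, error) {
        if p.pos+len(s) > len(p.buf) || string(p.buf[p.pos:p.pos+len(s)]) != s {
            return nil, errSyntax
        }
        p.pos += len(s)
        return v, nil
    }

    func (p *parser) parseArray() (any, error) {
        p.pos++ // consume '['
        var out []any
        for {
            p.skipSpace()
            if p.pos < len(p.buf) && p.buf[p.pos] == ']' {
                p.pos++
                return out, nil
            }
            v, err := p.parseValue()
            if err != nil {
                return nil, err
            }
            out = append(out, v)
            p.skipSpace()
            if p.pos < len(p.buf) && p.buf[p.pos] == ',' {
                p.pos++
            }
        }
    }

    // parseValue dispatches on a single byte and recurses. Strings, numbers
    // and objects would be added as extra cases in this same switch.
    func (p *parser) parseValue() (any, error) {
        p.skipSpace()
        if p.pos >= len(p.buf) {
            return nil, errSyntax
        }
        switch p.buf[p.pos] {
        case '[':
            return p.parseArray()
        case 't':
            return p.lit("true", true)
        case 'f':
            return p.lit("false", false)
        case 'n':
            return p.lit("null", nil)
        default:
            return nil, errSyntax
        }
    }

    func main() {
        p := &parser{buf: []byte(" [true, [null, false], true] ")}
        v, err := p.parseValue()
        fmt.Println(v, err) // [true [<nil> false] true] <nil>
    }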



What line of work are you in that you've "written far too many JSON parsers already" in your career?!!!


Reasons differ. C++ is a really hard place to be. It's gotten better, but if you can't tolerate exceptions, need code that is as obviously memory-safe as possible, or need to parse incrementally (think SAX style), off-the-shelf options like jsoncpp may not fit the bill.

Handling large documents is indeed another big one. It sort-of fits in the same category as being able to parse incrementally. That said, Go has a JSON scanner you can sort of use for incremental parsing, but in practice I've found it to be a lot slower, so for large documents it's a problem.
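The incremental use I have in mind looks roughly like this with the standard library's json.Decoder, where Token yields one delimiter or value at a time (a minimal sketch; in my experience the per-token allocations are where the slowdown shows up):

    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
    )

    func main() {
        dec := json.NewDecoder(strings.NewReader(`{"name":"a","items":[1,2,3]}`))
        for {
            tok, err := dec.Token() // one json.Delim, string, number, bool or nil at a time
            if err != nil {
                break // io.EOF once the input is exhausted
            }
            fmt.Printf("%T %v\n", tok, tok)
        }
    }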

I've done a couple in hobby projects too. One time I did a partial one in Win32-style C89 because I wanted one that didn't depend on libc.



The large-document problem is often fixed by using mmap/VirtualAlloc on the file, but Boost.JSON has a streaming mode, is reasonably fast, and the license is good for pulling into anything. It's not the fastest, but it's faster than RapidJSON with the interface of nlohmann JSON. For most tasks, it does seem that most of the libraries taking a JSON document approach are wasting a lot of time/memory to get to the thing we actually want: normal data structures, not a JSON document tree. If we pull that out and parse straight to the data structures there is a lot of win in performance and memory with less/no code, just mappings. That's how I approached it at least.


> that most of the libraries taking a JSON document approach are wasting a lot of time/memory

I agree. That's the same situation as with XML/HTML. In many cases you don't really need to build a DOM or JSOM in memory if your task is just deserializing into some native structures.

This XML scanner of mine does not allocate any memory at all while parsing HTML/XML: https://www.codeproject.com/Articles/14076/Fast-and-Compact-...

It is even simpler than SAX parser.



For interesting JSON of significant size, an iterator/range interface that parses to concrete types works really well. Usually these are large arrays or JSONL-like things.


I've seen "somebody doesn't agree with the standard and we must support it" way too many times, and I've written JSON parsers because of this. (And, of course, it's easy to get some difference with the JSON standard.)

I've had problems with handling streams like the OP on basically every programming language and data-encoding language pair that I've tried. It looks like nobody ever thinks about it (I do use chunking any time I can, but sometimes you can't).

There are probably lots and lots of reasons to write your own parser.



This reminds me of my favourite quote about standards.

>The wonderful thing about standards is that there are so many of them to choose from.

And, keeping with the theme, this quote may be from Grace Hopper, Andrew Tanenbaum, Patricia Seybold or Ken Olsen.



Probably anywhere that requires parsing large JSON documents. Off the shelf JSON parsers are notoriously slow on large JSON documents.


What on Earth are you storing in JSON that this sort of performance issue becomes an issue?

How big is 'large' here?

I built a simple CRUD inventory program to keep track of one's gaming backlog and progress, and the dumped JSON of my entire 500+ game statuses is under 60kB and can be imported in under a second on decade-old hardware.

I'm having difficulty picturing a JSON dataset big enough to slow down modern hardware. Maybe Gentoo's portage tree if it were JSON encoded?



There are several that are into the GB/s of performance with various interfaces. Most are just trash for large documents and sit in the allocators far too long, but that's not required either


Not necessarily; for example, Newtonsoft is fine with multiple hundreds of megabytes if you use it correctly. But of course it depends on how large we are talking about.


Someone misunderstood the JSONParserFactory somewhere along the line.


I've written JSON parsers because in one instance we had to allow users to keep their formatting but also edit documents programmatically. At the time I couldn't find parsers that did that, but it was a while back.

In another instance, it was easier to parse into some application-specific structures, skipping the whole intermediate generic step (for performance reasons).

With JSON it's easier to convince your boss that you can actually write such a parser because the language is relatively simple (if you overlook botched definitions of basically every element...) So, for example, if the application that uses JSON is completely under your control, you may take advantage of stupid decisions made by JSON authors to simplify many things. More concretely, you can decide that there will never be more than X digits in numbers. That you will never use "null". Or that you will always put elements of the same type into "lists". Or that you will never repeat keys in "hash tables".



A person who helped me out a lot when I was learning to code wrote his own .NET JSON library because the MS provided one had a rough API and was quite slow.

His lib became the de facto JSON lib in .NET dev and, naturally, MS head-hunted him.

Fast JSON is that important these days.



The walkthrough is very nice, how to do this if you're going to do it.

If you're going for pure performance in a production environment you might take a look at Daniel Lemire's work: https://github.com/simdjson/simdjson. Or the MinIO port of it to Go: https://github.com/minio/simdjson-go.



If your JSON always looks the same you can also do better than general JSON parsers.


Andreas Fredriksson demonstrates exactly that in this video: https://vimeo.com/644068002


I really enjoyed this video even though he lost me with the SIMD code.


I like this video because there's a lot of a good actionable advice before he gets into SIMD code.


Very enjoyable video!


You might also move to something other than JSON if parsing it is a significant part of your workload.


Most of the times I’ve had to deal with JSON performance issues, it involved a 3rd party API and JSON was the only option.

If you’re building something net-new and know you’ll have these problems out the gate, something other than JSON might be feasible, but the moment some other system not in the closed loop needs to work with the data, you’re back to JSON and any associated perf issues.



I wonder: can fast, special-case JSON parsers be dynamically autogenerated from JSON Schemas?

Perhaps some macro-ridden Rust monstrosity that spits out specialised parsers at compile time, dynamically…



For JSON Schema specifically there are some tools like go-jsonschema[1], but I've never used them personally. You can also use something like ffjson[2] in Go to generate a static serialize/deserialize function based on a struct definition.

[1] https://github.com/omissis/go-jsonschema [2] https://github.com/pquerna/ffjson



Hey, go-jsonschema is my project. (Someone else just took over maintaining it, though.) It still relies on the standard Go parser; all it does is generate structs with the right types and tags.


Doesn't the serde crate's JSON support do precisely this? It generates structs that have optionals in all the right places and with all the right types anyway. Seems like the LLVM optimiser can probably do something useful with that even if the serde feature isn't using a priori knowledge out of the schema.


A fundamental problem with JSON parsing is that it has variable-length fields that don't encode their length, so in a streaming scenario you basically need to keep resizing your buffer until the data fits. If the data is on disk and not streaming you may get away with reading ahead to find the end of the field first, but that's also not particularly fast.

Schemas can't fix that.



Only if you are using pointers/slices into the buffer as an optimisation.

Otherwise there is no need to keep a buffer of anything after it has been parsed.



I'm talking about during parsing.

Let's assume I send you a JSON object that is one very long string and nothing else. It's e.g. 1 GB in size. To know you need to allocate a 1GB buffer, you need to first scan it, and then copy it; or keep reallocating the same buffer until it fits.

It's an absurd case, but shorter strings face similar overhead.
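A minimal sketch of what that looks like on the streaming side (escapes ignored, opening quote already consumed):

    package main

    import (
        "bufio"
        "fmt"
        "strings"
    )

    // readStringBody reads the body of a JSON string from a stream. Because the
    // length isn't known up front, the buffer just has to keep growing until the
    // closing quote arrives - for a 1 GB value that means a series of
    // reallocations and copies, plus the final copy into a string.
    func readStringBody(r *bufio.Reader) (string, error) {
        var buf []byte
        for {
            b, err := r.ReadByte()
            if err != nil {
                return "", err
            }
            if b == '"' {
                return string(buf), nil
            }
            buf = append(buf, b) // amortized geometric growth
        }
    }

    func main() {
        s, err := readStringBody(bufio.NewReader(strings.NewReader(`hello world"`)))
        fmt.Println(s, err) // hello world <nil>
    }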



It's relatively common in D applications to use the language's compile-time capabilities to generate a parser at compile time.


Somewhat tangentially related, Fabian Iwand posted this regex prefix tree visualiser/generator last week [0], which may offer some inspiration for prototyping auto generated schemas.




You forgot to include the link?


Pydantic does that to some extent, I think.


Last time I compared the performance of various json parsers the simd one turned out to be disappointingly slow.


The fastest JSON lib in Go is the one from the company behind TikTok.




Fastest at what?


> For all sizes of json and all scenarios of usage, Sonic performs best.

The repository has benchmarks



I’m not seeing simdjson in them though? I must be missing something because the Go port of it is explicitly mentioned in the motivation[1] (not the real thing, though).

[1] https://github.com/bytedance/sonic/blob/main/docs/INTRODUCTI...



Excellent treat vector.




simdjson has not been the fastest for a long long time


What is faster? According to https://github.com/kostya/benchmarks#json nothing is.


I've recently held a talk (https://youtu.be/a7VBbbcmxyQ?si=0fGVxfc4qmKMVCXk) about github.com/romshark/jscan that I've been working on. It's a performance-oriented JSON iterator / tokenizer you might want to take a look at if interested in high performance zero allocation JSON parsing in Go.


My own lessons from writing fast JSON parsers have a lot of language-specific things, but here are some generalisations:

Avoid heap allocations in tokenising. Have a tokeniser that is a function returning either a stack-allocated struct or an int64 token: a packed field describing the start, length, type, etc. of the token.
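Something like this, for instance (illustrative bit widths; a real library would pick its own layout):

    package token

    // Kind is stored in the low 8 bits of the packed token.
    type Kind uint8

    const (
        String Kind = iota
        Number
        Delim
        Bool
        Null
    )

    // Token packs kind, start offset and length into a single int64, so the
    // tokeniser can return it by value with no heap allocation at all.
    // Layout here: 8 bits kind, 32 bits start, 24 bits length.
    type Token int64

    func Pack(k Kind, start, length int) Token {
        return Token(int64(k) | int64(start)<<8 | int64(length)<<40)
    }

    func (t Token) Kind() Kind  { return Kind(t & 0xff) }
    func (t Token) Start() int  { return int((t >> 8) & 0xffffffff) }
    func (t Token) Length() int { return int((t >> 40) & 0xffffff) }

The parser then slices the original buffer with Start/Length when it actually needs the bytes.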

Avoid heap allocations in parsing: support a getString(key String) style interface for clients that want to chop up a buffer.

For deserialising to objects where you know the fields at compile time, generally generate a switch on key length before comparing string values.
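For example, the generated matcher for a struct with hypothetical fields id, name and email might look like this in Go (the byte-slice-to-string comparison here doesn't allocate):

    package decode

    // matchField is the kind of thing a code generator might emit: switch on
    // the key length first, then do at most one byte comparison, with no map
    // lookup and no allocation per key.
    func matchField(key []byte) int {
        switch len(key) {
        case 2:
            if string(key) == "id" {
                return 0
            }
        case 4:
            if string(key) == "name" {
                return 1
            }
        case 5:
            if string(key) == "email" {
                return 2
            }
        }
        return -1 // unknown key: caller skips the value
    }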

My experience in data pipelines that process lots of json is that choice of json library can be a 3-10x performance difference and that all the main parsers want to allocate objects.

If the classes you are serialising or deserialising are known at compile time then Jackson (Java) does a good job, but you can get a 2x boost with careful coding and profiling.

Whereas if you are parsing arbitrary JSON, all the mainstream parsers want to do lots of allocations that a more intrusive parser you write yourself can avoid, and you can make massive performance wins if you are processing thousands or millions of objects per second.



In n2[1] I needed a fast tokenizer and had the same "garbage factory" problem, which is basically that there's a set of constant tokens (like json.Delim in this post) and then strings which cause allocations.

I came up with what I think is a kind of neat solution, which is that the tokenizer is generic over some T and takes a function from byteslice to T and uses T in place of the strings. This way, when the caller has some more efficient representation available (like one that allocates less) it can provide one, but I can still unit test the tokenizer with the identity function for convenience.

In a sense this is like fusing the tokenizer with the parser at build time, but the generic allows layering the tokenizer such that it doesn't know about the parser's representation.
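A Go sketch of the same shape (not n2's actual code; escapes, numbers and error handling are elided):

    package lex

    // Token is either a fixed delimiter or a caller-defined value of type T.
    type Token[T any] struct {
        Delim byte // '{', '}', '[', ']', ':', ',' or 0 for a string token
        Str   T
    }

    // Tokenizer is generic over T: the caller supplies makeStr to turn a raw
    // byte slice into whatever representation it wants (an interned symbol, an
    // arena index, or just a plain string in unit tests).
    type Tokenizer[T any] struct {
        buf     []byte
        pos     int
        makeStr func([]byte) T
    }

    func New[T any](buf []byte, makeStr func([]byte) T) *Tokenizer[T] {
        return &Tokenizer[T]{buf: buf, makeStr: makeStr}
    }

    // Next returns the next token, or false at end of input.
    func (t *Tokenizer[T]) Next() (Token[T], bool) {
        for t.pos < len(t.buf) && t.buf[t.pos] == ' ' {
            t.pos++
        }
        if t.pos >= len(t.buf) {
            return Token[T]{}, false
        }
        switch c := t.buf[t.pos]; c {
        case '{', '}', '[', ']', ':', ',':
            t.pos++
            return Token[T]{Delim: c}, true
        case '"':
            t.pos++
            start := t.pos
            for t.pos < len(t.buf) && t.buf[t.pos] != '"' {
                t.pos++
            }
            s := t.makeStr(t.buf[start:t.pos])
            t.pos++ // closing quote
            return Token[T]{Str: s}, true
        default:
            return Token[T]{}, false
        }
    }

In tests makeStr can simply be func(b []byte) string { return string(b) }, while the real parser can pass something that resolves into its own representation without allocating.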

[1] https://github.com/evmar/n2



I've taken a very similar approach and built a GraphQL tokenizer and parser (amongst many other things) that's also zero memory allocations and quite fast. In case you'd like to check out the code: https://github.com/wundergraph/graphql-go-tools


How big of an issue is this for GQL servers where all queries are known ahead of time (allowlist) - i.e. you can cache/memoize the AST parsing and this is only a perf issue for a few minutes after the container starts up?

Or does this bite us in other ways too?



I've been building GraphQL API gateways / routers for 5+ years now. It would be nice if trusted documents or persisted operations were the default, but the reality is that a lot of people want to open up their GraphQL to the public. For that reason we've built a fast parser, validator, normalizer and many other things to support these use cases.


what does this bring over goccy's json encoder?


It's possible to improve over the standard library with better API design, but it's not really possible to do a fully streaming parser that doesn't half fill structures before finding an error and bailing out in the middle, which is another explicit design constraint for the standard library.


Great to see a shout out to Phil Pearl! Also worth looking at https://github.com/bytedance/sonic


I remember this JSON benchmark page from RapidJSON [1].

[1] https://rapidjson.org/md_doc_performance.html



"It’s unrealistic to expect to have the entire input in memory" -- wrong for most applications


Most applications read JSONs from networks, where you have a stream. Buffering and fiddling with the whole request in memory increases latency by a lot, even if your JSON is smallish.


Most(most) JSON payloads are probably much smaller than many buffer sizes so just end up all in memory anyway.


On a carefully built WebSocket server you would ensure your WebSocket messages all fit within a single MTU.


Yes but for applications where you need to do ETL style transformations on large datasets, streaming is an immensely useful strategy.

Sure you could argue go isn’t the right tool for the job but I don’t see why it can’t be done with the right optimizations like this effort.



If performance is important why would you keep large datasets in JSON format?


Because you work at or for some bureaucratic MegaCorp, that does weird things with no real logic behind it other than clueless Dilbert managers making decisions based on LinkedIn blogs. Alternatively desperate IT consultants trying to get something to work with too low of a budget and/or no access to do things the right way.

Be glad you have JSON to parse, and not EDI, some custom delimited data format (with no or outdated documentation) - or, shudders, you work in the airline industry with SABRE.



sometimes it's not your data


Usually because the downstream service or store needs it


If you're building a library you either need to explicitly call out your limits or do streaming.

I've pumped gigs of JSON data, so a streaming parser is appreciated. Plus streaming shows the author is better at engineering and is aware of the various use cases.

Memory is not cheap or free except in theory.



I guess it's all relative. Memory is significantly cheaper if you get it anywhere but on loan from a cloud provider.


RAM is always expensive no matter where you get it from.

Would you rather do two hours of work or force thousands of people to buy more RAM because your library is a memory hog?

And on embedded systems RAM is at a premium. More RAM = more cost.



Here people confidently keep repeating "streaming JSON". What do you mean by that? I'm genuinely curious.

Do you mean XML SAX-like interface? If so, how do you deal with repeated keys in "hash tables"? Do you first translate JSON into intermediate objects (i.e. arrays, hash-tables) and then transform them into application-specific structures, or do you try to skip the intermediate step?

I mean, streaming tokens is kind of worthless on its own. If you are going for a SAX-like interface, you want to be able to go all the way with streaming (i.e. no layer of the code that reads JSON "accumulates" data (especially not indefinitely) before it can be sent to the layer above).



> If so, how do you deal with repeated keys in "hash tables"?

depending on the parser, behaviour might differ. But looking at https://stackoverflow.com/questions/21832701/does-json-synta... , it seems like the "best" option is to have 'last key wins' as the resolution.

This works fine under a SAX like interface in a streaming JSON parser - your 'event handler' code will execute for a given key, and a 2nd time for the duplicate.



last key wins is terrible advice and has serious security implications.

see https://bishopfox.com/blog/json-interoperability-vulnerabili... or https://www.cvedetails.com/cve/CVE-2017-12635/ for concrete examples where this treatment causes security issues.

the https://datatracker.ietf.org/doc/html/rfc7493 defines a more strict format where duplicate keys are not allowed.



If you can live with "fits on disk" mmap() is a viable option? Unless you truly need streaming (early handling of early data, like a stream of transactions/operations from a single JSON file?)


In general, JSON comes over the network, so MMAP won't really work unless you save to a file. But then you'll run out of disk space.

I mean, you have a 1k, 2k, 4k buffer. Why use more, because it's too much work?





"But there is a better trick that we can use that is more space efficient than this table, and is sometimes called a computed goto."

From 1989:

https://raw.githubusercontent.com/spitbol/x32/master/docs/sp...

"Indirection in the Goto field is a more powerful version of the computed Goto which appears in some languages. It allows a program to quickly perform a multi-way control branch based on an item of data."



You might want to take a look at https://github.com/ohler55/ojg. It takes a different approach with a single pass parser. There are some performance benchmarks included on the README.md landing page.


I remember reading a SO question which asks for a C library to parse JSON. A comment was like - C developers won't use a library for JSON, they will write one themselves.

I don't know how "true" that comment is but I thought I should try to write a parser myself to get a feel :D

So I wrote one, in Python - https://arunmani.in/articles/silly-json-parser/

It was a delightful experience though, writing and testing to break your own code with a wide variety of inputs. :)



> I remember reading a SO question which asks for a C library to parse JSON. A comment was like - C developers won't use a library for JSON, they will write one themselves.

> I don't know how "true" that comment is

Either way it's a good way to get a pair of quadratic loops in your program: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...



I wrote a small JSON parser in C myself which I called jsoncut. It just cuts out a certain part of a json file. I deal with large JSON files, but want only to extract and parse certain parts of it. All libraries I tried parse everything, use a lot of RAM and are slow.

Link here, if interested to have a look: https://github.com/rgex/jsoncut



The words you’re looking for are SAX-like JSON parser or streaming json parser. I don’t know if there’s any command line tools like the one you wrote that use it though to provide a jq-like interface.


I tried JQ and other command line tools, all were extremely slow and seemed to always parse the entire file.

My parser just reads the file byte by byte until it finds the target, then outputs the content. When that's done it stops reading the file, meaning that it can be extremely fast when the targeted information is at the beginning of the JSON file.



You're still describing a SAX parser (i.e. streaming). jq doesn't use a SAX parser because it's a multi-pass document editor at its core, hence why I said "jq-like" in terms of supporting a similar syntax for single-pass queries. If you used RapidJSON's SAX parser in the body of your custom code (returning false once you found what you're looking for), I'm pretty sure it would significantly outperform your custom hand-rolled code. Of course, your custom code is very small with no external dependencies and presumably fast enough, so tradeoffs.


I guess there are only so many ways to write a JSON parser, because one I wrote on a train in Python looks very similar!

I thought it would be nice and simple, but it was still even simpler than I expected. It's a fantastic spec if you need to throw a parser together yourself, without massive performance considerations.



Good for you but what does this have to do with the article?


This is fantastically useful.

Funny enough I stumbled upon your article just yesterday through google search.



Writing a json parser is definitely an educational experience. I wrote one this summer for my own purposes that is decently fast: https://github.com/nwpierce/jsb


Can someone explain to me why JSON can't have comments or trailing commas? I really hope the performance gains are worth it, because I've lost 100s of man-hours to those things, and had to resort to stuff like this in package.json:

  "IMPORTANT: do not run the scripts below this line, they are for CICD only": true,


It can't have comments because it didn't originally have comments, so now it's too late. And it originally didn't have comments because Douglas Crockford thought they could be abused for parsing instructions.

As for not having trailing commas, it's probably a less intentional bad design choice.

That said, if you want commas and comments, and control the parsers that will be used for your JSON, then use JSONC (JSON with comments). VSCode for example does that for its JSON configuration.



JSONC also supports trailing commas. It is, in effect, "JSON with no downsides".

TOML/Yaml always drive me batty with all their obscure special syntax. Whereas it's almost impossible to look at a formatted blob of JSON and not have a very solid understanding of what it represents.

The one thing I might add is multiline strings with `'s, but even that is probably more trouble than it's worth, as you immediately start going down the path of "well let's also have syntax to strip the indentation from those strings, maybe we should add new syntax to support raw strings, ..."



Amusingly, it originally did have comments. Removing comments was the one change Crockford ever made to the spec[1].

[1] https://web.archive.org/web/20150105080225/https://plus.goog... (thank you Internet Archive for making Google’s social network somewhat accessible and less than useless)



Does JSONC have a specification or formal definition? People have suggested[1] using JSON5[2] instead for that reason

[1] https://github.com/microsoft/vscode/issues/100688

[2] https://spec.json5.org/



Unfortunately, JSON5 says keys can be ES5 IdentifierName[1]s, which means you must carry around Unicode tables. This makes it a non-option for small devices, for example. (I mean, not really, you technically could fit the necessary data and code in low single-digit kilobytes, but it feels stupid that you have to. Or you could just not do that but then it’s no longer JSON5 and what was the point of having a spec again?)

[1] https://es5.github.io/x7.html#x7.6



Or take the SQLite route, if you don't need to reject strictly invalid JSON5. When they extended SQLite's JSON parser to support JSON5, they specifically relaxed the definition of unquoted keys (in a compatible way) to avoid Unicode tables[1]:

> Strict JSON5 requires that unquoted object keys must be ECMAScript 5.1 IdentifierNames. But large unicode tables and lots of code is required in order to determine whether or not a key is an ECMAScript 5.1 IdentifierName. For this reason, SQLite allows object keys to include any unicode characters greater than U+007f that are not whitespace characters. This relaxed definition of "identifier" greatly simplifies the implementation and allows the JSON parser to be smaller and run faster.

[1]: https://www.sqlite.org/json1.html



Yeah, definitely for small devices. For things like VS Code's configuration file format (the parent comment) or other such use cases, I don't see a problem


If you want commas and comments, another term to search for is JWCC, which literally stands for "JSON With Commas and Comments". HuJSON is another name for it.


Yeah if you could get NPM to allow JSONC in package.json, that'd be great.


It’s not that it “can’t”, more like it “doesn’t”. Douglas Crockford prioritized simplicity when specifying JSON. Its BNF grammar famously fits on one side of a business card.

Other flavors of JSON that include support for comments and trailing commas exist, but they are reasonably called by different names. One of these is YAML (mostly a superset of JSON). To some extent the difficulties with YAML (like unquoted ‘no’ being a synonym for false) have vindicated Crockford’s priorities.



I don't know the historic reason why it wasn't included in the original spec, but at this point it doesn't matter. JSON is entrenched and not going to change.

If you want comments, you can always use jsonc.



There's no real reason for that. It just happened like that. These aren't the only bad decisions made by JSON author, and not the worst either.

What you can do is: write comments using pound sign, and rename your files to YAML. You will also get a benefit of a million ways of writing multiline strings -- very confusing, but sometimes useful.



These are always interesting to read because you get to see runtime quirks. I'm surprised there was so much function call overhead, for example. And it's interesting that you can bypass range checking.

The most important thing, though, is the process: measure then optimize.



How is this compared to Daniel Lemire's simdjson? https://github.com/simdjson/simdjson


Wish I wasn't 4 or 5 uncompleted projects deep right now and had the time to rewrite a monkey parser using all these tricks.




I'm surprised there's no way to say 'I really mean it, inline this function' for the stuff that didn't inline because it was too big.

The baseline whitespace count/search operation seems like it would be MUCH faster if you vectorized it with SIMD, but I can understand that being out of scope for the author.



Of course you can force-inline.


Obviously you can manually inline functions. That's what happened in the article.

The comment is about having a directive or annotation to make the compiler inline the function for you, which Go does not have. IMO, the pre-inline code was cleaner to me. It's a shame that the compiler could not optimize it.

There was once a proposal for this, but it's really against Go's design as a language.

https://github.com/golang/go/issues/21536



You can in any systems programming language.

Go is mostly a toy language for cloud people.



> toy language

You may be surprised to hear that Go is used in a ton of large scale critical systems.



I don't consider cloud technology a critical system.


My favorite bit about this is his reference to John Ousterhout, Define errors out of existence. youtu.be/bmSAYlu0NcY?si=WjC1ouEN1ad2OWjp&t=1312

Note the distinct lack of:

    if err != nil {


Maybe I overlooked something, but the author keeps repeating that they wrote a "streaming" parser without ever explaining what that actually means. In particular, they never explained how they deal with repeating keys in "hash tables". What does their parser do? Call the "sink" code twice with the repeated key, or wait until the entire "hash table" is read and then call the "sink" code?

In my mind, JSON is inherently inadequate for streaming because of its hierarchical structure, no length known upfront and, most importantly, repeating keys. You could probably make a subset of JSON more streaming-friendly, but at that point, why bother? I mean, if the solution is to modify JSON, then a better solution would be something that's not JSON at all.



> Any (useful) JSON decoder code cannot go faster that this.

That line feels like a troll. Cunningham’s Law in action.

You can definitely go faster than 2 Gb/sec. In a word, SIMD.



we could re-frame by distinguishing problem statements from implementations

Problem A: read a stream of bytes, parse it as JSON

Problem B: read a stream of bytes, count how many bytes match a JSON whitespace character

Problem B should require fewer resources* to solve than problem A. So in that sense problem B is a relaxation of problem A, and a highly efficient implementation of problem B should be able to process bytes much more efficiently than an "optimal" implementation of problem A.

So in this sense, we can probably all agree with the author that counting whitespace bytes is an easier problem than the full parsing problem.

We're agreed that the author's implementation (half a page of go code that fits on a talk slide) to solve problem B isn't the most efficient way to solve problem B.

I remember reading somewhere the advice that to set a really solid target for benchmarking, you should avoid measuring the performance of implementations and instead try to estimate a theoretical upper bound on performance, based on say a simplified model of how the hardware works and a simplification of the problem -- that hopefully still captures the essence of what the bottleneck is. Then you can compare any implementation to that (unreachable) theoretical upper bound, to get more of an idea of how much performance is still left on the table.

* for reasonably boring choices of target platform, e.g. amd64 + ram, not some hypothetical hardware platform with surprisingly fast dedicated support for JSON parsing and bad support for anything else.



Everything you said is totally reasonable. I'm a big fan of napkin math and theoretical upper bounds on performance.

simdjson (https://github.com/simdjson/simdjson) claims to fully parse JSON on the order of 3 GB/sec. Which is faster than OP's Go whitespace parsing! These tests are running on different hardware so it's not apples-to-apples.

The phrase "cannot go faster than this" is just begging for a "well ackshully". Which I hate to do. But the fact that there is an existence proof of Problem A running faster in C++ SIMD than OP's Probably B scalar Go is quite interesting and worth calling out imho. But I admit it doesn't change the rest of the post.



[flagged]



> “Json" and "Go" seem antithetical in the same sentence as "high performance" to me

Absolute highest performance is rarely the highest priority in designing a system.

Of course we could design a hyper-optimized, application specific payload format and code the deserializer in assembly and the performance would be great, but it wouldn’t be useful outside of very specific circumstances.

In most real world projects, performance of Go and JSON is fine and allow for rapid development, easy implementation, and flexibility if anything changes.

I don’t think it’s valid to criticize someone for optimizing within their use case.

> I feel like it should deserve a mention of what the stack is, otherwise there is no reference point.

The article clearly mentions that this is a GopherCon talk in the header. It was posted on Dave Cheney’s website, a well-known Go figure.

It’s clearly in the context of Go web services, so I don’t understand your criticisms. The context is clear from the article.

The very first line of the article explains the context:

> This talk is a case study of designing an efficient Go package.



Even sticking within the confines of json there's low hanging fruit, e.g. if your data is typically like:

    [{"k1": true, "k2": [2,3,4]}, {"k1": false, "k2": []}, ...]
You can amortize the overhead of the keys by turning this from an array of structs (AoS) into a struct of arrays (SoA):

    {"k1": [true, false, ...], "k2": [[2,3,4], [], ...]}
Then you only have to read "k1" and "k2" once, instead of once per record. Presumably there will be the odd record that contains something like {"k3": 0} but you can use mini batches of SoA and tune their size according to your desired latency/throughput tradeoff.

Or if your data is 99.999% of the time just pairs of k1 and k2, turn them into tuples:

    {"k1k2": [true,[2,3,4],false,[], ...]}
And then 0.001% of the time you send a lone k3 message:

    {"k3": 2}
Even if your endpoints can't change their schema, you can still trade latency for throughput by doing the SoA conversion, transmitting, then converting back to AoS at the receiver. Maybe worthwhile if you have to forward the message many times but only decode it once.


This is an article about optimizing a JSON parser in Go.

As always, try to remember that people usually aren't writing (or posting their talks) specifically for an HN audience. Cheney clearly has an audience of Go programmers; that's the space he operates in. He's not going to title his post, which he didn't make for HN, just to avoid a language war on the threads here.

It's our responsibility to avoid the unproductive language war threads, not this author's.



JSON is often the format of an externally provided data source, and you don't have a choice.

And whatever language you're writing in, you usually want to do what you can to maximize performance. If your JSON input is 500 bytes it probably doesn't matter, but if you're intaking a 5 MB JSON file then you can definitely be sure the performance does.

What more do you need to know about "the stack" in this case? It's whenever you need to ingest large amounts of JSON in Go. Not sure what could be clearer.



"Antiethical".

Is it true that mindless bashing Go became some kind of a new religion in some circles?



Exactly, it's not too hard to implement in C. The one I made never copied data; instead it saved the pointer/length to the data. The user only had to memory-map the file (or equivalent) and pass that data into the parser. The only memory allocation was for the JSON nodes.

This way the user only paid the parsing tax (decoding doubles, etc.) for the data they actually used.

You hit the nail on the head



The linked article is a GopherCon talk.

The first line of the article explains the context of the talk:

> This talk is a case study of designing an efficient Go package.

The target audience and context are clearly Go developers. Some of these comments are focusing too much on the headline without addressing the actual article.



Yup and if your implementation uses a hashmap for object key -> value lookup, then I recommend allocating the hashmap after parsing the object not during to avoid continually resizing the hashmap. You can implement this by using an intrusive linked list to track your key/value JSON nodes until the time comes to allocate the hashmap. Basically when parsing an object 1. use a counter 'N' to track the number of keys, 2. link the JSON nodes representing key/value pairs into an intrusive linked list, 3. after parsing the object use 'N' to allocate a perfectly sized hashmap in one go. You can then iterate over the linked list of JSON key/value pair nodes adding them to the hashmap. You can use this same trick when parsing JSON arrays to avoid continually resizing a backing array. Alternatively, never allocate a backing array and instead use the linked list to implement an iterator.
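A minimal sketch of that shape in Go (illustrative types, not from any particular library):

    package parse

    // kv is one parsed key/value pair; pairs of an object are chained through
    // next (an intrusive list) while the object is still being parsed.
    type kv struct {
        key   string
        value any
        next  *kv
    }

    // buildObject runs after the closing '}': n is the number of pairs counted
    // during parsing, so the map is allocated exactly once at its final size
    // instead of being grown (and rehashed) as keys arrive.
    func buildObject(head *kv, n int) map[string]any {
        m := make(map[string]any, n)
        for p := head; p != nil; p = p.next {
            m[p.key] = p.value
        }
        return m
    }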


Like how https://github.com/zserge/jsmn works. I thought it would be neat to have such a parser for https://github.com/vshymanskyy/muon


I've made a JSON parser which works like this too. No dynamic memory allocations, similar to the JSMN parser but stricter to the specification.

Always nice to be in control over memory :)

https://sr.ht/~cryo/cj



> The user only had to Memory Map the file (or equivalent)

Having done this myself, it's a massive cheat code because your bottleneck is almost always i/o and memory mapped i/o is orders of magnitude faster than sequential calls to read().

But that said it's not always appropriate. You can have gigabytes of JSON to parse, and the JSON might be available over the network, and your service might be running on a small node with limited memory. Memory mapping here adds quite a lot of latency and cost to the system. A very fast streaming JSON decoder is the move here.



> memory mapped i/o is orders of magnitude faster than sequential calls to read()

That’s not something I’ve generally seen. Any source for this claim?

> You can have gigabytes of JSON to parse, and the JSON might be available over the network, and your service might be running on a small node with limited memory. Memory mapping here adds quite a lot of latency and cost to the system

Why does mmap add latency? I would think that mmap adds more latency for small documents because the cost of doing the mmap is high (cross CPU TLB shoot down to modify the page table) and there’s no chance to amortize. Relatedly, there’s minimal to no relation between SAX vs DOM style parsing and mmap - you can use either with mmap. If you’re not aware, you do have some knobs with mmap to hint to the OS how it’s going to be used although it’s very unwieldy to configure it to work well.



Experience? Last time I made that optimization it was 100x faster, ballpark. I don't feel like benchmarking it right now, try yourself.

The latency comes from the fact you need to have the whole file. The use case I'm talking about is a JSON document you need to pull off the network because it doesn't exist on disk, might not fit there, and might not fit in memory.



> Experience? Last time I made that optimization it was 100x faster, ballpark. I don't feel like benchmarking it right now, try yourself.

I have. Many times. There's definitely not a 100x difference given that normal file I/O can easily saturate NVMe throughput. I'm sure it's possible to build a repro showing a 100x difference, but you have to be doing something intentionally to cause that (e.g. using a very small read buffer so that you're doing enough syscalls that it shows up in a profile).

> The latency comes from the fact you need to have the whole file

That's a whole other matter. But again, if you're pulling it off the network, you usually can't mmap it anyway unless you're using a remote-mounted filesystem (which will add more overhead than mmap vs buffered I/O).



I think you misunderstood my point, which was to highlight exactly when mmap won't work....


In my experience mmap is at best 50% faster compared to good pread usage on Linux and MacOS.


I also really like this paradigm. It’s just that in old crusty null-terminated C style this is really awkward because the input data must be copied or modified. But it’s not an issue when using slices (length and pointer). Unfortunately most of the C standard library and many operating system APIs expect that.

I’ve seen this referred to as a pull parser in a Rust library? (https://github.com/raphlinus/pulldown-cmark)



Did you publish the code somewhere? I'd be interested in reading it.


JSON is probably the fastest serialization format to produce and parse, which is also safe for public use, compared to binary formats which often have fragile, highly specific and vulnerable encoding as they're directly plopped into memory and used as-is (i.e. they're not parsed at all, it's just two computers exchanging memory dumps).

Compare it with XML for example, which is a nightmare of complexity if you actually follow the spec and not just make something XML-like.

We have some formats which try to walk the boundary between safe/universal and fast like ASN.1 but those are obscure at best.



I prefer msgpack if the data contains a lot of numeric values. Representing numbers as strings like in JSON can blow up the size and msgpack is usually just as simple to use.


Did you even open the article? The following is literally in the first paragraph:

> This package offers the same high level json.Decoder API but higher throughput and reduced allocations



How does that contradict what the parent poster says? I think it's very weird to call something "high performance" when it looks like it's maybe 15-20% of the performance of a simdjson in c++. This is not "going from normal performance to high performance", this going from "very subpar" to "subpar"


Par is different for different stacks. It's reasonable for someone to treat their standard library's JSON parser as "par", given that that's the parser that most of their peers will be using, even if there are faster options that are commonly used in other stacks.


Ok but how many teams are building web APIs in C++?


I worked with a guy who did this. It was fast, but boy howdy was it not simple.


Because even in C++ people don't use SIMD JSON parsers; most projects use RapidJSON, which Go is on par with.


> "Json" and "Go" seem antithetical in the same sentence as "high performance" to me. As long as we are talking about _absolute performance_.

Even just "Json" is problematic here as wire protocol for absolute performance no matter what will be programming language.



I think people say that because they give disproportionate weight to the fact it's text-based, while ignoring how astoundingly simple and linear it is to write and read.

The only way to nudge the needle is to start exchanging direct memory dumps, which is what ProtoBuff and the like do. But this is clearly only for very specific use.



> while ignoring how astoundingly simple and linear it is to write and read.

The code may be simple, but you have lots of performance penalties: resolving field keys, and constructing complicated data structures through memory allocation, which is expensive.

> to start exchanging direct memory dumps, which is what ProtoBuff and the like do

Protobuf actually does parsing; it is just a binary format. What you're describing is more like FlatBuffers.

> But this is clearly only for very specific use.

yes, specific use is high performance computations )



Resolving what keys? JSON has key/value sequences ("objects"), but what you do with them is entirely up to you. In a streaming reader you can do your job without ever creating maps or hydrating objects.

Plus no one makes people use objects in JSON. If you can send a tuple of fields as an array... then send an array.



> In a streaming reader you can do your job without ever creating maps or dehydrating objects.

but you need some logic to check that a key is one you want to process, and also to do various type transformations (e.g. JSON string to long).

> Plus no one makes people use objects in JSON. If you can send a tuple of fields as an array... then send an array.

then this shouldn't be called JSON parsing anymore.



Nowadays I am more interested in a "forgiving" JSON/YAML parser that would recover from LLM errors. Is there such a thing?


Perhaps not quite what you're asking for, but along the same lines there's this "incomplete JSON" parser, which takes a string of JSON as it's coming out of an LLM and parses it into as much data as it can get. Useful for building streaming UIs; for instance, it is used on https://rexipie.com quite extensively.

https://gist.github.com/JacksonKearl/6778c02bf85495d1e39291c...

Some example test cases:

    { input: '[{"a": 0, "b":', output: [{ a: 0 }] },
    { input: '[{"a": 0, "b": 1', output: [{ a: 0, b: 1 }] },

    { input: "[{},", output: [{}] },
    { input: "[{},1", output: [{}, 1] },
    { input: '[{},"', output: [{}, ""] },
    { input: '[{},"abc', output: [{}, "abc"] },
Work could be done to optimize it, for instance adding streaming support. But the cycles consumed either way are minimal for LLM-output-length-constrained JSON.

Fun fact: as best I can tell, GPT-4 is entirely unable to synthesize code to accomplish this task. Perhaps that will change as this implementation is made public, I do not know.



I feel like trying to infer valid JSON from invalid JSON is a recipe for garbage. You’d probably be better off doing a second pass with the “JSON” through the LLM but, as the sibling commenter said, at this point even the good JSON may be garbage …


The jsonrepair tool https://github.com/josdejong/jsonrepair might interest you. It's tailored to fix JSON strings.

I've been looking into something similar for handling partial JSON, where you only have the first n chars of a document. This is common with LLM streamed outputs aimed at reducing latency. If one knows the JSON schema ahead of time, then one can start processing these first fields before the remaining data has fully loaded. If you have to wait for the whole thing to load there is little point in streaming.

Was looking for a library that could do this parsing.



See my sibling comment :)


If the LLM did such a bad job that the syntax is wrong, do you really trust the data inside?

Forgiving parsers/lexers are common in language compilers for languages like rust or C# or typescript, you may want to investigate typescript in particular since it's applicable to JSON syntax. Maybe you could repurpose their parser.



halloween was last week





