I had to parse a database backup from Firebase, which was, remarkably, a 300GB JSON file. The database is a tree rooted at a single object, which means that any tool that attempted to stream individual objects wanted to buffer that single 300GB root object. It wasn’t enough to strip off the root either, as the really big records were arrays a couple of levels down, with a few different formats depending on the schema. For added fun, our data included some JSON serialised inside strings too.
This was a few years ago and I threw every tool and language I could at it, but they were either far too slow or buffered records larger than memory; even the fancy C++ SIMD parsers did this. I eventually got something working in Go and it was impressively fast and ran on my MacBook, but we never ended up using it as another engineer just wrote a script that read the entire database from the Firebase API record-by-record, throttled over several days, lol.
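Something along these lines is doable with Go's encoding/json on its own: Decoder.Token() hands back one token at a time and never materialises the enclosing object, so memory stays bounded by the largest single token rather than the largest record. A rough sketch, not the actual tool I wrote; the file name and the depth at which the record keys live are assumptions:

    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "log"
        "os"
    )

    func main() {
        // "backup.json" is a stand-in for the real dump path.
        f, err := os.Open("backup.json")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        dec := json.NewDecoder(f)
        depth := 0
        for {
            tok, err := dec.Token()
            if err == io.EOF {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            switch t := tok.(type) {
            case json.Delim:
                // '{' and '[' open a level, '}' and ']' close one.
                if t == '{' || t == '[' {
                    depth++
                } else {
                    depth--
                }
            case string:
                // Directly under the root, string tokens alternate between
                // keys and string values; for a tree of objects they are the
                // record keys. A real tool would descend further and rebuild
                // each record from the tokens below the depth it cares about.
                if depth == 1 {
                    fmt.Println("top-level key:", t)
                }
            }
        }
    }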
> Utf8JsonReader is a high-performance, low allocation, forward-only reader for UTF-8 encoded JSON text, read from a ReadOnlySpan<byte> or ReadOnlySequence<byte>
Although it's a bit cumbersome to use with a stream [2].
I downloaded a huge Firebase backup looking for a particular record.
I ended up using the “split” shell command to get a bunch of 1 GB files, then grepping for which file had the record I was looking for, then using my own custom script that scanned outward from the position of the matched text until it detected a valid, parsable JSON object within the larger unparseable file, and returned that.
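The “scan outward” step can be approximated by trying to decode a single JSON value from each “{” to the left of the match and keeping the first one that parses and still covers the match. A rough Go sketch, not the original script; the chunk name and search term are placeholders:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "os"
    )

    // extractAround walks left from the match position and, at each '{',
    // tries to decode one complete JSON value starting there. The first
    // attempt that parses and spans the matched text is returned.
    func extractAround(data []byte, matchPos int) (json.RawMessage, bool) {
        for start := matchPos; start >= 0; start-- {
            if data[start] != '{' {
                continue
            }
            dec := json.NewDecoder(bytes.NewReader(data[start:]))
            var raw json.RawMessage
            if err := dec.Decode(&raw); err != nil {
                continue // not a complete object from here; keep walking left
            }
            if start+int(dec.InputOffset()) > matchPos {
                return raw, true // the decoded object covers the match
            }
        }
        return nil, false
    }

    func main() {
        // "xaa" is a stand-in for one of the 1 GB split chunks.
        data, err := os.ReadFile("xaa")
        if err != nil {
            panic(err)
        }
        needle := []byte(`"some-record-id"`) // hypothetical search term
        if pos := bytes.Index(data, needle); pos >= 0 {
            if obj, ok := extractAround(data, pos); ok {
                fmt.Println(string(obj))
            }
        }
    }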
Back in the bad old days when XML consumers hit similar problems we’d use an event-based parser like SAX. I’m a little shocked there isn’t a mainstream equivalent for JSON — is there something I’ve missed?
For JSON, given that large files are generally record-based, ndjson is the solution I’ve encountered (http://ndjson.org/), and it works nicely with various tools out there using the .ndjson file extension.
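Since each line is an independent JSON document, you can consume it a record at a time in constant memory. A minimal Go sketch, where the file name and the "id" field are placeholders:

    package main

    import (
        "bufio"
        "encoding/json"
        "fmt"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open("records.ndjson") // hypothetical file name
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        sc := bufio.NewScanner(f)
        // Raise the scanner's line limit so long records don't error out.
        sc.Buffer(make([]byte, 0, 1024*1024), 64*1024*1024)
        for sc.Scan() {
            var rec map[string]any
            if err := json.Unmarshal(sc.Bytes(), &rec); err != nil {
                log.Fatal(err)
            }
            fmt.Println(rec["id"]) // assumes records carry an "id" field
        }
        if err := sc.Err(); err != nil {
            log.Fatal(err)
        }
    }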