
The size of the keys isn't even the problem.

The numeric value 1 is 1 byte of text, but it takes a couple dozen bytes once wrapped in a native numeric object in, say, Python. In C, it would be 8 bytes to get it into an int64, unless the parser does something fancy (and if there are values with 1 and 2 digits, it would at least need a 2-byte int holder).

The string "x" is one byte, but it'll take two bytes in C ("x\0").

And so on...
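In CPython, for instance, `sys.getsizeof` makes these object overheads visible directly (exact numbers vary by interpreter version and platform; the comments give typical 64-bit CPython 3.x values):

```python
import sys

# A parsed JSON number: one digit of text becomes a full heap object.
print(sys.getsizeof(1))    # typically 28 bytes on 64-bit CPython 3.x

# A one-character string drags a whole string header along with it.
print(sys.getsizeof("x"))  # typically ~50 bytes

# Even an empty container is far from free.
print(sys.getsizeof({}))   # typically 64 bytes
```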



You're all massively underestimating the memory requirements of a parsed data structure, which is very entertaining to watch. There's a rule of thumb that most programmers misjudge performance costs by at least a couple of orders of magnitude, and it's on full display here.

In most languages, dynamically allocated variable-sized objects have a minimum overhead of two pointer-sized fields: a pointer and a length. On a 64-bit platform, that's 16 bytes before you actually start having any data.

Next, depending on the allocator used, the requested data size is probably rounded up, either to the next power of 2, or to a minimum allocation of one pointer's width. Rounding up is the safe, compatible approach, and it also has performance advantages on most platforms.

Last but not least, most heap-based allocators have some "bookkeeping" overheads. Similarly, interpreted languages also have their internal "object" metadata overheads. Typically this is 1 or 2 extra pointers (+8 or +16 bytes).

Assuming 24-48 bytes for all variable-length values is actually a pretty safe bet!

There are exceptions:

Unusually, C skips the 'length' value for strings by using null-termination, so C strings are often just 16 bytes (8 for the pointer, and 8 for the allocated object on the heap, assuming some sort of small-object optimisation is going on).

The approach in C++ is to use "small string optimisation" where the string values are inlined into the std::string structure itself. This works up to 23 bytes packed into the 24-byte string structure. There's an awesome CppCon presentation on how Andrei Alexandrescu did this optimisation at Facebook: https://www.youtube.com/watch?v=kPR8h4-qZdk
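CPython's "compact" strings show a similar inlining effect: the character data lives inside the object header itself rather than behind a separate pointer, so each extra ASCII character costs exactly one byte on top of the fixed header:

```python
import sys

print(sys.getsizeof(""))                          # header only, ~49 bytes
print(sys.getsizeof("ab") - sys.getsizeof("a"))   # 1 byte per extra ASCII char
```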

Interpreted or "VM" languages like JavaScript, C# or Java are far worse than this. For one, they store strings as UTF-16, doubling the bytes per character of typical "ASCII" identifiers compared to the UTF-8 source. JavaScript represents numbers as 64-bit floats. Java adds object-header overheads to everything. Etc...

Update: I just did an experiment with .NET 6

    Allocating ~1 billion characters as 100M strings
        800,000,056 bytes (0.7 GB) for holding the strings.
     14,021,088,768 bytes (13.1 GB) for the strings themselves
     14,821,216,680 bytes (13.8 GB) total
Unsurprisingly, simply "referencing" (holding on to) the strings needs an 8-byte pointer per string. The actual strings hold 1-20 characters randomly, but require 140 bytes in memory on average. This bloats out the original 1 GB to just under 14 GB in memory.
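The same experiment is easy to reproduce at a smaller scale in CPython (a sketch; exact totals depend on interpreter version, and `sys.getsizeof` here counts only shallow sizes, not allocator rounding):

```python
import random
import sys

N = 100_000
# 1-20 random lowercase characters per string, as in the .NET test above.
strings = [
    "".join(random.choice("abcdefghij") for _ in range(random.randint(1, 20)))
    for _ in range(N)
]

chars = sum(len(s) for s in strings)               # raw character data
refs = sys.getsizeof(strings)                      # the list of references
objects = sum(sys.getsizeof(s) for s in strings)   # the string objects

print(f"{chars} chars of data, {refs + objects} bytes in memory")
print(f"average {objects / N:.0f} bytes per string object")
```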


I have been hit by this many times. Do you know of any good resources on how to reduce this overhead? (Choosing other data structures, choosing other string types etc)


There's a whole range of techniques! Many of them are language dependent, and won't be applicable elsewhere.

A blunt instrument is running your code in 32-bit mode, which halves pointer sizes. The downside is it also limits your maximum data size, and blocks the use of the 64-bit instruction sets that generally speed up data processing.

Java has an interesting hybrid mode (compressed oops) where it uses 32-bit object references while still running in 64-bit mode.

Not storing the entire decoded document at once is usually the recommended approach: "streaming" parsers give you one element at a time, letting you throw away data as soon as you've finished processing it.

The downside is that it makes certain common idioms impossible or difficult. For example, MVC web architectures generally assume that the controller produces a "complete" model object that then gets passed to the view.
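For line-delimited JSON (NDJSON), the standard library alone is enough to sketch the streaming idea: parse one record, process it, and let it be collected before touching the next (a dedicated streaming parser like ijson generalises this to single large documents):

```python
import io
import json

# Stand-in for a large NDJSON file on disk.
ndjson = io.StringIO('{"id": 1, "v": 10}\n{"id": 2, "v": 20}\n')

total = 0
for line in ndjson:          # only one record is live at a time
    record = json.loads(line)
    total += record["v"]
    # `record` becomes garbage here; the whole file is never held at once.

print(total)  # 30
```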

Unfortunately, language-level support for efficient data processing is generally lacking. Rust and C++ are okay, but have annoying gaps in their capabilities.

A common trick with something like parsing is to keep the original encoded string to be decoded "as-is", and then simply reference into it using integer indexes. I.e.: don't copy the strings out individually, instead just use "slices" into the original document string.
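A minimal sketch of that slicing idea in Python, recording (start, end) index pairs into the original document instead of copying substrings out (the quote-scanning "parser" here is deliberately toy-grade):

```python
doc = '{"name": "alice", "city": "berlin"}'

# Instead of copying out each string, record where it lives in `doc`.
tokens = []   # (start, end) pairs into `doc`
i = 0
while True:
    start = doc.find('"', i)
    if start == -1:
        break
    end = doc.find('"', start + 1)
    tokens.append((start + 1, end))   # span of the unquoted contents
    i = end + 1

# Materialise a real string only when it's actually needed.
print([doc[s:e] for s, e in tokens])  # ['name', 'alice', 'city', 'berlin']
```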

Another common trick is string interning: duplicate strings are detected with a hash table and stored only once. This works better in garbage-collected languages like C#, Java, and JavaScript; with Rust or C++ you have to use reference counting or other tricks.
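Python exposes this directly via `sys.intern`, which keeps one canonical copy per distinct string value (CPython's JSON parser also deduplicates repeated object keys within a single document, which is the same idea):

```python
import sys

prefix = "user_"
a = prefix + "id"   # built at runtime: a fresh string object
b = prefix + "id"   # equal contents, but a second object
print(a is b)       # False: two copies in memory

a, b = sys.intern(a), sys.intern(b)
print(a is b)       # True: one shared, canonical copy
```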


In the text representation of JSON, the string “x” is three characters. Short of using a fixed-size length field, C’s system is about as efficient as it gets in terms of space overhead.


You'd also need to add the overhead of the in-memory representation, as jiggawatts says: you start from 8 bytes just for the pointer (already more than double that 3-byte textual representation)! And the allocator isn't going to hand out a single byte (unless you do something fancy with your allocator); it'll allocate at least 8 bytes for the string, 2 of which are the "x" and the null terminator. So 3 bytes of JSON become 16 bytes in memory for a single 1-byte value.

And that's in C, where the length calculations and the type management is up to you! It gets worse soon...


There’s a weird sort of apples to oranges comparison going on here. Having pointers to separately allocated strings gives you much more functionality than an unindexed blob of JSON text. That obviously has a cost in terms of memory.

But if you need to, you can represent the same data as a contiguous block of C-style formatted structures, in which case the “x” wouldn’t need more than two or three bytes depending on the representation you choose. In fact, there are formats designed for this exact purpose. The idea that JSON is somehow surprisingly efficient for what it does is just silly.
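Python's struct module can sketch that contiguous, fixed-layout representation (the field layout here is invented for the example): fields pack edge to edge in one flat buffer, with no per-field objects at all.

```python
import struct

# One record: 4-byte int id, 1-byte flag, 2-byte count, little-endian, unpadded.
RECORD = struct.Struct("<iBH")

records = [(1, 0, 10), (2, 1, 20), (3, 0, 30)]
blob = b"".join(RECORD.pack(*r) for r in records)

print(RECORD.size)  # 7 bytes per record
print(len(blob))    # 21 bytes total for all three records

# Decode on demand, straight out of the flat buffer.
print(RECORD.unpack_from(blob, RECORD.size))  # (2, 1, 20)
```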



