I have been bitten by this many times. Null characters in strings, as weird as it sounds, happen in real life. It just happens.
The pattern I have seen the most is logs. More often than not, logs embed binary chunks, which may contain the infamous null character. However, the log as a whole is still a string. For example, this happens with web server logs (the binary data tends to be attack attempts).
Because (obviously) this can't happen in PostgreSQL databases, I have experienced it while migrating to PostgreSQL from both Oracle and MongoDB. Nulls in both: one in logs, the other as part of a user name (!!!). But still there.
PostgreSQL must support this. It is long overdue and badly needed, especially to help database migrations. The author is clearly on point.
A log containing binary chunks is not a string, because it will probably contain invalid characters, at least if your string encoding has the concept of invalid characters, as UTF-8 does. Since logs usually aren't in any fixed encoding but in whatever the currently running process uses, even the textual parts of your log are unlikely to always conform to UTF-8. So blob would be the appropriate data type anyway.
Logs, unless specified otherwise, are assumed to be in text format. Even if they may contain binary blobs (which are normally unexpected, and not really conformant), they are still supposed to be text.
Surely you can store anything in a bytea. However, that won't take you very far if you want to query the logs (partially or completely) from the database --which is indeed extremely useful.
In any case, Postgres has always strived for compliance. And here it is rejecting a valid UTF-8 character, hence being non-compliant.
I switched an in-house database written in C to a variant of UTF-8b where the NUL byte is also surrogate-escaped, mostly because of the pervasive NUL-termination issues mentioned in the OP.
Escaping NULs wasn't the only reason we chose that encoding. UTF-8b correctly roundtrips not-quite-UTF8 data instead of erroring out. That's useful when you don't have full control over the data generation process, or expect rare corruption (e.g., in log data or catastrophic error reports).
A UTF-8 column type is not just a storage choice, it's also a constraint on your data... if it doesn't make sense to forbid NULs, does it make sense to demand that the data be well-formed UTF-8?
Surrogate escaping / UTF-8b is kinda genius. Unicode has two reserved ranges of 1,024 code points each ("surrogates"), reserved for the UTF-16 encoding, where they are used for encoding characters outside the BMP. Surrogate escaping realizes that 1) these never appear in valid UTF-8, because UTF-8 is a variable-length encoding and doesn't need surrogates, and 2) you can therefore use them to pass binary data through as UTF-8.
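For what it's worth, Python ships this as the `surrogateescape` error handler, which implements essentially this scheme: each byte 0xNN that isn't valid UTF-8 decodes to the lone surrogate U+DCNN. (Note it leaves NUL alone, since 0x00 is valid UTF-8, unlike the variant a couple of comments up.) A minimal round-trip sketch:

    raw = b"valid text \xff\xfe invalid bytes"   # not well-formed UTF-8

    decoded = raw.decode("utf-8", errors="surrogateescape")
    # No UnicodeDecodeError: the bad bytes survive as U+DCFF, U+DCFE.

    assert decoded.encode("utf-8", errors="surrogateescape") == raw  # lossless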
I run a large-scale web analysis platform, and the amount of effort I've had to put into handling this is immense. You don't appreciate how good browsers are at handling malformed code until you try to ingest tens of millions of pages and scripts. I switched to Postgres partly to get away from MySQL's really poor handling of emoji, and got this bug in its place.
The Unicode standard has made bad choices over the years. "Because it's valid Unicode" is not a good reason to do anything. It's better to detect errors in strings sooner rather than later.
If you want to carry arbitrary valid UTF-8 strings, you might as well treat them as binary blobs and solve the issue.
Nothing. If you accept only a workable, specific subset of Unicode and validate input, you get that point back. Using UTF-8 encoding for text is almost always the right thing to do; treating it as a generic Unicode string is usually wrong.
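As a sketch of what such input validation might look like (the policy here is hypothetical: reject NUL and all other control characters except tab and newline):

    import unicodedata

    def validate_text(s: str) -> str:
        # Hypothetical policy: allow tab/newline, reject every other
        # control character (Unicode category "Cc"), including U+0000.
        for ch in s:
            if unicodedata.category(ch) == "Cc" and ch not in "\t\n":
                raise ValueError(f"control character U+{ord(ch):04X} not allowed")
        return s

    validate_text("John Doe")       # fine
    validate_text("John \x00 Doe")  # raises ValueError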
Unicode is as complex as UTF-8 encoding is simple and neat.
Joel Spolsky studied Unicode and was so sure he understood it that he wrote a short intro for programmers without knowing that code points in Unicode don't correspond to "platonic ideal letters", aka user-perceived characters. They merely happened to correspond in every example he used (a user-perceived character may actually be a sequence of code points).
Yeah. There are probably all kinds of obscure use cases for ASCII, but generally, any text meant to be human-readable should be Unicode-encoded, whether you call it a "string" or a "blob".
In the case of Postgres, you need string types if you intend to do any string operations on those strings in the database. As mentioned in the article, you also have no choice if you want to use jsonb.
The same as the argument for allowing BEL, DEL, BOM, and any other control codes or special characters: implementing a standard as described, with minimal divergence or unexpected behaviour, to ensure compatibility with arbitrary external software.
I think the argument was just “text is incredibly complex with tons of obscure rules once you accept not-English-Latin text, so it’s best not to assume anything and just go by the standard.”
While I agree with you that this is one of the arguments the author puts forward, it does little to convince me.
I can support that line of thought if we want to talk, for instance, about why Rust doesn't provide "proper" iteration over characters [1] in its standard library. But in this case the null character is an artificial construct that does not naturally exist in any language on Earth.
I sincerely doubt someone will ever find a naturally-occurring "John \0 Doe". What I don't doubt is that such a name occurs in a badly-migrated database somewhere, but that's a different discussion.
Actually, I can think of one place where I know I've seen explicit null characters in text, and that's a Facebook Business Record PDF. If you download your FB data, the text instructions of all the PDF raw content streams actually have a null character between EVERY SINGLE CHARACTER!
I found it extremely annoying at first because I was trying to copy/paste the stream chunks around and it wouldn't copy anything after the first null. Then I realized this was probably a security hack in the hope that people couldn't copy the data around (I can't think of any other reason to add nulls like this). Funnily enough, I opened the PDF in Chrome and copy/paste of the selected text works fine. So clearly some readers strip these bad characters, but I can imagine others might not.
It's only the text instructions that have this, not the rest of the text. I.e. one line of the content looks like this, where it's trying to write the text 'Service':
BT 0 Tr 0.000000 w ET BT 44.814370 775.487087 Td [(\0S\0e\0r\0v\0i\0c\0e)] TJ ET
But anyhow, in short: that's simply UTF-16BE text being represented as a series of bytes. It's nothing to do with any kind of "security hack", and the null bytes are not "bad characters", they're the high byte of each UTF-16 code unit.
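You can see this directly in Python:

    # Every BMP character in UTF-16BE takes two bytes; for ASCII text the
    # high byte is always 0x00, which produces the interleaved "nulls".
    print("Service".encode("utf-16-be"))  # b'\x00S\x00e\x00r\x00v\x00i\x00c\x00e'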
And this is a perfect example of why text should just be treated as blobs that can be displayed via OS functions. General application developers shouldn't have to be experts in every flavour of Unicode encoding just to display some writing.
As weird and inefficient as it may seem at first glance, JSON in a database can have a variety of uses, so I certainly see the benefit.
But… is there any use for a null byte explicitly in a JSON string? The de facto standard (easy to use and instantly recognisable) for blobs seems to be base64. I can't think of any meaningful benefit other than data efficiency (which is easily addressed by storing the blob directly in the database).
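A sketch of that de facto convention, using nothing beyond Python's standard library:

    import base64, json

    blob = bytes(range(256))  # arbitrary binary data, NUL byte included

    # Base64 keeps the JSON string plain ASCII text.
    doc = json.dumps({"payload": base64.b64encode(blob).decode("ascii")})

    restored = base64.b64decode(json.loads(doc)["payload"])
    assert restored == blob  # lossless round trip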
That second one is probably the more important one. A null byte in plain text is probably an error. A null byte in a json string may be intentional if perhaps questionable. JSON strings are frequently abused for various transport formats.
If I were responsible for ingesting and returning JSON strings from a server, my attitude towards people who require nulls in them would be rather juvenile and brash - "lmao fuck em".
JSON carrying important and meaningful data in strings (i.e. blobs in strings) is usually just an exploit of the non-conformance of parsers. JSON strings are defined by ECMA-404 to be sequences of Unicode code points. Arbitrary binary data isn't a sequence of Unicode code points. However, that's what JSON strings are usually used for, incorrectly.
If you use JSON correctly, a JSON string is really just a UTF-8 string. Leaving out the null bytes there would be annoying, yes, but usually doesn't hurt its use as a string...
A UTF-8 string can perfectly well contain the character U+0000, which will be encoded as a null byte 0x00.
So just "leaving out the null bytes" amounts to changing the string to a different string. Sometimes that may be what you want, but it doesn't sound like a good idea in general.
That was a bit of a rabbit hole. As it turns out, neither the ASCII NUL character nor its Unicode equivalent has the semantic meaning "end of stream". ASCII does have control characters for "end of stream", in fact multiple: 0x03 End Of Text, 0x04 End Of Transmission, and 0x17 End Transmission Block. But 0x00 was never intended to be an end marker of any kind.
It's really only the C standard that (ab)uses NUL as an end-of-string marker. So yes, after consideration, I'm starting to agree with you. U+0000 is a valid character and should be allowed in text fields.
There are lots of programming languages where the "string" type is fine with nulls anywhere in the "string". And other databases allow it. Perhaps they shouldn't, but it's a sort of de facto standard.
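Python is one example: to its `str` type, U+0000 is just another character.

    s = "before\x00after"
    print(len(s))           # 12 -- the NUL counts like any other character
    print(s.split("\x00"))  # ['before', 'after'] -- string operations work normally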
In my last school year, after three months of doing security CTFs, and for all my C projects since, I started carrying the string's length along with the string and did everything I could to avoid relying on \0. I think almost all non-deprecated C stream functions give you the string length (or rather the number of bytes received).
The reasons are multiple: it's faster to use write() than printf() (and you don't have to flush to synchronize), it's way harder to exploit, and it's easier to move strings between buffers. And I got to write my own string library during my project, and that was fun.
Oh no please don't. Null characters in strings wreak all kinds of havoc on applications. Smuggling data in and out of somewhere because the validation only looks at the start of the string? Padding for fake checksums and signatures because half the string isn't shown and is fair game? Length screw-ups because strlen reports something different from the actual size? Of course one could dismiss these as problems with legacy code, but it ain't so. Syscalls in most OSes only handle null-terminated strings. The same goes for a lot of network protocols, where a null character is a terminator or separator.
And for what benefit? The only argument I've seen is "UTF8 doesn't forbid it". But that is quite weak: UTF-8 doesn't prescribe it either, and for the end user the null character has no meaning or representation that is of any use beyond end-of-string. UTF-8 absolutely works fine without null characters.
And if you really want arbitrary binary data, use a blob; that's what they're for.
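The strlen mismatch mentioned above is easy to demonstrate, here from Python via `ctypes` (a POSIX-only sketch; `CDLL(None)` exposes the C library of the running process):

    import ctypes

    libc = ctypes.CDLL(None)
    libc.strlen.restype = ctypes.c_size_t
    libc.strlen.argtypes = [ctypes.c_char_p]

    data = b"user\x00admin"
    print(len(data))          # 10 -- the real length
    print(libc.strlen(data))  # 4  -- C stops at the first NUL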
Don't what? The author isn't telling people to put NUL characters in their data; they're telling Postgres not to abort in the presence of a NUL character.
The main job of a database is to store and retrieve data. Erroring-out for valid UTF-8 strings can itself "wreak all kinds of havoc on applications", since it may be unexpected behaviour.
Personally, my expectation would be for a unicode string type to store any unicode string; and for data to be retrieved without modification. I would expect the same for any "generic plumbing" technology, whether it's a database, a programming language, a file system, a data processing command or API, etc.
As a side note, I tend to use property-based testing, which is especially good at ensuring NUL characters are handled correctly (property-based testing uses randomly generated inputs, starting with "small" values, and "shrinking" counterexamples; NUL is considered the "smallest" character, so it appears quite often).
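For illustration, here is how such a property might look with the Hypothesis library in Python; `store_and_fetch` is a hypothetical stand-in for a real round trip through the database:

    from hypothesis import given, strategies as st

    def store_and_fetch(s: str) -> str:
        # Hypothetical stand-in: replace with "write to storage, read back".
        return s

    # st.text() draws from (nearly) all of Unicode, U+0000 included,
    # so NUL handling gets exercised regularly.
    @given(st.text())
    def test_strings_survive_storage(s):
        assert store_and_fetch(s) == s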
The database throwing an exception can be unexpected, but it is usually safe, in that it doesn't create a lot of exploit opportunities. Allowing null bytes in strings is far worse: it is well known to be exploitable in numerous situations, and historically has been. For example, lots of ASN.1 and X.509 certificate exploits have been caused by allowing null bytes in strings there.
Those issues aren't generally due to NUL bytes in strings, but rather to treating UTF-8 strings as C `char*` (or other mistreatment of "compatible" encodings). C code cannot handle UTF-8 strings as `char*`, even if you've got a UTF-8 locale (e.g. `LANG=en_US.UTF-8`). You have to convert multibyte strings to wide-character strings (e.g. using `mbrtowc` to get a `wchar_t*`) and then operate on those. C has multiple string types, no good type safety for them, and confusing them leads to vulnerabilities.
AFAIK (I just browsed https://unicode.org/main.html for a while to check), UTF-8 doesn't prescribe that 'a' is a valid character either (or only just as much as it prescribes that 0x00 is one). Yet saying your code supports UTF-8 while rejecting 'a' is not a good idea.
And yes, they do wreak havoc, but no more than ASCII strings containing null characters.
Not to mention “inflammable”, which means the same as “flammable”. Because the former is so easily misunderstood – and has been – warning labels now use the latter.
Try "inter-" and "intra-", in a Boston accent. It makes a big difference if your analysis is interprocedural or intraprocedural, but they both sound the same in that accent...
Funny, I have been making a 'data assembler' for AVR8 program memory which serializes a graph of data structures, and all strings and arrays start with the length.
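For illustration, the same length-prefixed idea sketched in Python (the 2-byte little-endian layout is hypothetical):

    import struct

    def pack_str(s: str) -> bytes:
        # Length prefix (2-byte little-endian), then the UTF-8 bytes.
        # Embedded NULs are no problem: the length says where the string ends.
        data = s.encode("utf-8")
        return struct.pack("<H", len(data)) + data

    def unpack_str(buf: bytes, offset: int = 0) -> tuple[str, int]:
        (n,) = struct.unpack_from("<H", buf, offset)
        start = offset + 2
        return buf[start:start + n].decode("utf-8"), start + n

    print(unpack_str(pack_str("Ser\x00vice")))  # ('Ser\x00vice', 10)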
I actually raised this topic in 2016: https://www.postgresql.org/message-id/5aa1df8a-96f5-1d14-46f... But the conclusion was along the lines of "it would be too complicated; closing as WON'T DO".