I have been bitten by this many times. Null characters in strings, as weird as it sounds, happen in real life. It just happens.
The pattern I have seen the most is logs. More often than not, logs embed binary chunks, which may contain the infamous null character. However, the log as a whole is still a string. For example, this happens with web server logs (the binary data tends to be attack attempts).
Because (obviously) this can't happen in PostgreSQL databases, I have experienced it while migrating to PostgreSQL from both Oracle and MongoDB. Nulls in both: one in logs, the other as part of a user name (!!!). But still there.
PostgreSQL must support this. It is long overdue and badly needed, especially to help database migrations. The author is clearly on point.
A log containing binary chunks is not a string, because it will probably contain invalid characters, at least if your string encoding has the concept of invalid characters, as UTF-8 does. Since logs usually aren't in any fixed encoding but in whatever the currently running process uses, even the textual parts of your log are unlikely to always conform to UTF-8. So blob would be the appropriate data type anyway.
Logs, unless specified otherwise, are assumed to be in text format. Even if they may contain binary blobs (which are normally unexpected, and not really conformant), they are still supposed to be text.
Surely you can store anything in a bytea. However, that won't take you very far if you want to query the logs (partially or completely) from the database --which is indeed extremely useful.
In any case, Postgres has always strived for compliance. And here it is rejecting a valid UTF-8 character, hence being non-compliant.
I switched an in-house database written in C to a variant of UTF-8b where the NUL byte is also surrogate-escaped, mostly because of the pervasive NUL-termination issues mentioned in the OP.
Escaping NULs wasn't the only reason we chose that encoding. UTF-8b correctly roundtrips not-quite-UTF8 data instead of erroring out. That's useful when you don't have full control over the data generation process, or expect rare corruption (e.g., in log data or catastrophic error reports).
A UTF-8 column type is not just a storage choice, it's also a constraint on your data... if it doesn't make sense to forbid NULs, does it make sense to demand that the data be well-formed UTF-8?
Surrogate escaping / UTF-8b is kinda genius. Unicode has two reserved ranges of 1,024 code points each ("surrogates"), reserved for the UTF-16 encoding, where they are used for encoding characters outside the BMP. Surrogate escaping realizes that 1) these never appear in valid UTF-8, because UTF-8 is a variable-length encoding and doesn't need surrogates, and 2) you can therefore use them to pass binary data through as UTF-8.
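For what it's worth, Python ships this as the `surrogateescape` error handler, which implements essentially this scheme: each byte 0xNN that isn't valid UTF-8 decodes to the lone surrogate U+DCNN. (Note it leaves NUL alone, since 0x00 is valid UTF-8, unlike the variant a couple of comments up.) A minimal round-trip sketch:

    raw = b"valid text \xff\xfe invalid bytes"   # not well-formed UTF-8

    decoded = raw.decode("utf-8", errors="surrogateescape")
    # No UnicodeDecodeError: the bad bytes survive as U+DCFF, U+DCFE.

    assert decoded.encode("utf-8", errors="surrogateescape") == raw  # lossless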
I run a large-scale web analysis platform, and the amount of effort I've had to put into handling this is immense. You don't appreciate how good browsers are at handling malformed code until you try to ingest tens of millions of pages and scripts. I switched to Postgres partly to get away from MySQL's really poor handling of emoji, and got this bug in its place.
The Unicode standard has made bad choices over the years. "Because it's valid Unicode" is not a good reason to do anything. It's better to detect errors in strings sooner rather than later.
If you want to carry arbitrary valid UTF-8 strings, you might as well treat them as binary blobs and solve the issue.
Nothing. If you accept only a workable, specific subset of Unicode and validate input, you get that point back. Using UTF-8 encoding for text is almost always the right thing to do; treating it as a generic Unicode string is usually wrong.
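As a sketch of what such input validation might look like (the policy here is hypothetical: reject NUL and all other control characters except tab and newline):

    import unicodedata

    def validate_text(s: str) -> str:
        # Hypothetical policy: allow tab/newline, reject every other
        # control character (Unicode category "Cc"), including U+0000.
        for ch in s:
            if unicodedata.category(ch) == "Cc" and ch not in "\t\n":
                raise ValueError(f"control character U+{ord(ch):04X} not allowed")
        return s

    validate_text("John Doe")       # fine
    validate_text("John \x00 Doe")  # raises ValueError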
Unicode is as complex as UTF-8 encoding is simple and neat.
Joel Spolsky studied Unicode and was so sure he understood it that he wrote a short intro for programmers without knowing that code points in Unicode don't correspond to "platonic ideal letters", aka user-perceived characters. They merely happened to correspond in every example he used (a user-perceived character may actually be a sequence of code points).
Yeah. There are probably all kinds of obscure use cases for ASCII, but generally, any text meant to be human-readable should be Unicode-encoded, whether you call it a "string" or a "blob".
In the case of Postgres, you need string types if you intend to do any string operations on those strings in the database. As mentioned in the article, you also have no choice if you want to use jsonb.
The same as the argument for allowing BEL, DEL, BOM, and any other control codes or special characters: implementing a standard as described, with minimal divergence or unexpected behaviour, to ensure compatibility with arbitrary external software.
I think the argument was just “text is incredibly complex with tons of obscure rules once you accept not-English-Latin text, so it’s best not to assume anything and just go by the standard.”
While I agree with you that this is one of the arguments the author puts forward, it does little to convince me.
I can support that line of thought if we want to talk, for instance, about why Rust doesn't provide "proper" iteration over characters [1] in its standard library. But in this case the null character is an artificial construct that does not naturally exist in any language on Earth.
I sincerely doubt someone will ever find a naturally-occurring "John \0 Doe". What I don't doubt is that such a name occurs in a badly-migrated database somewhere, but that's a different discussion.
Actually, I can think of one place where I know I've seen explicit null characters in text, and that's a Facebook Business Record PDF. If you download your FB data, the text instructions of all the PDF raw content streams actually have a null character between EVERY SINGLE CHARACTER!
I found it extremely annoying at first because I was trying to copy/paste the stream chunks around and it wouldn't copy anything after the first null. Then I realized this was probably a security hack in the hope that people couldn't copy the data around (I can't think of any other reason to add nulls like this). Funnily enough, I opened the PDF in Chrome and copy/paste of the selected text works fine. So clearly some readers strip these bad characters, but I can imagine others might not.
It's only the text instructions that have this, not the rest of the text. I.e. one line of the content looks like this, where it's trying to write the text 'Service':
BT 0 Tr 0.000000 w ET BT 44.814370 775.487087 Td [(\0S\0e\0r\0v\0i\0c\0e)] TJ ET
But anyhow, in short: that's simply UTF-16BE text being represented as a series of bytes. It's nothing to do with any kind of "security hack", and the null bytes are not "bad characters", they're the high byte of each UTF-16 code unit.
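You can see this directly in Python:

    # Every BMP character in UTF-16BE takes two bytes; for ASCII text the
    # high byte is always 0x00, which produces the interleaved "nulls".
    print("Service".encode("utf-16-be"))  # b'\x00S\x00e\x00r\x00v\x00i\x00c\x00e'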
And this is a perfect example of why text should just be treated as blobs that can be displayed via OS functions. General application developers shouldn't have to be experts in every flavour of Unicode encoding just to display some writing.
As weird and inefficient as it may seem at first glance, JSON in a database can have a variety of uses, so I certainly see the benefit.
But… is there any use for a null byte explicitly in a JSON string? The de facto standard (easy to use and instantly recognisable) for blobs seems to be base64. I can't think of any meaningful benefit other than data efficiency (which is easily addressed by storing the blob directly in the database).
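A sketch of that de facto convention, using nothing beyond Python's standard library:

    import base64, json

    blob = bytes(range(256))  # arbitrary binary data, NUL byte included

    # Base64 keeps the JSON string plain ASCII text.
    doc = json.dumps({"payload": base64.b64encode(blob).decode("ascii")})

    restored = base64.b64decode(json.loads(doc)["payload"])
    assert restored == blob  # lossless round trip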
That second one is probably the more important one. A null byte in plain text is probably an error. A null byte in a json string may be intentional if perhaps questionable. JSON strings are frequently abused for various transport formats.
If I were responsible for ingesting and returning JSON strings from a server, my attitude towards people who require nulls in them would be rather juvenile and brash - "lmao fuck em".
JSON carrying important and meaningful data in strings (i.e. blobs in strings) is usually just an exploit of the non-conformance of parsers. JSON strings are defined by ECMA-404 to be sequences of Unicode code points. Arbitrary binary data isn't a sequence of Unicode code points. However, that's what JSON strings are usually used for, incorrectly.
If you use JSON correctly, a JSON string is really just a UTF-8 string. Leaving out the null bytes there would be annoying, yes, but usually doesn't hurt its use as a string...
A UTF-8 string can perfectly well contain the character U+0000, which will be encoded as a null byte 0x00.
So just "leaving out the null bytes" amounts to changing the string to a different string. Sometimes that may be what you want, but it doesn't sound like a good idea in general.
That was a bit of a rabbit hole. As it turns out, neither the ASCII NUL character nor its Unicode equivalent has the semantic meaning "end of stream". ASCII does have control characters for "end of stream", in fact multiple: 0x03 End Of Text, 0x04 End Of Transmission, and 0x17 End Transmission Block. But 0x00 was never intended to be an end marker of any kind.
It's really only the C standard that (ab)uses NUL as an end-of-string marker. So yes, after consideration, I'm starting to agree with you. U+0000 is a valid character and should be allowed in text fields.
There are lots of programming languages where the "string" type is fine with nulls anywhere in the "string". And other databases allow it. Perhaps they shouldn't, but it's a sort of de facto standard.
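Python is one example: to its `str` type, U+0000 is just another character.

    s = "before\x00after"
    print(len(s))           # 12 -- the NUL counts like any other character
    print(s.split("\x00"))  # ['before', 'after'] -- string operations work normally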
In my last school year, after three months of doing security CTFs, and for all my C projects since, I started carrying the string's length along with the string and did everything I could to avoid relying on \0. I think almost all non-deprecated C stream functions give you the string length (or rather the number of bytes received).
The reasons are multiple: it's faster to use write() than printf() (and you don't have to flush to synchronize), it's way harder to exploit, and it's easier to move strings between buffers. And I got to write my own string library during my project, and that was fun.
Oh no please don't. Null characters in strings wreak all kinds of havoc on applications. Smuggling data in and out of somewhere because the validation only looks at the start of the string? Padding for fake checksums and signatures because half the string isn't shown and is fair game? Length screw-ups because strlen reports something different from the actual size? Of course one could dismiss these as problems with legacy code, but it ain't so. Syscalls in most OSes only handle null-terminated strings. The same goes for a lot of network protocols, where a null character is a terminator or separator.
And for what benefit? The only argument I've seen is "UTF8 doesn't forbid it". But that is quite weak: UTF-8 doesn't prescribe it either, and for the end user the null character has no meaning or representation that is of any use beyond end-of-string. UTF-8 absolutely works fine without null characters.
And if you really want arbitrary binary data, use a blob; that's what they're for.
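The strlen mismatch mentioned above is easy to demonstrate, here from Python via `ctypes` (a POSIX-only sketch; `CDLL(None)` exposes the C library of the running process):

    import ctypes

    libc = ctypes.CDLL(None)
    libc.strlen.restype = ctypes.c_size_t
    libc.strlen.argtypes = [ctypes.c_char_p]

    data = b"user\x00admin"
    print(len(data))          # 10 -- the real length
    print(libc.strlen(data))  # 4  -- C stops at the first NUL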
Don't what? The author isn't telling people to put NUL characters in their data; they're telling Postgres not to abort in the presence of a NUL character.
The main job of a database is to store and retrieve data. Erroring-out for valid UTF-8 strings can itself "wreak all kinds of havoc on applications", since it may be unexpected behaviour.
Personally, my expectation would be for a unicode string type to store any unicode string; and for data to be retrieved without modification. I would expect the same for any "generic plumbing" technology, whether it's a database, a programming language, a file system, a data processing command or API, etc.
As a side note, I tend to use property-based testing, which is especially good at ensuring NUL characters are handled correctly (property-based testing uses randomly generated inputs, starting with "small" values, and "shrinking" counterexamples; NUL is considered the "smallest" character, so it appears quite often).
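For illustration, here is how such a property might look with the Hypothesis library in Python; `store_and_fetch` is a hypothetical stand-in for a real round trip through the database:

    from hypothesis import given, strategies as st

    def store_and_fetch(s: str) -> str:
        # Hypothetical stand-in: replace with "write to storage, read back".
        return s

    # st.text() draws from (nearly) all of Unicode, U+0000 included,
    # so NUL handling gets exercised regularly.
    @given(st.text())
    def test_strings_survive_storage(s):
        assert store_and_fetch(s) == s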
The database throwing an exception can be unexpected, but it is usually safe, in that it doesn't create a lot of exploit opportunities. Allowing null bytes in strings is far worse: it is well known to be exploitable in numerous situations, and historically has been. For example, lots of ASN.1 and X.509 certificate exploits have been caused by allowing null bytes in strings there.
Those issues aren't generally due to NUL bytes in strings, but rather to treating UTF-8 strings as C `char*` (or other mistreatment of "compatible" encodings). C code cannot handle UTF-8 strings as `char*`, even if you've got a UTF-8 locale (e.g. `LANG=en_US.UTF-8`). You have to convert multibyte strings to wide-character strings (e.g. using `mbrtowc` to get a `wchar_t*`) and then operate on those. C has multiple string types, no good type safety for them, and confusing them leads to vulnerabilities.
AFAIK (I just browsed https://unicode.org/main.html for a while to check), UTF-8 doesn't prescribe that 'a' is a valid character either (or only just as much as it prescribes that 0x00 is one). Yet saying your code supports UTF-8 while rejecting 'a' is not a good idea.
And yes, they do wreak havoc, but no more than ASCII strings containing null characters.
Not to mention “inflammable”, which means the same as “flammable”. Because the former is so easily misunderstood – and has been – warning labels now use the latter.
Try "inter-" and "intra-", in a Boston accent. It makes a big difference if your analysis is interprocedural or intraprocedural, but they both sound the same in that accent...
Funny, I have been making a 'data assembler' for AVR8 program memory which serializes a graph of data structures, and all strings and arrays start with the length.
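For illustration, the same length-prefixed idea sketched in Python (the 2-byte little-endian layout is hypothetical):

    import struct

    def pack_str(s: str) -> bytes:
        # Length prefix (2-byte little-endian), then the UTF-8 bytes.
        # Embedded NULs are no problem: the length says where the string ends.
        data = s.encode("utf-8")
        return struct.pack("<H", len(data)) + data

    def unpack_str(buf: bytes, offset: int = 0) -> tuple[str, int]:
        (n,) = struct.unpack_from("<H", buf, offset)
        start = offset + 2
        return buf[start:start + n].decode("utf-8"), start + n

    print(unpack_str(pack_str("Ser\x00vice")))  # ('Ser\x00vice', 10)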
I actually raised this topic in 2016: https://www.postgresql.org/message-id/5aa1df8a-96f5-1d14-46f... But the conclusion was along the lines of "it would be too complicated; closing as WON'T DO".