Oh no, please don't. Null characters in strings wreak all kinds of havoc on applications. Smuggling data in and out of somewhere because the validation only looks at the start of the string? Padding for fake checksums and signatures because half the string isn't shown and is fair game? Length screw-ups because strlen reports something different from the actual size? Of course one could dismiss those as legacy-code problems, but they aren't: syscalls in most OSes only handle null-terminated strings, and the same goes for a lot of network protocols, where a null character is a terminator or separator.
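To make that length mismatch concrete, here is a small C sketch (the hostname is made up for illustration) of how a check built on C string functions only ever sees the bytes before the first NUL:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* 22 bytes of attacker-supplied data with an embedded NUL byte:
           any code that trusts the NUL terminator sees only the prefix. */
        const char name[] = "example.com\0.evil.test";
        size_t actual_len = sizeof name - 1;   /* 22 bytes were transmitted */
        size_t seen_len   = strlen(name);      /* 11: stops at the NUL */

        printf("transmitted %zu bytes, strlen() reports %zu\n",
               actual_len, seen_len);

        /* "Validation" that relies on the terminator approves the whole blob. */
        if (strcmp(name, "example.com") == 0) {
            puts("looks like example.com to strcmp()");
        }
        return 0;
    }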
And for what benefit? The only argument I've seen is "UTF8 doesn't forbid it". But that is quite weak; UTF-8 doesn't prescribe it either, and for the end user the null character has no meaning or representation that is of any use beyond end-of-string. UTF-8 absolutely works fine without null characters.
And if you really want arbitrary binary data, use a blob; that's what blobs are for.
Don't what? The author isn't telling people to put NUL characters in their data; they're telling Postgres not to abort in the presence of a NUL character.
The main job of a database is to store and retrieve data. Erroring out on valid UTF-8 strings can itself "wreak all kinds of havoc on applications", since it may be unexpected behaviour.
Personally, my expectation would be for a Unicode string type to store any Unicode string, and for data to be retrieved without modification. I would expect the same of any "generic plumbing" technology, whether it's a database, a programming language, a file system, a data-processing command or API, etc.
As a side note, I tend to use property-based testing, which is especially good at ensuring NUL characters are handled correctly (property-based testing uses randomly generated inputs, starting with "small" values, and "shrinking" counterexamples; NUL is considered the "smallest" character, so it appears quite often).
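As an illustration (not any particular test suite), here is a minimal hand-rolled sketch of the idea in C: generate random byte strings biased toward NUL and check a round-trip property against the file system, one of the "generic plumbing" layers mentioned above. Real frameworks such as Hypothesis (Python) or theft (C) add shrinking, which is omitted here for brevity.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Round-trip property: whatever bytes we write, we must read back verbatim. */
    static int roundtrip_ok(const unsigned char *data, size_t len) {
        FILE *f = tmpfile();
        if (!f) return 0;
        fwrite(data, 1, len, f);
        rewind(f);
        unsigned char back[256];
        size_t got = fread(back, 1, sizeof back, f);
        fclose(f);
        return got == len && memcmp(back, data, len) == 0;
    }

    int main(void) {
        srand(42);                             /* fixed seed for reproducibility */
        for (int i = 0; i < 1000; i++) {
            unsigned char buf[256];
            size_t len = (size_t)(rand() % 64);
            for (size_t j = 0; j < len; j++) {
                /* Bias toward "small" bytes so NUL shows up often,
                   mirroring how shrinking gravitates toward it. */
                buf[j] = (unsigned char)(rand() % 8 ? rand() % 4 : rand() % 256);
            }
            if (!roundtrip_ok(buf, len)) {
                fprintf(stderr, "counterexample of length %zu\n", len);
                return 1;
            }
        }
        puts("1000 random round-trips passed");
        return 0;
    }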
The database throwing an exception can be unexpected, but it is usually safe in the sense that it doesn't create many exploit opportunities. Allowing null bytes in strings is far worse; it is well known to be exploitable in numerous situations and historically has been. For example, lots of ASN.1 and X.509 certificate exploits have been caused by allowing null bytes in strings there.
Those issues aren't generally due to NUL bytes in strings, but rather to treating UTF-8 strings as C `char*` (or other mistreatment of "compatible" encodings). C code cannot handle UTF-8 strings as `char*`, even if you've got a UTF-8 locale (eg `LANG=en_US.UTF-8`). You have to convert multibyte strings to wide-character strings (eg using `mbrtowc` to get a `wchar_t*`) and then operate on those. C has multiple string types, no good type safety for them, and confusing them leads to vulnerabilities.
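For illustration, a minimal sketch of that approach, assuming a UTF-8 locale named en_US.UTF-8 is installed: decode a length-delimited buffer with `mbrtowc`, which keeps working across an embedded NUL because it never relies on a terminator.

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void) {
        /* Assumes this locale exists; adjust the name for your system. */
        if (!setlocale(LC_ALL, "en_US.UTF-8")) {
            fputs("UTF-8 locale not available\n", stderr);
            return 1;
        }

        /* A length-delimited UTF-8 buffer with an embedded NUL:
           "héllo", NUL, "wörld". */
        const char buf[] = "h\xc3\xa9llo\0w\xc3\xb6rld";
        size_t len = sizeof buf - 1;   /* 13 bytes; strlen() would claim 6 */

        mbstate_t st = {0};
        for (size_t i = 0; i < len; ) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, buf + i, len - i, &st);
            if (n == (size_t)-1 || n == (size_t)-2)  /* invalid or truncated */
                return 1;
            printf("U+%04lX\n", (unsigned long)wc);
            i += (n == 0) ? 1 : n;     /* a return of 0 means the NUL itself */
        }
        return 0;
    }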
AFAIK (just browsed https://unicode.org/main.html for a while to check), UTF-8 doesn’t proscribe that ‘a’ is a valid character either (or just as much as it proscribes that 0x00 is one). Yet, saying your code supports UTF-8, but rejecting ‘a’ is not a good idea.
And yes, they do wreak havoc, but no more than ASCII strings containing null characters.
Not to mention “inflammable”, which means the same as “flammable”. Because the former is so easily misunderstood – and has been – warning labels now use the latter.
Try "inter-" and "intra-", in a Boston accent. It makes a big difference if your analysis is interprocedural or intraprocedural, but they both sound the same in that accent...