Oh no, please don't. Null characters in strings wreak all kinds of havoc on applications. Smuggling data in and out of somewhere because the validation only looks at the start of the string? Padding for fake checksums and signatures because half the string isn't shown and is fair game? Length screw-ups because strlen reports something different from the actual size? Of course one could dismiss those as legacy-code problems, but they aren't: syscalls in most OSes only handle null-terminated strings, and the same goes for a lot of network protocols, where a null character is a terminator or separator.
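To make that length mismatch concrete, here is a small C sketch (the hostname is made up for illustration) of how a check built on C string functions only ever sees the bytes before the first NUL:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* 22 bytes of attacker-supplied data with an embedded NUL byte:
           any code that trusts the NUL terminator sees only the prefix. */
        const char name[] = "example.com\0.evil.test";
        size_t actual_len = sizeof name - 1;   /* 22 bytes were transmitted */
        size_t seen_len   = strlen(name);      /* 11: stops at the NUL */

        printf("transmitted %zu bytes, strlen() reports %zu\n",
               actual_len, seen_len);

        /* "Validation" that relies on the terminator approves the whole blob. */
        if (strcmp(name, "example.com") == 0) {
            puts("looks like example.com to strcmp()");
        }
        return 0;
    }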
And for what benefit? The only argument I've seen is "UTF8 doesn't forbid it". But that is quite weak; UTF-8 doesn't prescribe it either, and for the end user the null character has no meaning or representation that is of any use beyond end-of-string. UTF-8 absolutely works fine without null characters.
And if you really want arbitrary binary data, use a blob; that's what blobs are for.
Don't what? The author isn't telling people to put NUL characters in their data; they're telling Postgres not to abort in the presence of a NUL character.
The main job of a database is to store and retrieve data. Erroring out on valid UTF-8 strings can itself "wreak all kinds of havoc on applications", since it may be unexpected behaviour.
Personally, my expectation would be for a Unicode string type to store any Unicode string, and for data to be retrieved without modification. I would expect the same of any "generic plumbing" technology, whether it's a database, a programming language, a file system, a data-processing command or API, etc.
As a side note, I tend to use property-based testing, which is especially good at ensuring NUL characters are handled correctly (property-based testing uses randomly generated inputs, starting with "small" values, and "shrinking" counterexamples; NUL is considered the "smallest" character, so it appears quite often).
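As an illustration (not any particular test suite), here is a minimal hand-rolled sketch of the idea in C: generate random byte strings biased toward NUL and check a round-trip property against the file system, one of the "generic plumbing" layers mentioned above. Real frameworks such as Hypothesis (Python) or theft (C) add shrinking, which is omitted here for brevity.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Round-trip property: whatever bytes we write, we must read back verbatim. */
    static int roundtrip_ok(const unsigned char *data, size_t len) {
        FILE *f = tmpfile();
        if (!f) return 0;
        fwrite(data, 1, len, f);
        rewind(f);
        unsigned char back[256];
        size_t got = fread(back, 1, sizeof back, f);
        fclose(f);
        return got == len && memcmp(back, data, len) == 0;
    }

    int main(void) {
        srand(42);                             /* fixed seed for reproducibility */
        for (int i = 0; i < 1000; i++) {
            unsigned char buf[256];
            size_t len = (size_t)(rand() % 64);
            for (size_t j = 0; j < len; j++) {
                /* Bias toward "small" bytes so NUL shows up often,
                   mirroring how shrinking gravitates toward it. */
                buf[j] = (unsigned char)(rand() % 8 ? rand() % 4 : rand() % 256);
            }
            if (!roundtrip_ok(buf, len)) {
                fprintf(stderr, "counterexample of length %zu\n", len);
                return 1;
            }
        }
        puts("1000 random round-trips passed");
        return 0;
    }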
The database throwing an exception can be unexpected, but it is usually safe in the sense that it doesn't create many exploit opportunities. Allowing null bytes in strings is far worse; it is well known to be exploitable in numerous situations and historically has been. For example, lots of ASN.1 and X.509 certificate exploits have been caused by allowing null bytes in strings there.
Those issues aren't generally due to NUL bytes in strings, but rather to treating UTF-8 strings as C `char*` (or other mistreatment of "compatible" encodings). C code cannot handle UTF-8 strings as `char*`, even if you've got a UTF-8 locale (eg `LANG=en_US.UTF-8`). You have to convert multibyte strings to wide-character strings (eg using `mbrtowc` to get a `wchar_t*`) and then operate on those. C has multiple string types, no good type safety for them, and confusing them leads to vulnerabilities.
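For illustration, a minimal sketch of that approach, assuming a UTF-8 locale named en_US.UTF-8 is installed: decode a length-delimited buffer with `mbrtowc`, which keeps working across an embedded NUL because it never relies on a terminator.

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void) {
        /* Assumes this locale exists; adjust the name for your system. */
        if (!setlocale(LC_ALL, "en_US.UTF-8")) {
            fputs("UTF-8 locale not available\n", stderr);
            return 1;
        }

        /* A length-delimited UTF-8 buffer with an embedded NUL:
           "héllo", NUL, "wörld". */
        const char buf[] = "h\xc3\xa9llo\0w\xc3\xb6rld";
        size_t len = sizeof buf - 1;   /* 13 bytes; strlen() would claim 6 */

        mbstate_t st = {0};
        for (size_t i = 0; i < len; ) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, buf + i, len - i, &st);
            if (n == (size_t)-1 || n == (size_t)-2)  /* invalid or truncated */
                return 1;
            printf("U+%04lX\n", (unsigned long)wc);
            i += (n == 0) ? 1 : n;     /* a return of 0 means the NUL itself */
        }
        return 0;
    }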
AFAIK (just browsed https://unicode.org/main.html for a while to check), UTF-8 doesn’t proscribe that ‘a’ is a valid character either (or just as much as it proscribes that 0x00 is one). Yet, saying your code supports UTF-8, but rejecting ‘a’ is not a good idea.
And yes, they do wreak havoc, but no more than ASCII strings containing null characters.
Not to mention “inflammable”, which means the same as “flammable”. Because the former is so easily misunderstood – and has been – warning labels now use the latter.
Try "inter-" and "intra-", in a Boston accent. It makes a big difference if your analysis is interprocedural or intraprocedural, but they both sound the same in that accent...