Hacker News

What's the argument for allowing null bytes in text? I think this wasn't argued for well enough in the article.

I see a big risk of opening yourself up to all kinds of bugs with this, for no apparent benefit. The article mentions null byte injection for example.



The same as the argument for allowing BEL, DEL, BOM, and any other control codes or special characters: implementing a standard as described, with minimal divergence or unexpected behaviour, to ensure compatibility with arbitrary external software.


I think the argument was just “text is incredibly complex with tons of obscure rules once you accept not-English-Latin text, so it’s best not to assume anything and just go by the standard.”


While I agree with you that this is one of the arguments the author puts forward, it does little to convince me.

I can support that line of thought if we want to talk, for instance, about why Rust doesn't provide a "proper" iteration over characters [1] in their standard library. But in this case the null character is an artificial construct that does not naturally exist in any language on Earth.

I sincerely doubt someone will ever find a naturally occurring "John \0 Doe". What I don't doubt is that such a name occurs in a badly-migrated database somewhere, but that's a different discussion.

[1] https://doc.rust-lang.org/std/primitive.str.html#method.char...
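To make the point concrete, here's a small Python sketch (the "John \0 Doe" value is the hypothetical example from above): modern string types happily hold a NUL, it's just that no real name contains one.

```python
# U+0000 is a perfectly valid Unicode code point, even though no
# natural-language text would ever contain it.
name = "John \0 Doe"
print(len(name))     # 10 -- the NUL counts as one character
print("\0" in name)  # True
```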


Actually I can think of one place I know I've seen explicit null characters in text, and that's a Facebook Business Record pdf. If you download your FB data the text instructions of all the PDF raw contents streams actually have a null character between EVERY SINGLE CHARACTER!

I found it extremely annoying at first because I was trying to copy/paste the stream chunks around and it wouldn't copy anything after the first null. Then I realized this was probably a security hack in the hopes that people couldn't copy the data around (I can't think of any other reason to add these nulls like this otherwise). Funny enough, I opened the PDF in Chrome and copy/paste of the selected text works fine. So clearly some readers strip these bad characters, but I can imagine others might not.


> the PDF raw contents streams actually have a null character between EVERY SINGLE CHARACTER

That sounds more like you're looking at UTF-16 data and trying to interpret it as ASCII.


It's only the text instructions that have this, not the rest of the content. I.e. one line of the content looks like this, where it's trying to write the text 'Service':

  BT 0 Tr 0.000000 w ET BT 44.814370 775.487087 Td [(\0S\0e\0r\0v\0i\0c\0e)] TJ ET


This reflects the fact that PDF uses the UTF-16BE encoding form for Unicode text, not UTF-8.

One oddity is that the PDF spec's description of the "Text String Type", e.g. at p.158 in https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdf_ref..., appears to say there should be a leading BOM (U+FEFF) in the string here, but this doesn't seem to be the case in reality. Indeed, adding one may cause issues for at least some PDF readers, according to discussion at https://github.com/tesseract-ocr/tesseract/issues/1150.

But anyhow, in short: that's simply UTF-16BE text being represented as a series of bytes. It's nothing to do with any kind of "security hack", and the null bytes are not "bad characters", they're the high byte of each UTF-16 code unit.
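A quick Python check makes this obvious: encoding ASCII-range text as UTF-16BE produces exactly the "\0 before every character" pattern seen in the content stream.

```python
# For characters in the ASCII range, UTF-16BE emits a 0x00 high byte
# followed by the character's low byte -- the pattern from the PDF.
encoded = "Service".encode("utf-16-be")
print(encoded)  # b'\x00S\x00e\x00r\x00v\x00i\x00c\x00e'
```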


And this is a perfect example of why text should just be treated as blobs that can be displayed via OS functions. General application developers shouldn't have to be experts in every flavour of Unicode encoding just to display some writing.


The arguments in the text:

- Valid Unicode strings may contain null-byte characters.

- Valid JSON files may carry null bytes as part of important and meaningful data.


As weird and inefficient as it may seem at first glance, JSON in a database can have a variety of uses, so I certainly see the benefit.

But… is there any use for a null byte explicitly in a JSON string? The de facto standard (easy to use and instantly recognisable) for blobs seems to be base64. I can’t think of any meaningful data benefits other than data efficiency (which is easily mitigated by storing the blob directly in the database).
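For reference, here's a minimal sketch of that de facto convention (the `"payload"` key is just an illustrative name): base64 keeps the blob inside a plain ASCII JSON string, null bytes and all.

```python
import base64
import json

# Hypothetical blob containing a null byte; base64 makes it safe to
# carry inside any JSON string.
blob = bytes([0, 1, 2, 255])
doc = json.dumps({"payload": base64.b64encode(blob).decode("ascii")})
restored = base64.b64decode(json.loads(doc)["payload"])
assert restored == blob
```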


JSON strings must be valid Unicode text (UTF-8 on the wire, per RFC 8259). Storing plain non-base64 blobs in them is abuse of broken JSON parsers.


That second one is probably the more important one. A null byte in plain text is probably an error. A null byte in a json string may be intentional if perhaps questionable. JSON strings are frequently abused for various transport formats.


If I were responsible for ingesting and returning JSON strings from a server, my attitude towards people who require nulls in them would be rather juvenile and brash - "lmao fuck em".


JSON carrying important and meaningful data in strings (i.e. blobs in strings) is usually just an exploit on the non-conformance of parsers. JSON strings are defined by ECMA-404 to be sequences of Unicode code points. Arbitrary binary data isn't a valid sequence of Unicode code points. However, that's what they're often used for, incorrectly.

If you use JSON correctly, a JSON string is really just a Unicode string. Leaving out the null bytes there would be annoying, yes, but usually doesn't hurt the use as a string...


A UTF-8 string can perfectly well contain the character U+0000, which will be encoded as a null byte 0x00.

So just "leaving out the null bytes" amounts to changing the string to a different string. Sometimes that may be what you want, but it doesn't sound like a good idea in general.
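A short Python illustration of both points: U+0000 encodes to the single byte 0x00 in UTF-8, and JSON can carry it via the `\u0000` escape, so stripping it really does change the string.

```python
import json

# U+0000 is a legal code point: one byte in UTF-8, representable in
# JSON with the \u0000 escape.
s = "a\u0000b"
assert s.encode("utf-8") == b"a\x00b"
assert json.loads(json.dumps(s)) == s

# "Leaving out" the null produces a different string.
assert s.replace("\u0000", "") != s
```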


That was a bit of a rabbit hole. As it turns out, neither the ASCII NUL character nor its Unicode equivalent has the semantic meaning "end of stream". ASCII does have control characters for "end of stream", in fact multiple: 0x03 End of Text (ETX), 0x04 End of Transmission (EOT), and 0x17 End Transmission Block (ETB). But 0x00 was never intended to be an end marker of any kind.

It's really only the C standard that (ab)uses NUL as an end-of-string marker. So yes, after consideration, I'm starting to agree with you. U+0000 is a valid character and should be allowed in text fields.
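A sketch of the C convention's failure mode (simulated in Python rather than actual C): any routine that scans to the first 0x00 byte silently truncates everything after it.

```python
# Simulating what strlen()-style code sees: data up to the first NUL.
data = b"John \x00 Doe"
c_view = data.split(b"\x00", 1)[0]  # the C-string interpretation
assert c_view == b"John "           # " Doe" is silently lost
assert len(data) == 10              # the real payload is longer
```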


Another reason mentioned is consistency with other RDBMS.


There are lots of programming languages where the "string" type is fine with nulls anywhere in the "string". And other databases allow it. Perhaps they shouldn't, but it's a sort of de facto standard.


It helps pressure other people not to use null-terminated string routines, which are also prone to buffer overflows.



