Hacker News

What's the argument for allowing null bytes in text? I think this wasn't argued for well enough in the article.

I see a big risk of opening yourself up to all kinds of bugs with this, for no apparent benefit. The article mentions null byte injection for example.



The same as the argument for allowing BEL, DEL, BOM, and any other control codes or special characters: implementing a standard as described, with minimal divergence or unexpected behaviour, to ensure compatibility with arbitrary external software.


I think the argument was just “text is incredibly complex with tons of obscure rules once you accept not-English-Latin text, so it’s best not to assume anything and just go by the standard.”


While I agree with you that this is one of the arguments the author puts forward, it does little to convince me.

I can support that line of thought if we want to talk, for instance, about why Rust doesn't provide a "proper" iteration over characters [1] in their standard library. But in this case the null character is an artificial construct that does not naturally exist in any language on Earth.

I sincerely doubt someone will ever find a naturally occurring "John \0 Doe". What I don't doubt is that such a name occurs in a badly-migrated database somewhere, but that's a different discussion.

[1] https://doc.rust-lang.org/std/primitive.str.html#method.char...
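To make the point concrete, here's a small Python sketch (the "John \0 Doe" value is the hypothetical example from above): modern string types happily hold a NUL, it's just that no real name contains one.

```python
# U+0000 is a perfectly valid Unicode code point, even though no
# natural-language text would ever contain it.
name = "John \0 Doe"
print(len(name))     # 10 -- the NUL counts as one character
print("\0" in name)  # True
```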


Actually I can think of one place I know I've seen explicit null characters in text, and that's a Facebook Business Record pdf. If you download your FB data the text instructions of all the PDF raw contents streams actually have a null character between EVERY SINGLE CHARACTER!

I found it extremely annoying at first because I was trying to copy/paste the stream chunks around and it wouldn't copy anything after the first null. Then I realized this was probably a security hack in the hopes that people couldn't copy the data around (I can't think of any other reason to add these nulls like this otherwise). Funny enough, I opened the PDF in Chrome and copy/paste of the selected text works fine. So clearly some readers strip these bad characters, but I can imagine others might not.


> the PDF raw contents streams actually have a null character between EVERY SINGLE CHARACTER

That sounds more like you're looking at UTF-16 data and trying to interpret it as ASCII.


It's only the text instructions that have this, not the rest of the content. I.e. one line of the content looks like this, where it's trying to write the text 'Service':

  BT 0 Tr 0.000000 w ET BT 44.814370 775.487087 Td [(\0S\0e\0r\0v\0i\0c\0e)] TJ ET


This reflects the fact that PDF uses the UTF-16BE encoding form for Unicode text, not UTF-8.

One oddity is that the PDF spec's description of the "Text String Type", e.g. at p.158 in https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdf_ref..., appears to say there should be a leading BOM (U+FEFF) in the string here, but this doesn't seem to be the case in reality. Indeed, adding one may cause issues for at least some PDF readers, according to discussion at https://github.com/tesseract-ocr/tesseract/issues/1150.

But anyhow, in short: that's simply UTF-16BE text being represented as a series of bytes. It's nothing to do with any kind of "security hack", and the null bytes are not "bad characters", they're the high byte of each UTF-16 code unit.
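A quick Python check makes this obvious: encoding ASCII-range text as UTF-16BE produces exactly the "\0 before every character" pattern seen in the content stream.

```python
# For characters in the ASCII range, UTF-16BE emits a 0x00 high byte
# followed by the character's low byte -- the pattern from the PDF.
encoded = "Service".encode("utf-16-be")
print(encoded)  # b'\x00S\x00e\x00r\x00v\x00i\x00c\x00e'
```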


And this is a perfect example of why text should just be treated as blobs that can be displayed via OS functions. General application developers shouldn't have to be experts in every flavour of Unicode encoding just to display some writing.


The arguments in the text:

- Valid Unicode strings may contain null-byte characters.

- Valid JSON files may carry null bytes as part of important and meaningful data.


As weird and inefficient as it may seem at first glance, JSON in a database can have a variety of uses, so I certainly see the benefit.

But… is there any use for a null byte explicitly in a JSON string? The de facto standard (easy to use and instantly recognisable) for blobs seems to be base64. I can’t think of any meaningful data benefits other than data efficiency (which is easily mitigated by storing the blob directly in the database).
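For reference, here's a minimal sketch of that de facto convention (the `"payload"` key is just an illustrative name): base64 keeps the blob inside a plain ASCII JSON string, null bytes and all.

```python
import base64
import json

# Hypothetical blob containing a null byte; base64 makes it safe to
# carry inside any JSON string.
blob = bytes([0, 1, 2, 255])
doc = json.dumps({"payload": base64.b64encode(blob).decode("ascii")})
restored = base64.b64decode(json.loads(doc)["payload"])
assert restored == blob
```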


JSON strings must be valid Unicode text (UTF-8 on the wire, per RFC 8259). Storing plain non-base64 blobs in them is abuse of broken JSON parsers.


That second one is probably the more important one. A null byte in plain text is probably an error. A null byte in a json string may be intentional if perhaps questionable. JSON strings are frequently abused for various transport formats.


If I were responsible for ingesting and returning JSON strings from a server, my attitude towards people who require nulls in them would be rather juvenile and brash - "lmao fuck em".


JSON carrying important and meaningful data in strings (i.e. blobs in strings) is usually just an exploit on the non-conformance of parsers. JSON strings are defined by ECMA-404 to be sequences of Unicode code points. Arbitrary binary data isn't a valid sequence of Unicode code points. However, that's what they're often used for, incorrectly.

If you use JSON correctly, a JSON string is really just a Unicode string. Leaving out the null bytes there would be annoying, yes, but usually doesn't hurt the use as a string...


A UTF-8 string can perfectly well contain the character U+0000, which will be encoded as a null byte 0x00.

So just "leaving out the null bytes" amounts to changing the string to a different string. Sometimes that may be what you want, but it doesn't sound like a good idea in general.
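A short Python illustration of both points: U+0000 encodes to the single byte 0x00 in UTF-8, and JSON can carry it via the `\u0000` escape, so stripping it really does change the string.

```python
import json

# U+0000 is a legal code point: one byte in UTF-8, representable in
# JSON with the \u0000 escape.
s = "a\u0000b"
assert s.encode("utf-8") == b"a\x00b"
assert json.loads(json.dumps(s)) == s

# "Leaving out" the null produces a different string.
assert s.replace("\u0000", "") != s
```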


That was a bit of a rabbit hole. As it turns out, neither the ASCII NUL character nor its Unicode equivalent has the semantic meaning "end of stream". ASCII does have control characters for "end of stream", in fact multiple: 0x03 End of Text (ETX), 0x04 End of Transmission (EOT), and 0x17 End Transmission Block (ETB). But 0x00 was never intended to be an end marker of any kind.

It's really only the C standard that (ab)uses NUL as an end-of-string marker. So yes, after consideration, I'm starting to agree with you. U+0000 is a valid character and should be allowed in text fields.
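A sketch of the C convention's failure mode (simulated in Python rather than actual C): any routine that scans to the first 0x00 byte silently truncates everything after it.

```python
# Simulating what strlen()-style code sees: data up to the first NUL.
data = b"John \x00 Doe"
c_view = data.split(b"\x00", 1)[0]  # the C-string interpretation
assert c_view == b"John "           # " Doe" is silently lost
assert len(data) == 10              # the real payload is longer
```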


Another reason mentioned is consistency with other RDBMS.


There are lots of programming languages where the "string" type is fine with nulls anywhere in the "string". And other databases allow it. Perhaps they shouldn't, but it's a sort of de facto standard.


It helps pressure other people not to use null-terminated string routines, which are also prone to buffer overflows.



