So while I like the idea of asn-like multiple encoding standards, there are some fragments in the description that make me just see potential issues:
> Support for the NUL character (u+0000) is implementation defined due to potential issues with null string delimiters in some languages and platforms. It should be avoided in general for safety and portability. Support for NUL must not be assumed.
I believe it's easier to force awkward languages to support the NUL character than to deal with the chaos of not knowing whether it's supported or not. There are other places where they say "this is hard, we'll let implementations not support it", and it's a ticking time bomb.
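A minimal Python sketch of the kind of time bomb I mean (nothing specific to Concise Encoding, just the general failure mode):

```python
import ctypes

# An embedded NUL is perfectly fine in languages with length-prefixed strings...
s = "abc\x00def"
assert len(s) == 7

# ...but it's silently truncated the moment the value crosses a C-string
# boundary, which is exactly where an implementation might quietly drop support.
truncated = ctypes.create_string_buffer(s.encode()).value.decode()
assert truncated == "abc"   # data loss you only discover at runtime
```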
Also it doesn't look like you can deduplicate map keys:
> A reference must not be used as a map key.
Which means as much repetition and wasted space as in JSON.
I'm still on the fence with NUL to be honest... I may remove it, but for now I want to garner discussion.
I'm not sure how disallowing a reference for a map key would cause repetition and space wasting. Keys are generally small, so a reference would in many cases end up taking a similar amount of space. Can you talk more about allowing references as keys? I need different points of view before I finalize v1 of the format. The downside is that it complicates processing, since you could reference a non-keyable value and attempt to use it as a key.
Everybody PLEASE chime in with your thoughts on the format, because I want to make sure I get it as close to right as I can before I finalize v1!
Start there. This might sound flippant and snarky but I promise you that it isn't. Don't ever leave anything to the implementations.
Oh, and build your own compliance tool. So people can just run the results of their own implementations against your tool and see if their code is compliant with your data format or not.
Not the GP, but regarding no references for map keys: I can imagine this being an issue when representing a list of maps with uniform keys, e.g. a list of 1000 points with the keys “latitude” and “longitude”.
I can think of various more compact ways of structuring the same data, but JSON + gzip encodes the naive structure in a very compact way with very little programmer effort.
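For what it's worth, a quick Python sketch of that comparison (sizes will vary a bit from run to run):

```python
import gzip
import json
import random

# The naive structure: 1000 points as a list of maps with uniform keys.
points = [{"latitude": random.uniform(-90, 90),
           "longitude": random.uniform(-180, 180)} for _ in range(1000)]

naive = json.dumps(points).encode()
packed = gzip.compress(naive)
print(f"naive JSON: {len(naive)} bytes, gzipped: {len(packed)} bytes")
# The repeated "latitude"/"longitude" keys dominate the naive size,
# but they compress away almost entirely.
```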
In general, the design seems to miss a compact representation of an array of homogeneous records, which is a common use case IME.
Edit: Also I agree with GP regarding NUL and other comments about no optional stuff.
Hmm yes I see how that could be useful as a kind of "pointer-to-key" setup. A key that is a ref to a non-keyable type would be invalid anyway...
I've added array types for arrays of common fixed-length types, but for records it gets complicated, since it can only really work if all records are the same size. I could maybe expand array support to custom types for specialized applications that have a lot of record types (provided the custom record type is fixed size). But the primary purpose of the format is to allow disparate apps to read each other's data in a machine- and human-friendly way without requiring a bunch of extra pieces. A secondary concern is not trying to be everything for everyone, and not complicating the format for too small a gain. But that's the trick, isn't it? (where to draw the line)
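To illustrate why fixed size matters, here's a rough sketch (plain Python struct packing, not the actual CBE encoding):

```python
import struct

# A fixed-size record type is what makes a packed "array of records" possible:
# every entry occupies the same number of bytes, so the n-th record sits at
# offset n * record_size and no per-record framing is needed.
POINT = struct.Struct("<dd")  # latitude, longitude as little-endian float64

def pack_points(points):
    """points: iterable of (lat, lon) tuples -> one contiguous byte string."""
    return b"".join(POINT.pack(lat, lon) for lat, lon in points)

def unpack_points(buf):
    return [POINT.unpack_from(buf, off) for off in range(0, len(buf), POINT.size)]
```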
I was thinking of use cases like large, slightly heterogeneous data dumps, or structured logs. Let's say you have a few million entries, mostly with the same fields, but with enough differences that using a header + array of arrays is also not great.
But! This is totally a specific use case and you don't need to try make everyone happy. There's GELF and others for structured logs already.
Optionality is a bigger problem than people realize: these special cases force “conditional pollution”, and when the optionality is implemented with exceptions, you now have to understand non-local control flow.
Make everyone’s lives easier and make a spec where everything is mandatory.
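A tiny Python sketch (hypothetical names) of what that conditional pollution looks like:

```python
# Once a feature is optional, every consumer grows a branch (or an exception
# handler) for "is this supported here?", and callers far away have to
# understand that control flow.
class FeatureNotSupported(Exception):
    pass

def decode_string(raw: str, supports_nul: bool) -> str:
    if "\x00" in raw and not supports_nul:
        raise FeatureNotSupported("embedded NUL not supported by this decoder")
    return raw
```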
I think the best way to deal with this is explicit forking / special-purpose subspec. Where again, in that spec, things are different / extended, but still mandatory. And if consumers fall into that usecase, they can explicitly advertise / consume that version.
The basic premise (The friendly data format for human & machine) instantly reminded me of Red language [1], which, by the virtue of homoiconicity, is also its own data format (both textual and binary [2]) with 50 datatypes, many with several unique literal forms:
If you have ideas or comments on the design of the format, please let me know! I've been working on this for 3 years and am getting close to release, but there are always blind spots when working alone.
The basic premise is that you can never marry "easy to edit" with "efficient to process". They pull in opposing directions, so the only alternative is to have two 100% compatible formats: one in text and one in binary.
I've chosen the data types carefully to support the most common types that come up in real world situations (and have included custom type support for data not intended for public consumption). The compromises made should allow most people to work without adding extra encoding on top of the encoding (base64, special text parsing and such).
The reference implementation is working for all common cases, and mostly there for array types. Once that's finished, I'll start on the schema design.
In a project like this I look for a "related work" section (that shows you know the context and are at least not reinventing [square] wheels and have some reason why your thing is different/better than all the other things that do what your thing does already) and a formal grammar (that I can feed to a parser generator etc.).
Are you aware of EDN, Transit or Fressian, which IMO improve a lot on the currently common approaches to encoding (JSON / YAML)? If so, how does Concise Encoding differentiate itself?
Yes, I've looked at EDN before. It's solving a similar problem, but not the same one. EDN is focused on extensibility, whereas Concise Encoding is focused on concision and binary-text compatibility.
Transit is a bridging format to other encoding formats, which is very cool, but a fairly different problem space.
Fressian is pretty close, but is binary only. The main purpose of Concise Encoding is binary-text compatible formats.
You might spend a little more time saying what makes your approach unique:
"Use text based formats that are bloated and slow, or use binary formats that humans can't read. Wouldn't it be nice to have the benefits of both, and none of the drawbacks?"
NeXT/Apple property lists have had that feature for decades now, with multiple text and binary serialization formats.[1]
Infra [2] also makes very similar claims "Existing metaformats fit neatly into two categories. They are either textual for human-readability (such as XML and JSON) or binary for compact serialization (such as Thrift and Protocol Buffers). Infra can play the role of either, imbuing each with the desirable properties of the other."
Property lists came very close, but still suffer from a number of issues:
- The binary format is not efficiently packed. For example, Concise Encoding uses 200 of the 256 codes to directly encode integers (-100 to 100) since they are the most common data values in the wild (along with short strings, which also have their own special encoding).
- There's no array type (as in contiguous objects of the same type and size)
- Dates are in seconds, not Gregorian fields, and have no time zones.
- Binary data in the text format is base64 encoded, ensuring that it's 100% unreadable in its raw form.
- Container types are prepended with a length field, making progressive construction impossible.
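To illustrate the last point, here's a rough Python sketch (hypothetical markers, not the actual plist or CBE wire formats) of why a length prefix blocks progressive construction:

```python
# A length-prefixed container forces the writer to know the element count up
# front, so nothing can be emitted until the whole container is materialized.
def write_length_prefixed(out, items):
    items = list(items)                        # must buffer everything first
    out.write(len(items).to_bytes(4, "big"))
    for item in items:
        out.write(item)

# A terminated container can be written progressively, element by element.
END_MARKER = b"\xff"                           # hypothetical end-of-container code
def write_terminated(out, items):
    for item in items:                         # items may be a generator
        out.write(item)
    out.write(END_MARKER)
```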
Infra looks interesting, but it strays from editable text into a kind of hybrid format that requires a specialized editor to read and write. This would only work if such editors became ubiquitous, which is unlikely. I ran down this path for a while as well, but finally decided against it.
How does this address format creep? Why is it going to be better to support all the existing formats and now this one too? Does it really add that much?
The first formats filled a void. This format joins many players already on the field.
It's versioned, so if we come up with significant improvements later, we can bump the version.
This format fills two voids:
1. The lack of native types in most of the other formats
2. The lack of either editability or processing efficiency, depending on whether it's a binary or text format.
I developed this format because I'm tired of putting encodings on top of encodings (i.e. base64) and other such tricks, just to get my data across to the other side, and I want a general purpose data format that's efficient as well as readable.
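This is the kind of workaround I mean (a Python sketch with a made-up "thumbnail" field):

```python
import base64
import json

# The "encoding on top of an encoding" trick: binary data has to be
# base64-wrapped just to survive a text format that has no bytes type.
blob = bytes(range(16))
wrapped = json.dumps({"thumbnail": base64.b64encode(blob).decode("ascii")})
unwrapped = base64.b64decode(json.loads(wrapped)["thumbnail"])
assert unwrapped == blob
```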
Yes, and because of that it suffers the same downsides as JSON. You also can't just transparently convert CBOR to JSON because of the data type disparity.
I think it's interesting that they included a markup type. Based on the spec it's "similar to XML", so I suppose this markup is of their own making.
It doesn't seem like this would make transmitting XML or HTML significantly more efficient compared to just gzipping a string. And if you wanted to use this, you'd have to write the conversion first.
Can someone help me understand the use case for this, i.e. in which situations would this be super useful?
The idea is to allow presentation data to be included inside of general data, and also to allow presentation data to contain any kind of general data (not just string keys and string values). At some point in the distant future, it would also be nice to have official numeric tag assignments in schemas (e.g. 1 = "View", 2 = "TextBox", etc) to further improve processing by eliminating the need for text parsing just to decide what kind of object to present. These would be transparently converted from the text-based names in the human-readable text format when converting to the machine-read binary format.
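Something like this (a Python sketch with hypothetical tag numbers, just to show the shape of it):

```python
# A schema maps markup element names to small integers, so the binary form
# never needs to carry (or parse) the text names at all.
TAGS = {"View": 1, "TextBox": 2}
NAMES = {v: k for k, v in TAGS.items()}

def name_to_tag(name: str) -> int:      # used when converting text -> binary
    return TAGS[name]

def tag_to_name(tag: int) -> str:       # used when converting binary -> text
    return NAMES[tag]
```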
The overarching purpose of Concise Encoding is to bring forth the tools to make transmitted data accessible to humans, and at the same time efficient to process by machines.
I really like the inclusion of hexfloats (beware of C++ iostream bugs, though), because they're wayyyyy easier to parse correctly than decimal floats (way faster, too). As an added bonus they're reasonably easy to read for developers, too, since they're just scientific notation in hex.
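For anyone who hasn't used them, Python's float type shows the round trip nicely:

```python
# Hex floats: exact and trivially round-trippable, no decimal rounding to get wrong.
x = 0.1
h = x.hex()                       # '0x1.999999999999ap-4'
assert float.fromhex(h) == x      # exact round trip
```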
Native support for complex types like URLs and UUIDs is a mistake. It just makes it more complicated and harder to interoperate with other formats like JSON. Look at all the archaic types ASN.1 supports.
URIs and UUIDs are never going away. We're always going to need a way to lookup resources, and we're always going to need unique identifiers. To not include them is a mistake, because you then have to add a (potentially incompatible) extra layer of encoding at the application level just to store them in the encoding format.
URL parsing is a nightmare. There are lots of RFCs, but the real world is full of quirks. This will lead to edge-case differences between implementations.
Also, which URI standard is part of the spec, and what happens if that standard evolves? It is a complex standard with multiple RFCs and a long history.
What about possible differing length restrictions in language-native URI types?
I have no stake in this though, feel free to ignore. I tend to see negatives first.
Yes, it's true that URLs are a nightmare, but we don't have anything better (yet). Once we do, I'll happily release v2 of the spec. For now, it follows the RFC.
TBH at this point it doesn't really matter what the URI specifications say. Somehow we're able to stuff our URLs into our browsers, web pages, package managers, REST APIs, mail clients etc and manage to get it working. It's useful enough that everyone uses it, so I'd be a fool to get stuck on the "correctness" of their specs. CTE uses the double-quote as a delimiter, so as long as the contents are percent-escaped for double-quote and your language's URL parser says "cool", it's acceptable.
There's nothing stopping someone from putting a URL validator into their codec, but it's probably a lot less work to just pass it to your runtime library's URL parser.
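For example, in Python that's just:

```python
from urllib.parse import urlsplit

# What "pass it to your runtime library's URL parser" looks like; the codec
# itself only needs to handle the quoting of its own delimiter.
parts = urlsplit("https://example.com/search?q=%22quoted%22")
assert parts.scheme == "https"
assert parts.netloc == "example.com"
assert parts.query == "q=%22quoted%22"
```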
I disagree. Just because a common data format doesn't support those types doesn't mean other formats have to make the same mistake (and given that JSON lacks comments and an official formal spec, its design is questionable to begin with). If interoperability is needed with JSON or other formats that don't support dates or UUIDs, such data types can be serialized as strings.
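Concretely (a Python sketch, with JSON as the lowest common denominator):

```python
import json
import uuid

# If the other format has no UUID type, degrade to a string representation
# at that boundary; the richer format can keep the native type.
u = uuid.uuid4()
payload = json.dumps({"id": str(u)})
restored = uuid.UUID(json.loads(payload)["id"])
assert restored == u
```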
Although it uses text values for delimiters, Bencode isn't designed to be edited in text mode. If you want to transmit binary data that can be efficiently read by a machine, you cannot also make it human readable/editable - they have opposing goals. The only way to get both is to have a twin format (binary and text) that is 1:1 compatible and transparently convertible.