So while I like the idea of asn-like multiple encoding standards, there are some fragments in the description that make me just see potential issues:
> Support for the NUL character (u+0000) is implementation defined due to potential issues with null string delimiters in some languages and platforms. It should be avoided in general for safety and portability. Support for NUL must not be assumed.
I believe it's easier to force awkward languages to support the NUL character than to deal with the chaos of not knowing whether it's supported or not. There are other places where they say "this is hard, we'll let implementations not support it", and it's a ticking time bomb.
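A minimal Python sketch of the kind of time bomb I mean (nothing specific to Concise Encoding, just the general failure mode):

```python
import ctypes

# An embedded NUL is perfectly fine in languages with length-prefixed strings...
s = "abc\x00def"
assert len(s) == 7

# ...but it's silently truncated the moment the value crosses a C-string
# boundary, which is exactly where an implementation might quietly drop support.
truncated = ctypes.create_string_buffer(s.encode()).value.decode()
assert truncated == "abc"   # data loss you only discover at runtime
```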
Also it doesn't look like you can deduplicate map keys:
> A reference must not be used as a map key.
Which means as much repetition and wasted space as in JSON.
I'm still on the fence with NUL to be honest... I may remove it, but for now I want to garner discussion.
I'm not sure how disallowing a reference for a map key would cause repetition and space wasting. Keys are generally small, so a reference would in many cases end up taking a similar amount of space. Can you talk more about allowing references as keys? I need different points of view before I finalize v1 of the format. The downside is that it complicates processing, since you could reference a non-keyable value and attempt to use it as a key.
Everybody PLEASE chime in with your thoughts on the format, because I want to make sure I get it as close to right as I can before I finalize v1!
Start there. This might sound flippant and snarky but I promise you that it isn't. Don't ever leave anything to the implementations.
Oh, and build your own compliance tool. So people can just run the results of their own implementations against your tool and see if their code is compliant with your data format or not.
Not the GP, but regarding no references for map keys: I can imagine this being an issue when representing a list of maps with uniform keys, e.g. a list of 1000 points with the keys “latitude” and “longitude”.
I can think of various more compact ways of structuring the same data, but JSON + gzip encodes the naive structure in a very compact way with very little programmer effort.
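For what it's worth, a quick Python sketch of that comparison (sizes will vary a bit from run to run):

```python
import gzip
import json
import random

# The naive structure: 1000 points as a list of maps with uniform keys.
points = [{"latitude": random.uniform(-90, 90),
           "longitude": random.uniform(-180, 180)} for _ in range(1000)]

naive = json.dumps(points).encode()
packed = gzip.compress(naive)
print(f"naive JSON: {len(naive)} bytes, gzipped: {len(packed)} bytes")
# The repeated "latitude"/"longitude" keys dominate the naive size,
# but they compress away almost entirely.
```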
In general, the design seems to miss a compact representation of an array of homogeneous records, which is a common use case IME.
Edit: Also I agree with GP regarding NUL and other comments about no optional stuff.
Hmm yes I see how that could be useful as a kind of "pointer-to-key" setup. A key that is a ref to a non-keyable type would be invalid anyway...
I've added array types for arrays of common fixed-length types, but for records it gets complicated, since it can only really work if all records are the same size. I could maybe expand array support to custom types for specialized applications that have a lot of record types (provided the custom record type is fixed size). But the primary purpose of the format is to allow disparate apps to read each other's data in a machine- and human-friendly way without requiring a bunch of extra pieces. A secondary concern is not trying to be everything for everyone, and not complicating the format for too small a gain. But that's the trick, isn't it? (where to draw the line)
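To illustrate why fixed size matters, here's a rough sketch (plain Python struct packing, not the actual CBE encoding):

```python
import struct

# A fixed-size record type is what makes a packed "array of records" possible:
# every entry occupies the same number of bytes, so the n-th record sits at
# offset n * record_size and no per-record framing is needed.
POINT = struct.Struct("<dd")  # latitude, longitude as little-endian float64

def pack_points(points):
    """points: iterable of (lat, lon) tuples -> one contiguous byte string."""
    return b"".join(POINT.pack(lat, lon) for lat, lon in points)

def unpack_points(buf):
    return [POINT.unpack_from(buf, off) for off in range(0, len(buf), POINT.size)]
```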
I was thinking of use cases like large, slightly heterogeneous data dumps, or structured logs. Let's say you have a few million entries, mostly with the same fields, but with enough differences that using a header + array of arrays is also not great.
But! This is totally a specific use case and you don't need to try make everyone happy. There's GELF and others for structured logs already.
Optionality is a bigger problem than people realize: these special cases force “conditional pollution”, and when the optionality is implemented with exceptions, you now have to understand non-local control flow.
Make everyone’s lives easier and make a spec where everything is mandatory.
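A tiny Python sketch (hypothetical names) of what that conditional pollution looks like:

```python
# Once a feature is optional, every consumer grows a branch (or an exception
# handler) for "is this supported here?", and callers far away have to
# understand that control flow.
class FeatureNotSupported(Exception):
    pass

def decode_string(raw: str, supports_nul: bool) -> str:
    if "\x00" in raw and not supports_nul:
        raise FeatureNotSupported("embedded NUL not supported by this decoder")
    return raw
```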
I think the best way to deal with this is explicit forking / special-purpose subspec. Where again, in that spec, things are different / extended, but still mandatory. And if consumers fall into that usecase, they can explicitly advertise / consume that version.
The basic premise (The friendly data format for human & machine) instantly reminded me of Red language [1], which, by the virtue of homoiconicity, is also its own data format (both textual and binary [2]) with 50 datatypes, many with several unique literal forms:
If you have ideas or comments on the design of the format, please let me know! I've been working on this for 3 years and am getting close to release, but there are always blind spots when working alone.
The basic premise is that you can never marry "easy to edit" with "efficient to process". They pull in opposing directions, so the only alternative is to have two 100% compatible formats: one in text and one in binary.
I've chosen the data types carefully to support the most common types that come up in real world situations (and have included custom type support for data not intended for public consumption). The compromises made should allow most people to work without adding extra encoding on top of the encoding (base64, special text parsing and such).
The reference implementation is working for all common cases, and mostly there for array types. Once that's finished, I'll start on the schema design.
In a project like this I look for a "related work" section (that shows you know the context and are at least not reinventing [square] wheels and have some reason why your thing is different/better than all the other things that do what your thing does already) and a formal grammar (that I can feed to a parser generator etc.).
Are you aware of EDN, Transit or Fressian, which IMO improve a lot on the currently common approaches to encoding (JSON / YAML)? If so, how does Concise Encoding differentiate itself?
Yes, I've looked at EDN before. It's solving a similar problem, but not the same one. EDN is focused on extensibility, whereas Concise Encoding is focused on concision and binary-text compatibility.
Transit is a bridging format to other encoding formats, which is very cool, but a fairly different problem space.
Fressian is pretty close, but is binary only. The main purpose of Concise Encoding is binary-text compatible formats.
You might spend a little more time saying what makes your approach unique:
"Use text based formats that are bloated and slow, or use binary formats that humans can't read. Wouldn't it be nice to have the benefits of both, and none of the drawbacks?"
NeXT/Apple property lists have had that feature for decades now, with multiple text and binary serialization formats.[1]
Infra [2] also makes very similar claims "Existing metaformats fit neatly into two categories. They are either textual for human-readability (such as XML and JSON) or binary for compact serialization (such as Thrift and Protocol Buffers). Infra can play the role of either, imbuing each with the desirable properties of the other."
Property lists came very close, but still suffer from a number of issues:
- The binary format is not efficiently packed. For example, Concise Encoding uses 200 of the 256 codes to directly encode integers (-100 to 100) since they are the most common data values in the wild (along with short strings, which also have their own special encoding).
- There's no array type (as in contiguous objects of the same type and size)
- Dates are in seconds, not Gregorian fields, and have no time zones.
- Binary data in the text format is base64 encoded, ensuring that it's 100% unreadable in its raw form.
- Container types are prepended with a length field, making progressive construction impossible.
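To illustrate the last point, here's a rough Python sketch (hypothetical markers, not the actual plist or CBE wire formats) of why a length prefix blocks progressive construction:

```python
# A length-prefixed container forces the writer to know the element count up
# front, so nothing can be emitted until the whole container is materialized.
def write_length_prefixed(out, items):
    items = list(items)                        # must buffer everything first
    out.write(len(items).to_bytes(4, "big"))
    for item in items:
        out.write(item)

# A terminated container can be written progressively, element by element.
END_MARKER = b"\xff"                           # hypothetical end-of-container code
def write_terminated(out, items):
    for item in items:                         # items may be a generator
        out.write(item)
    out.write(END_MARKER)
```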
Infra looks interesting, but it strays from editable text into a kind of hybrid format that requires a specialized editor to read and write. This would only work if such editors became ubiquitous, which is unlikely. I ran down this path for a while as well, but finally decided against it.
How does this address format creep? Why is it going to be better to support all the existing formats and now this one too? Does it really add that much?
The first formats filled a void. This format joins many players already on the field.
It's versioned, so if we come up with significant improvements later, we can bump the version.
This format fills two voids:
1. The lack of native types in most of the other formats
2. The lack of either editability or processing efficiency, depending on whether it's a binary or text format.
I developed this format because I'm tired of putting encodings on top of encodings (i.e. base64) and other such tricks, just to get my data across to the other side, and I want a general purpose data format that's efficient as well as readable.
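This is the kind of workaround I mean (a Python sketch with a made-up "thumbnail" field):

```python
import base64
import json

# The "encoding on top of an encoding" trick: binary data has to be
# base64-wrapped just to survive a text format that has no bytes type.
blob = bytes(range(16))
wrapped = json.dumps({"thumbnail": base64.b64encode(blob).decode("ascii")})
unwrapped = base64.b64decode(json.loads(wrapped)["thumbnail"])
assert unwrapped == blob
```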
Yes, and because of that it suffers the same downsides as JSON. You also can't just transparently convert CBOR to JSON because of the data type disparity.
I think it's interesting that they included a markup type. Based on the spec it's "similar to XML", so I suppose this markup is of their own making.
It doesn't seem like this would make transmitting XML or HTML significantly more efficient compared to just gzipping a string. And if you wanted to use this, you'd have to write the conversion first.
Can someone help me understand the use case for this, i.e. in which situations would this be super useful?
The idea is to allow presentation data to be included inside of general data, and also to allow presentation data to contain any kind of general data (not just string keys and string values). At some point in the distant future, it would also be nice to have official numeric tag assignments in schemas (e.g. 1 = "View", 2 = "TextBox", etc) to further improve processing by eliminating the need for text parsing just to decide what kind of object to present. These would be transparently converted from the text-based names in the human-readable text format when converting to the machine-read binary format.
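Something like this (a Python sketch with hypothetical tag numbers, just to show the shape of it):

```python
# A schema maps markup element names to small integers, so the binary form
# never needs to carry (or parse) the text names at all.
TAGS = {"View": 1, "TextBox": 2}
NAMES = {v: k for k, v in TAGS.items()}

def name_to_tag(name: str) -> int:      # used when converting text -> binary
    return TAGS[name]

def tag_to_name(tag: int) -> str:       # used when converting binary -> text
    return NAMES[tag]
```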
The overarching purpose of Concise Encoding is to bring forth the tools to make transmitted data accessible to humans, and at the same time efficient to process by machines.
I really like the inclusion of hexfloats (beware of C++ iostream bugs, though), because they're wayyyyy easier to parse correctly than decimal floats (way faster, too). As an added bonus they're reasonably easy to read for developers, too, since they're just scientific notation in hex.
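For anyone who hasn't used them, Python's float type shows the round trip nicely:

```python
# Hex floats: exact and trivially round-trippable, no decimal rounding to get wrong.
x = 0.1
h = x.hex()                       # '0x1.999999999999ap-4'
assert float.fromhex(h) == x      # exact round trip
```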
Native support for complex types like URLs and UUIDs is a mistake. It just makes it more complicated and harder to interoperate with other formats like JSON. Look at all the archaic types ASN.1 supports.
URIs and UUIDs are never going away. We're always going to need a way to lookup resources, and we're always going to need unique identifiers. To not include them is a mistake, because you then have to add a (potentially incompatible) extra layer of encoding at the application level just to store them in the encoding format.
URL parsing is a nightmare. There are lots of RFCs, but the real world is full of quirks. This will lead to edge-case differences between implementations.
Also, which URI standard is part of the spec, and what happens if that standard evolves? It is a complex standard with multiple RFCs and a long history.
What about possible differing length restrictions in language-native URI types?
I have no stake in this though, feel free to ignore. I tend to see negatives first.
Yes, it's true that URLs are a nightmare, but we don't have anything better (yet). Once we do, I'll happily release v2 of the spec. For now, it follows the RFC.
TBH at this point it doesn't really matter what the URI specifications say. Somehow we're able to stuff our URLs into our browsers, web pages, package managers, REST APIs, mail clients etc and manage to get it working. It's useful enough that everyone uses it, so I'd be a fool to get stuck on the "correctness" of their specs. CTE uses the double-quote as a delimiter, so as long as the contents are percent-escaped for double-quote and your language's URL parser says "cool", it's acceptable.
There's nothing stopping someone from putting a URL validator into their codec, but it's probably a lot less work to just pass it to your runtime library's URL parser.
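For example, in Python that's just:

```python
from urllib.parse import urlsplit

# What "pass it to your runtime library's URL parser" looks like; the codec
# itself only needs to handle the quoting of its own delimiter.
parts = urlsplit("https://example.com/search?q=%22quoted%22")
assert parts.scheme == "https"
assert parts.netloc == "example.com"
assert parts.query == "q=%22quoted%22"
```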
I disagree. Just because a common data format doesn't support those types doesn't mean other formats have to make the same mistake (and given that JSON lacks comments and an official formal spec, its design is questionable to begin with). If interoperability is needed with JSON or other formats that don't support dates or UUIDs, such data types can be serialized as strings.
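Concretely (a Python sketch, with JSON as the lowest common denominator):

```python
import json
import uuid

# If the other format has no UUID type, degrade to a string representation
# at that boundary; the richer format can keep the native type.
u = uuid.uuid4()
payload = json.dumps({"id": str(u)})
restored = uuid.UUID(json.loads(payload)["id"])
assert restored == u
```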
Although it uses text values for delimiters, Bencode isn't designed to be edited in text mode. If you want to transmit binary data that can be efficiently read by a machine, you cannot also make it human readable/editable - they have opposing goals. The only way to get both is to have a twin format (binary and text) that is 1:1 compatible and transparently convertible.