Today, I'm announcing the release of my library cdson: a parser and serializer for the DSON data format in C. (As the name suggests, DSON is a bit like JSON, and I strongly prefer its usage to YAML.) While I'm many years late to this joke, in that time somehow no one had implemented a DSON library in C.

DSON is not really (I think) intended to be taken seriously, so for my own amusement, I'm going to take it seriously. So, here's my feedback on the spec.

  • No specific character encoding is mandated for the text format.
  • The closest we get is "Unicode", but that refers to several encodings, and also is only referring to the string objects. And given the previous point, that means encodings can technically be mixed.
  • The term "Control Character" is neither defined nor a term of the art.
  • Some characters are permitted in both escaped and non-escaped form (e.g., '/', unless '/' somehow qualifies as a control character.).
  • Usage of the very construct is problematic for serialization code (which is probbly base-2 internally, not base-8, and therefore loses precision).
  • Permitting both very and VERY contradicts earlier in the spec stating "All keywords used by DSON are case-sensitive and must be in lower case.".
  • Quoth the spec "Whitespace can be inserted between any pair of tokens.", but "token" is not defined, and there isn't a clear way to define it that doesn't make parsing ambiguous.
  • Another consequence of whitespace being technically optional is that a fully compliant parser cannot perform lexing and parsing separately.
  • A further consequence of that is that technically leading and trailing whitespace are not permitted.
  • DSON inherits the JSON problem of prohibiting trailing delimiters within larger structures.
  • No support for comments. The best candidate is probably DogeScript's quiet/loud. shh would work too but is more work for the parser, and risks me typoing it as ssh.
  • \u-escapes specifify 18 bits of data per escape.

I have thought way too much about what to do with these 18 bits. For completeness, here's what all other extant DSON parsers do:

  • implement all escapes except \u. Defensible, but noncompliant.
  • ignore that escapes exist at all.
  • pass the issue to a JSON parser. This is invalid for a bunch of reasons, but for parsing this particular case, JSON escapes are 4 bytes of hex - which is only 16 bits total per escape, while we're holding 18. Additionally, the json.org spec (which the DSON spec is based on) doesn't indicate what to do with these 16 bits either - and fixing undefined behavior with more undefined behavior isn't the way to go.
  • handle it as a JSON-esque surrogate pair. 18 is less than the 20 bytes needed to do this, and the escapes aren't required to appear in pairs anyway.
  • assume that "Unicode" means UTF-16 and treat as a UCS-2 encoded character. Eww - and also this only makes sense if the entire document is already UTF-16, which is not guaranteed by the spec (nor do I want to deal with it).
  • treat as raw code point and convert to UTF-8.

cdson takes this last route. 18 bits of code point is planes 1-3, which is actually everything except private use and alternate reps right now. But it also gates using \u-escapes at all behind a flag.

Writing cdson has amused me, but having finished the project does not mean the amusement must cease. cdson is open source software under the very permissive MPL; feel free to add it to projects if doing so would amuse you too. And if you need a defensible config file format, might I recommend anything that's not YAML?