Jeaye a day ago

This is superb. Thank you for making it and licensing it MIT. I think this is a contender to replace the lexer within jank. I'll do some benchmarking next year and we'll see!

  • delaguardo a day ago

    Wow, that is great news!) Thanks for looking at it from this perspective! There are some benchmarks already available in the project - https://github.com/DotFox/edn.c/blob/main/bench/bench_integr...

    you can run it locally with `make bench bench-clj bench-wasm`

    Let me know if I can do anything to help you with support in jank.

    • Jeaye 16 hours ago

      It looks like the key missing part which would be needed for a lexer is source information (bare minimum: byte offset and size). I don't think edn.c can be used as a lexer without that, since error reporting requires accurate source information.

      As a side note, I'm curious how much AI was used in the creation of edn.c. These days, I like to get a measure of that for every library I use.

      • delaguardo 14 hours ago

        It should be easy to add source info for every token; some tokens already keep both (size and offset). I can create a branch for that.
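
        A sketch of what per-token source information could look like in C (names are illustrative, not taken from edn.c):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not edn.c's actual API: a span recording the
 * byte offset and size of a token in the source buffer -- the bare
 * minimum a lexer needs for accurate error reporting. */
typedef struct {
    size_t offset;  /* byte offset of the token's first character */
    size_t size;    /* token length in bytes */
} edn_span;

/* Record the span of the first keyword-like token (starting with ':'). */
static int find_keyword_span(const char *src, edn_span *out) {
    const char *p = strchr(src, ':');
    if (p == NULL)
        return 0;
    size_t start = (size_t)(p - src);
    size_t end = start + 1;
    while (src[end] != '\0' && src[end] != ' ' && src[end] != '}')
        end++;
    out->offset = start;
    out->size = end - start;
    return 1;
}
```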

        > I'm curious how much AI was used in the creation of edn.c

        A fair amount. This is my first big public project written in pure C. I did consult an LLM about best practices for code organisation, memory management, differences in SIMD instructions between platforms, etc. All the things a Clojure developer typically doesn't think about (a luxury of a hosted language). Ultimately, the goal was to learn some C programming; a working reader is a side effect of that.

        > These days, I like to get a measure of that for every library I use.

        Btw, I'm curious, what kind of measure are you looking for?

  • drob518 a day ago

    Oooo that’d be nice.

exceptione a day ago

Interesting, I had to look up what EDN is. Important to note that EDN doesn't have a concept of a schema like JSON Schema.

This is a `map`, which bears a resemblance to a JSON object. The following might look like an invalid payload, but will actually parse as valid EDN:

  {:a 1, "foo" :bar, [1 2 3] four}
  ;; Note that keys and values can be elements of any type.
  ;; The use of commas above is optional, as they are parsed as whitespace.

If one wants to exchange complex data structures, ATerm is also an option: https://homepages.cwi.nl/~daybuild/daily-books/technology/at...

Some projects in Haskell use ATerms, as the format is suitable for exchanging sum and product types.

  • delaguardo a day ago

    One of the key design principles of EDN is to be exclusively a data exchange format. This is true even for JSON, where JSON Schema is something that sits on top of JSON itself. The same goes for EDN: in Clojure, clojure.spec adds schema-like notation, validation rules, and conformance. https://clojure.org/about/spec , something like this could be implemented in other languages as well.

  • fulafel 17 hours ago

    JSON doesn't have schemas either; JSON Schema is just a separate schema spec that happens to build on JSON, and you might be using, for example, Zod instead of it. Similarly, systems that consume EDN can have various schema systems, for example spec or malli in the Clojure world (or you could use Zod with EDN, etc.).

eliasdejong a day ago

EDN has extra features over JSON, but it is still a text format, which makes it quite inefficient compared to binary formats. EDN also has no built-in 'raw bytes' type.

I am working on a format consisting of a serialized B-tree. It is essentially a dictionary, but serialized, which means you can traverse the structure and perform zero-copy lookups without parsing: https://github.com/fastserial/lite3

  • delaguardo a day ago

    Thanks for the link!

    Yes, EDN is a textual format intended to be human-readable. There is also a format called Transit used to serialise EDN elements. Unlike raw EDN, Transit is designed purely for program-to-program communication and drops human readability in favor of performance. It can encode data into either binary (MessagePack) or text (JSON), but in both cases, it preserves all EDN data types and originates from the Clojure language.

    https://github.com/cognitect/transit-format

  • zzo38computer 15 hours ago

    > EDN also has no builtin 'raw bytes' type.

    That was my complaint too.

    > I am working on a format consisting of serialized B-tree. It is essentially a dictionary, but serialized

    I had wanted something a bit similar; a serialized B-tree (or a similar structure) but with only a 'raw bytes' type, for keys and values (I will use DER for the values; I have my own library to work with DER already), and the ability to easily find all records whose key matches a specified prefix.

sevensor a day ago

I don’t wish to pick on this post, it looks quite well done. However, in general, I have some doubts about data formats with typed primitives. JSON, TOML, ASN.1, what have you. There’s very little you can do with the data unless you apply a schema, so why decode before then? The schema tells you what type you need anyway, so why add syntax complexity if you have to double check the result of parsing?

  • zzo38computer 15 hours ago

    I think it depends on what you intend to do with the data (which is true for all of the formats you mentioned); not everyone will do the same thing with it, even for the same file. It can be helpful for programs that do not know the schema to still be able to parse the data. This is not always the case when using IMPLICIT types in ASN.1, which is one reason to use EXPLICIT instead, although each has advantages and disadvantages; in DER, however, all types use the same framing, so the framing can be parsed even when the specific type is not understood by the reader. It also helps in case the schema is later extended to use types other than the ones originally expected. (I prefer to use ASN.1 DER in my own work, although JSON and other formats are also used by other formats that were made by someone else.)

    • sevensor 9 hours ago

      > It might be helpful to know from other programs that do not know this schema to be able to parse the data

      OK that’s a really interesting question: if you’re interpreting a text without knowing what it’s about, having type information embedded in it could help clarify the writer’s intent? That seems reasonable. Have you done this?

  • delaguardo 21 hours ago

    I can do a lot without applying a schema at all. For that I only need the handful of types defined in the EDN specification and the Clojure programming language.

    • sevensor 18 hours ago

      Suppose you have the EDN text

          (
            {
              :name "Fred"
              :age 35
            }
            {
              :name 37
              :age "Wilma"
            }
          )
      
      There's a semantic error here; the name and age fields have been swapped in the second element of the list. At some point, somebody has to check whether :name is a string and :age is a number. If your application is going to do that anyway, why do syntax typing? You might as well just try to construct a number from "Wilma" at the point where you know you need a number.

      Obviously I have an opinion here, but I'm putting it out there in the hope of being contradicted. The whole world seems to run on JSON, and I'm struggling to understand how syntax typing helps with JSON document validation rather than needlessly complicating the syntax.

      • fulafel 7 hours ago

        I guess there are two questions: should the serialization format be coupled with the schema system, and should the serialization format have types.

        If you answer the first question with no, then the second question is revealed to just be about various considerations other than validation, such as legibility and obvious mapping to language types (such as having a common notation for symbols/keywords, sets, etc).

        JSON and EDN are similar here, if your comment was in the context of the JSON vs EDN difference. There's some incidental additional checking at the syntax level with EDN, but that's not its purpose.

        You can do interesting things with the data even if you don't parse/validate all of it.

        Eg an important feature of the spec schema system and philosophy is that you don't want closed specs, you want code to be able to handle and pass on data that is richer than what the code knows about, and if circumstances allow you shouldn't try to validate it in one place.

      • delaguardo 14 hours ago

        What do you mean by "syntax typing" and complications in the syntax?

        > The whole world seems to run on JSON

        That is true, and I don't like that :)

        From my perspective, JSON syntax is too "light", and that translates into many complications, typically in the form of conventions: {"id": {"__MW__type": "LONG NUMBER", "value": "9999999999999999999999999"}}.
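
        The root cause of such conventions is that many JSON readers map every number to an IEEE-754 double, which cannot even cover the 64-bit integer range; a quick check in C (illustrative helper, not from any library):

```c
#include <limits.h>
#include <stdlib.h>

/* Illustrative: a JSON reader that maps every number to an IEEE-754
 * double cannot hold a 25-digit id exactly; the value above is beyond
 * even the unsigned 64-bit integer range, which is why string-wrapping
 * conventions like the one quoted exist. */
static int fits_in_u64(const char *numeral) {
    double d = strtod(numeral, NULL);
    return d <= (double)ULLONG_MAX;
}
```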

        • sevensor 9 hours ago

          > convention: {"id": {"__MW__type": "LONG NUMBER", "value": "9999999999999999999999999"}}.

          Huh. I haven’t run into this, although I totally see the problem. It’s backdooring types into JSON that it doesn’t support. I agree JSON’s number types are weak; it’s been a source of real problems for me. Given that observation, you can go in two directions: have richer types, like EDN, or give up on types in JSON entirely, which is the alternative I’d propose. I need to put my money where my mouth is here and implement something to demonstrate what I’m talking about, but imagine if JSON didn’t have numbers at all. The receiver would have to convert values to numbers after decoding, but I’m arguing that’s fine because in practice you have to check the value’s type anyway before you can use it.
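
          A sketch of the "convert at the point of use" approach in C (hypothetical helper, assuming the decoder hands back plain strings):

```c
#include <stdlib.h>

/* Sketch of the typeless alternative: the decoder yields only strings,
 * and the application parses a number exactly where it needs one. */
static int parse_age(const char *field, long *out) {
    char *end;
    long v = strtol(field, &end, 10);
    if (end == field || *end != '\0')
        return 0;           /* "Wilma" fails here, at the point of use */
    *out = v;
    return 1;
}
```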

          When I say “syntax typing,” I mean that, for example, 31 is a number and “blue” is a string, and we know that because the string has quotation marks around it and the number is made of decimal digits.

        • zzo38computer 9 hours ago

          > What do you mean under "syntax typing" and complications in the syntax?

          This question has been answered by someone else, but I have my own comments about this as well so I will write it also.

          EDN does complicate the syntax (so do XML, TER, various extensions of JSON, etc.), but DER (and SDSER, if you want streaming) avoids this problem because the framing is the same for all data types, even though it has many different types, each with its own encoding of values.

          > That [the whole world seems to run on JSON] is true, and I don't like that :)

          I agree with you (well, not about everything, but about too many things); I don't like it either.

          > From my perspective JSON syntax is too "light" and that translates to many complications typically in the form of convention

          I agree with you about that too. In this case it is a number (there are problems with the numeric types in JSON), but there are also such things as: octet strings, date/time, non-Unicode text, etc.

          > I agree JSON’s number types are weak; it’s been a source of real problems for me. Given that observation, you can go in two directions: have richer types, like EDN, or give up on types in JSON entirely, which is the alternative I’d propose.

          Not only the number type (which is floating point, so there is no proper 64-bit or larger integer type, even though an integer type was added to JavaScript after JSON was invented); the string type is also weak (since it cannot hold arbitrary bytes), and so is the key/value list type (keys are only allowed to be strings and cannot be other types).

          There are other directions as well; I think DER sits in between, because of the same framing for all types even though there are many types (you do not have to use all of the types; it seems some people don't like it, apparently due to the expectation that you have to use all of the types, but that is wrong). (DER also has the advantage of a canonical form if you need it, since DER is already a canonical form; although there is a canonical form for JSON, it is a bit messy, and apparently the canonical form for numbers in JSON is complicated.)

          If you want to give up on types entirely, then why should you use JSON?

HexDecOctBin a day ago

Can the metadata feature be used to ergonomically emulate HTML attributes? It's not clear from the docs, and the spec doesn't seem to document the feature at all.

  • nerdponx a day ago

    I'm not sure how the metadata syntax works, but you might not need it because you can do this:

      (html
        (head
          (title "Hello!"))
        (body
          (div
            (p
              "This is an example of a hyperlink: "
              (a "Example" :href "https://example.org/")))))
  • delaguardo a day ago

    I think you can use metadata to model HTML attributes, but in Clojure people use a plain vector for that. https://github.com/weavejester/hiccup

    tl;dr: the first element of the vector is a tag, the second is a map of attributes, and the rest are child nodes:

    [:h1 {:font-size "2em" :font-weight "bold"} "General Kenobi, you are a bold one"]

zzo38computer a day ago

I think it would be better to not use Unicode (so that you can use any character set), and to use "0o" instead of "0" prefix for octal numbers. Also, EDN seems to lack a proper format for binary data.

I think ASN.1 (and ASN.1X, which is ASN.1 to which I added a few additional types such as a key/value list and TRON strings) is better. (I also made up a text-based ASN.1 format called TER, which is intended to be converted to the binary DER format. It is also intended that extensions and subsets of TER can be made for specific applications if needed.) (I also wrote a DER decoder/encoder library in C, and programs that use that library, to convert TER to DER and to convert JSON to DER.)

ASN.1 (and ASN.1X) has many types similar to EDN's, and a comparison can be made:

- Null (called "nil" in EDN) and booleans are available in ASN.1.

- Strings in ASN.1 are fortunately not limited to Unicode; you can also use ISO 2022, as well as octet strings and bit strings. However, there is no "single character" type.

- ASN.1 does have an Enumerated type, although the enumeration is made of numbers rather than names. The EDN "keyword" type seems to be intended for enumerations.

- The integer and floating point types in ASN.1 are already arbitrary precision. If a reader requires limited precision (e.g. 64 bits), it is easy to detect that a value is out of range and raise an error.

- ASN.1 does not have a separate "list" and "vector" type, but does have a "set" type and a "sequence" type. A key/value list ("map") type is a nonstandard type in ASN.1X, but standard ASN.1 does not have a key/value list type.

- ASN.1 does have tagging, although it works differently from EDN's. ASN.1 already has a date/time type, though, so this extension is not needed. Extensions are possible via application types and private types, as well as by other methods such as External, Embedded PDV, and the nonstandard ASN1_IDENTIFIED_DATA type.

- The rational number type (present in edn.c, though the main EDN specification does not seem to mention it) is not a standard type in ASN.1, but ASN.1X does have such a type.

(Some people complain that ASN.1 is complicated; this is not wrong, but you will only need to implement the parts that you will use (which is simpler when using DER rather than BER; I think BER is not very good and DER is much better), which ends up making it simpler while also capable of doing the things that would be desirable.)

(But, EDN does solve some of the problems with JSON, such as comments and a proper integer type.)
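
The integer range check described above is cheap because a DER INTEGER's contents octets are big-endian two's complement, and DER requires the minimal-length encoding; an illustrative sketch (not from any particular library):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: decode a DER INTEGER's contents octets
 * (big-endian two's complement) into int64_t, reporting overflow.
 * Because DER mandates minimal-length encoding, contents longer than
 * 8 octets are necessarily outside the int64_t range. */
static int der_int_to_i64(const unsigned char *buf, size_t len, int64_t *out) {
    if (len == 0 || len > 8)
        return 0;                                    /* out of range */
    uint64_t v = (buf[0] & 0x80) ? ~(uint64_t)0 : 0; /* sign-extend */
    for (size_t i = 0; i < len; i++)
        v = (v << 8) | buf[i];
    *out = (int64_t)v;
    return 1;
}
```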

  • delaguardo a day ago

    > EDN seems to lack a proper format for binary data

    The best part of EDN is that it is extensible :)

    #binary/base64 "SGVsbG8sIHp6bzM4Y29tcHV0ZXIhIEhvdyBhcmUgeW91IGRvaW5nPw=="

    This is a tagged literal that can be read by a custom reader (if one is provided) during reading of the document. The result can be any type you want.
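
    A handler for such a tag only has to turn the tagged string into bytes; a minimal base64 decoding sketch in C (hypothetical, not part of edn.c):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tag-handler body, not part of edn.c: decode a base64
 * string into out, returning the number of bytes written, or -1 on an
 * invalid character. Stops at '=' padding. */
static int base64_decode(const char *s, unsigned char *out) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    unsigned acc = 0;
    int bits = 0, n = 0;
    for (; *s != '\0' && *s != '='; s++) {
        const char *p = strchr(tbl, *s);
        if (p == NULL)
            return -1;                                /* invalid character */
        acc = ((acc << 6) | (unsigned)(p - tbl)) & 0x3FFFu; /* <=14 live bits */
        bits += 6;
        if (bits >= 8) {                              /* a full byte is ready */
            bits -= 8;
            out[n++] = (unsigned char)((acc >> bits) & 0xFFu);
        }
    }
    return n;
}
```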

    • zzo38computer 15 hours ago

      OK, this is possible, but it seems like that type ought to be built in.

      Also, if there is no binary file format for the data, then you always have to convert to/from base64 when working with this file, whether or not you should need to.

      Furthermore, this does not work very well when you want to deal with character sets rather than binary data, since (as far as I can tell from the specification) the input will still need to be UTF-8 and follow the EDN syntax of an existing type.

      From what I can understand from the specification, the EDN decoder will still need to run and cannot be streamed if the official specification is used (which can make it inefficient), although it would probably be possible to make an implementation that can do this with streaming instead (but I don't know if the existing one does).

      So, the extensibility is still restricted. (In my opinion, ASN.1 (and ASN.1X) does it better.)

      • delaguardo 13 hours ago

        > From what I can understand from the specification, the EDN decoder will still need to run and cannot be streamed if the official specification is used

        Sorry, but that is a misunderstanding; the EDN specification says:

        There is no enclosing element at the top level. Thus edn is suitable for streaming and interactive applications.

        > but I don't know if the existing one does

        This implementation does not do streaming for now, but it understands the concept of "reading one complete element" from a buffer. The only missing part is buffer management.

        > So, the extensibility is still restricted.

        Could you explain how it is restricted if you are allowed to run whatever you want during reading of edn document? You can even do IO, no restrictions at all!

        Consider this:

        #init/postgres {:db-spec {:host "..." :port 54321 ,,,} :specs {:user ,,,}} [#user/id 1 #user/id 2 #user/id 3]

        This allows you to have a program that can look up a Postgres database while reading a document, validating every returned object using the provided spec (conforming the value).

        > In my opinion, ASN.1 (and ASN.1X) does it better.

        Please show how it does better. I'm very curious

        • zzo38computer 13 hours ago

          I think you might have misunderstood what I meant, because I was unclear. I meant that it would have to decode the entire EDN string literal containing the base64 data before decoding the base64, not that it would have to decode the entire file before doing so. (I might still be wrong.)

          Specifically, I refer to what is quoted below:

          > Upon encountering a tag, the reader will first read the next element (which may itself be or comprise other tagged elements), then pass the result to the corresponding handler for further interpretation, and the result of the handler will be the data value yielded by the tag + tagged element, i.e. reading a tag and tagged element yields one value.

          > If a reader encounters a tag for which no handler is registered, the implementation can either report an error, call a designated 'unknown element' handler, or create a well-known generic representation that contains both the tag and the tagged element, as it sees fit. Note that the non-error strategies allow for readers which are capable of reading any and all edn, in spite of being unaware of the details of any extensions present.

          Due to these things, EDN does not have a proper "octet string" type, even if the extension is added.

          > This implementation does not do streaming for now, but it understands a concept of "reading one complete" element from buffer. The only missing part is buffer managment.

          OK, then it could be improved.

          > Could you explain how it is restricted if you are allowed to run whatever you want during reading of edn document? You can even do IO, no restrictions at all!

          Perhaps the above explains how it is restricted. It does not prevent you from looking up data in a database, etc; it is the data model of EDN itself which is restricted; it is not restricting what you do with it.

          > Please show how it does better. I'm very curious

          Since all types use the same framing, you can do "lazy decoding" where appropriate (you can also use custom decoders in any part of the file, and this can depend on the schema); ASN.1 has a built-in octet string type (as well as bit strings, unrestricted character strings, etc.), and you can add implicit or explicit tagging (I prefer implicit when the underlying type is a sequence or octet string, and explicit otherwise), as well as types such as External (and the nonstandard ASN1_IDENTIFIED_DATA type); you can easily define any type and easily skip past any field of any type.
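
          The "skip past any field" property follows directly from the uniform tag-length-contents framing; a hedged sketch in C, assuming single-byte tags:

```c
#include <stddef.h>

/* Sketch of DER's uniform framing: every element is tag, length,
 * contents, so a reader can skip any field without understanding its
 * type. Assumes single-byte tags; handles short- and long-form
 * lengths; returns bytes consumed, or 0 on error/truncation. */
static size_t der_skip(const unsigned char *buf, size_t len) {
    if (len < 2)
        return 0;
    size_t i = 1;                       /* past the tag octet */
    size_t clen = buf[i++];
    if (clen & 0x80) {                  /* long form: next n octets hold length */
        size_t n = clen & 0x7F;
        if (n == 0 || n > sizeof(size_t) || i + n > len)
            return 0;
        clen = 0;
        for (size_t k = 0; k < n; k++)
            clen = (clen << 8) | buf[i++];
    }
    if (clen > len - i)
        return 0;                       /* contents run past the buffer */
    return i + clen;                    /* total size of this element */
}
```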

          > #init/postgres {:db-spec {:host "..." :port 54321 ,,,} :specs {:user ,,,}} [#user/id 1 #user/id 2 #user/id 3]

          Even with TER (the example below does not use any extensions to TER itself, but extensions to TER are also possible; even without them, whoever reads the resulting DER can handle the application-specific types as needed), you can write:

            [ [P:(database.example) 54321] [0A:1 0A:2 0A:3] ]
          
          In this case, the "0A:" prefix means application type 0, which has a meaning specific to the application; presumably for this application, application type 0 would correspond to user IDs. This example uses implicit types for the user IDs; if you want explicit types instead, then you can write:

            [ [P:(database.example) 54321] [0A[1] 0A[2] 0A[3]] ]
          
          Or, if you want to extend TER instead, then you might define your own keyword, e.g. "userid{1}" instead of "0A:1" or "0A[1]".

          (TER is not one of the official ASN.1 formats; it is one that I invented for the purpose of having a text format for ASN.1 which can then be converted to DER; most programs would be expected to use DER rather than TER.)

          • delaguardo 12 hours ago

            > I meant that it would have to decode the entire EDN string literal containing the base64 data before decoding the base64

            yes, any edn reader implementation will read the complete base64 string from the example before giving the string to a custom reader. I understand now what you mean. However, I don't know what I can do about it. I use edn daily, it works great for me, and I have no immediate plans to replace it with something else.

            Anyway, the example you shared looks interesting, I'll definitely read more about it. Thank you.

Hammershaft a day ago

I'm grateful for this! Love seeing EDN find its way into new places.

huahaiy a day ago

Very nice. Is there a plan to have an EDN writer in C as well?

  • delaguardo a day ago

    Yes, the plan is there, but I haven't had time yet. Most likely it will be available next week.

    • huahaiy 21 hours ago

      Wonderful. Looking forward to it.

medv 2 days ago

A very impressive implementation with SIMD and WASM!