Durable persistent trees and parser combinators - building a database
Updated on February 9, 2021
I’ve received with certain frequency messages from people wanting to know if I’ve made any progress on the database project I’ve written about.
There are a few areas where I’ve made progress, and here’s a public post on it.
The main thing I wanted to validate with a concrete implementation was the concept of modeling a DAG on a sequence of datoms.
The notion of a datom is a rip-off from Datomic, which models data with time aware facts, which come from RDF. RDF’s fact is a triple of subject-predicate-object, and Datomic’s datoms add a time component to it: subject-predicate-object-time, A.K.A. entity-attribute-value-transaction:
The above datoms say:
- at time 0,
- at time 1,
personstopped liking pizza, and started to like bread.
Datomic ensures total consistency of this ever growing log by having a single writer, the transactor, that will enforce it when writing.
In order to support disconnected clients, I needed a way to allow multiple writers, and I chose to do it by making the log not a list, but a directed acyclic graph (DAG):
The extra datoms above add more information to build the directionality to the log, and instead of a single consistent log, the DAG could have multiple leaves that coexist, much like how different Git branches can have different “latest” commits.
In order to validate this idea, I started with a Clojure implementation. The goal was not to write the actual final code, but to make a proof-of-concept that would allow me to test and stretch the idea itself.
This code already exists, but is yet fairly incomplete:
- the building of the index isn’t done yet (with some commented code on the next step to be implemented)
- the indexing is extremely inefficient, with more
than one occurrence of
- no query support yet.
However, as time passed and I started looking at what the final implementation would look like, I started to consider keeping the PoC around.
The top-down approach (Clojure PoC) was in fact helping guide me with the bottom-up, and I now have “promoted” the Clojure PoC into a “reference implementation”. It should now be a finished implementation that says what the expected behaviour is, and the actual code should match the behaviour.
The good thing about a reference implementation is that it has no performance of resources boundary, so if it ends up being 1000x slower and using 500× more memory, it should be find. The code can be also 10x or 100x simpler, too.
In promoting the PoC into a reference implementation, this top-down approach now needs to go beyond doing everything in memory, and the index data structure now needs to be disk-based.
Roughly speaking, most storage engines out there are based either on B-Trees or LSM Trees, or some variations of those.
But when building an immutable database, update-in-place B-Trees aren’t an option, as it doesn’t accommodate keeping historical views of the tree. LSM Trees may seem a better alternative, but duplication on the files with compaction are also ways to delete old data which is indeed useful for a historical view.
I think the thing I’m after is a mix of a Copy-on-Write B-Tree, which would keep historical versions with the write IO cost amortization of memtables of LSM Trees. I don’t know of any B-Tree variant out there that resembles this, so I’ll call it “Flushing Copy-on-Write B-Tree”.
I haven’t written any code for this yet, so all I have is a high-level view of what it will look like:
like Copy-on-Write B-Trees, changing a leaf involves creating a new leaf and building a new path from root to the leaf. The upside is that writes a lock free, and no coordination is needed between readers and writers, ever;
the downside is that a single leaf update means at least
Hnew nodes that will have to be flushed to disk, where
His the height of the tree. To avoid that, the writer creates these nodes exclusively on the in-memory memtable, to avoid flushing to disk on every leaf update;
a background job will consolidate the memtable data every time it hits X MB, and persist it to disk, amortizing the cost of the Copy-on-Write B-Tree;
readers than will have the extra job of getting the latest relevant disk-resident value and merge it with the memtable data.
The key difference to existing Copy-on-Write B-Trees is that the new trees are only periodically written to disk, and the intermediate values are kept in memory. Since no node is ever updated, the page utilization is maximum as it doesn’t need to keep space for future inserts and updates.
And the key difference to existing LSM Trees is that no compaction is run: intermediate values are still relevant as the database grows. So this leaves out tombstones and value duplication done for write performance.
One can delete intermediate index values to reclaim space, but no data is lost on the process, only old B-Tree values. And if the database ever comes back to that point (like when doing a historical query), the B-Tree will have to be rebuilt from a previous value. After all, the database is a set of datoms, and everything else is just derived data.
Right now I’m still reading about other data structures that storage engines use, and I’ll start implementing the “Flushing Copy-on-Write B-Tree” as I learn more1 and mature it more.
I chose Rust as it has the best WebAssembly tooling support.
My goal is not to build a Rust database, but a database that happens to be in Rust. In order to reach client platforms, the primary API is the FFI one.
I’m not very happy with current tools for exposing Rust code via FFI to the external world: they either mix C with C++, which I don’t want to do, or provide no access to the intermediate representation of the FFI, which would be useful for generating binding for any language that speaks FFI.
I like better the path that the author of cbindgen
crate proposes: emitting an data representation of the Rust C API
(the author calls is a
ffi.json file), and than building transformers from the
data representation to the target language. This way you could generate a C API
So the first thing to be done before moving on is an FFI exporter that doesn’t
mix C and C++, and generates said
ffi.json, and than build a few transformers
that take this
ffi.json and generate the language bindings, be it C, C++,
I think the best way to get there is by taking the existing code for cbindgen, which uses the syn crate to parse the Rust code3, and adapt it to emit the metadata.
I’ve started a fork of cbindgen:
x-bindgen4. Right now it is
just a copy of cbindgen verbatim, and I plan to remove all C and C++ emitting
code from it, and add a IR emitting code instead.
When starting working on x-bindgen, I realized I didn’t know what to look for in
a header file, as I haven’t written any C code in many years. So as I was
writing libedn, I didn’t know how to build a good C API to
expose. So I tried porting the code to C, and right now I’m working on building
a good C API for a JSON parser using parser combinators:
After “finishing” ParsecC I’ll have a good notion of what a good C API is, and I’ll have a better direction towards how to expose code from libedn to other languages, and work on x-bindgen then.
What both libedn and ParsecC are missing right now are proper error reporting, and property-based testing for libedn.
I’ve learned a lot already, and I feel the journey I’m on is worth going through.
If any of those topics interest you, message me to discuss more or contribute! Patches welcome!
If you are interested in learning more about this too, the very best two resources on this subject are Andy Pavlo’s “Intro to Database Systems” course and Alex Petrov’s “Database Internals” book. ↩
As flawed as this may be, it seems to be generally acceptable and adopted, which works against building a solid ecosystem for Rust.
The alternative that rust-ffi implements relies on internals of the Rust compiler, which isn’t actually worst, just less common and less accepted. ↩
EDIT: now archived, the experimentation was fun. I’ve started to move more towards C, so this effort became deprecated. ↩
EDIT: now also archived. ↩