The database I wish I had
Updated on September 3, 2020
I watched the talk “Platform as a Reflection of Values: Joyent, Node.js and beyond” by Bryan Cantrill, and I think he was able to put into words something I already felt for some time: if there’s no piece of software out there that reflects your values, it’s time for you to build that software1.
I kind of agree with what he said, because this is already happening to me. I long for a database with a certain set of values, and for a few years I was just waiting for someone to finally write it. After watching his talk, Bryan is saying to me: “time to stop waiting, and start writing it yourself”.
So let me try to give an overview of such database, and go over its values.
I want a database that allows me to create decentralized client-side applications that can sync data.
The best one-line description I can give right now is:
It’s sort of like PouchDB, Git, Datomic, SQLite and Mentat.
A more descriptive version could be:
An embedded, immutable, syncable relational database.
Let’s go over what I mean by each of those aspects one by one.
I think the server-side database landscape is diverse and mature enough for my needs (even though I end up choosing SQLite most of the time), and what I’m after is a database to be embedded on client-side applications itself, be it desktop, browser, mobile, etc.
The purpose of such database is not to keep some local cache of data in case of lost connectivity: we have good solutions for that already. It should serve as the source of truth, and allow the application to work on top of it.
SQLite is a great example of that: it is a very powerful
relational database that runs almost anywhere. What I miss
from it that SQLite doesn’t provide is the ability to run it on the browser:
even though you could compile it to WebAssembly,
it assumes a POSIX filesystem
that would have to be emulated2.
Mentat was an interesting project, but its reliance on SQLite makes it inherit most of the downsides (and benefits too) of SQLite itself.
Having such a requirement imposes a different approach to storage: we have to decouple the knowledge about the intricacies of storage from the usage of storage itself, so that a module (say query processing) can access storage through an API without needing to know about its implementation. This allows the database to target a POSIX filesystems storage API and an IndexedDB storage API, and make the rest of the code agnostic about storage. PouchDB has such mechanism (called adapters) and Datomic has them too (called storage services).
This would allow the database to adapt to where it is embedded: when targeting the browser the IndexedDB storage API would provide the persistence layer that the database requires, and similarly the POSIX filesystem storage API would provide the persistence layer when targeting POSIX systems (like desktops, mobile, etc.).
But there’s also an extra restriction that comes from by being embedded: it needs to provide and embeddable artifact, most likely a binary library object that exposes a C compatible FFI, similar to how SQLite does. Bundling a full runtime environment is possible, but doesn’t make it a compelling solution for embedding. This rules out most languages, and leaves us with C, Rust, Zig, and similar options that can target POSIX systems and WebAssembly.
Being immutable means that only new information is added, no in-place update ever happens, and nothing is ever deleted.
Having an immutable database presents us with similar trade-offs found in persistent data structures, like lack of coordination when doing reads, caches being always coherent, and more usage of space.
Datomic is the go to database example of this: it will only add information (datoms) and allows you to query them in a multitude of ways. Stuart Halloway calls it “accumulate-only” over “append-only”3:
It’s accumulate-only, it is not append-only. So append-only, most people when they say that they’re implying something physical about what happens.
Also a database can be append-only and overwrite existing information with new information, by doing clean-ups of “stale” data. I prefer to adopt the “accumulate-only” naming and approach.
Git is another example of this: new commits are always added on top of the previous data, and it grows by adding commits instead of replacing existing ones.
Git repositories can only grow in size, and that is not only an acceptable condition, but also one of the reasons to use it.
All this means that no in-place updates happens on data, and the database will be much more concerned about how compact and efficiently it stores data than how fast it does writes to disk. Being embedded, the storage limitation is either a) how much storage the device has or b) how much storage was designed for the application to consume. So even though the database could theoretically operate with hundreds of TBs, a browser page or mobile application wouldn’t have access to this amount of storage. SQLite even says that it does support approximately 280 TBs of data, but those limits are untested.
The upside of keeping everything is that you can have historical views of your data, which is very powerful. This also means that applications should turn this off when not relevant4.
This is a frequent topic when talking about offline-first solutions. When building applications that:
- can fully work offline,
- stores data,
- propagates that data to other application instances,
then you’ll need a conflict resolution strategy to handle all the situations where different application instances disagree. Those application instances could be a desktop and a browser version of the same application, or the same mobile app in different devices.
A three-way merge seems to be the best approach, on top of which you could add application specific conflict resolution functions, like:
- pick the change with higher timestamp;
- if one change is a delete, pick it;
- present the diff on the screen and allow the user to merge them.
Some databases try to make this “easy”, by choosing a strategy for you, but I’ve found that different applications require different conflict resolution strategies. Instead, the database should leave this up to the user to decide, and provide tools for them to do it.
Three-way merges in version control are the best example, performing automatic merges when possible and asking the user to resolve conflicts when they appear.
The unit of conflict for a version control system is a line of text. The database equivalent would probably be a single attribute, not a full entity or a full row.
Making all the conflict resolution logic be local should allow the database to have encrypted remotes similar to how git-remote-gcrypt adds this functionality to Git. This would enable users to sync the application data across devices using an untrusted intermediary.
I want the power of relational queries on the client applications.
Most of the arguments against traditional table-oriented relational databases are related to write performance, but those don’t apply here. The bottlenecks for client applications usually aren’t write throughput. Nobody is interested in differentiating between 1 MB/s or 10 MB/s when you’re limited to 500 MB total.
The relational model of the database could either be based on SQL and tables like in SQLite, or maybe datalog and datoms like in Datomic.
Now let’s try to translate the aspects above into values, as suggested by Bryan Cantrill.
Being able to target so many different platforms is a bold goal, and the embedded nature of the database demands portability to be a core value.
When the local database becomes the source of truth of the application, it must provide consistency guarantees that enables applications to rely on it.
The database should empower applications to slice and dice the data in any way it wants to.
Since I can’t find any database that fits these requirements, I’ve finally come to terms with doing it myself.
It’s probably going to take me a few years to do it, and making it portable between POSIX and IndexedDB will probably be the biggest challenge. I got myself a few books on databases to start.
I wonder if I’ll ever be able to get this done.
See discussions on Reddit, lobsters, HN and a lengthy email exchange.
At the very end, at time 29:49. When talking about the draft of this article with a friend, he noted that Bryan O’Sullivan (a different Bryan) says a similar thing on his talk “Running a startup on Haskell”, at time 4:15. ↩
It was pointed out to me that SQLite doesn’t assume the existence of a POSIX filesystem, as I wrongly stated. Thanks for the correction.
This makes me consider it as a storage backend all by itself. I initially considered having an SQLite storage backend as one implementation of the POSIX filesystem storage API that I mentioned. My goal was to rely on it so I could validate the correctness of the actual implementation, given SQLite’s robustness.
However it may even better to just use SQLite, and get an ACID backend without recreating a big part of SQLite from scratch. In fact, both Datomic and PouchDB didn’t create an storage backend for themselves, they just plugged on what already existed and already worked. I’m beginning to think that it would be wiser to just do the same, and drop entirely the from scratch implementation that I mentioned.
That’s not to say that adding an IndexedDB compatibility layer to SQLite would be enough to make it fit the other requirements I mention on this page. SQLite still is an implementation of a update-in-place, SQL, table-oriented database. It is probably true that cherry-picking the relevant parts of SQLite (like storage access, consistency, crash recovery, parser generator, etc.) and leaving out the unwanted parts (SQL, tables, threading, etc.) would be better than including the full SQLite stack, but that’s simply an optimization. Both could even coexist, if desired.
SQLite would have to be treated similarly to how Datomic treats SQL databases: instead of having a table for each entities, spread attributes over the tables, etc., it treats SQL databases as a key-value storage so it doesn’t have to re-implement interacting with the disk that other databases do well.
The tables would contain blocks of binary data, so there isn’t a difference on how the SQLite storage backend behaves and how the IndexedDB storage backend behaves, much like how Datomic works the same regardless of the storage backend, same for PouchDB.
I welcome corrections on what I said above, too. ↩
Video “Day of Datomic Part 2” on Datomic’s information model, at time 12:28. ↩
Similar to Datomic’s