--- title: The database I wish I had date: 2020-08-31 layout: post lang: en ref: the-database-i-wish-i-had --- I watched the talk "[Platform as a Reflection of Values: Joyent, Node.js and beyond][platform-values]" by Bryan Cantrill, and I think he was able to put into words something I already felt for some time: if there's no piece of software out there that reflects your values, it's time for you to build that software[^talk-time]. [platform-values]: https://vimeo.com/230142234 [^talk-time]: At the very end, at time 29:49. When talking about the draft of this article with a friend, he noted that Bryan O'Sullivan (a different Bryan) says a similar thing on his talk "[Running a startup on Haskell](https://www.youtube.com/watch?v=ZR3Jirqk6W8)", at time FIXME. I kind of agree with what he said, because this is already happening to me. I long for a database with a certain set of values, and for a few years I was just waiting for someone to finally write it. After watching his talk, Bryan is saying to me: "time to stop waiting, and start writing it yourself". So let me try to give an overview of such database, and go over it's values. ## Overview I want a database that allows me to create decentralized client-side applications that can sync data. The best one-line description I can give right now is: > It's sort of like PouchDB, Git, Datomic, SQLite and Mentat. A more descriptive version could be: > An embedded, immutable, syncable relational database. Let's go over what I mean by each of those aspects one by one. ### Embedded I think the server-side database landscape is diverse and mature enough for my needs (even though I end up choosing SQLite most of the time), and what I'm after is a database to be embedded on client-side applications itself, be it desktop, browser, mobile, etc. The purpose of such database is not to keep some local cache of data in case of lost connectivity: we have good solutions for that already. It should serve as the source of truth, and allow the application to work on top of it. [**SQLite**][sqlite] is a great example of that: it is a very powerful relational database that runs [almost anywhere][sqlite-whentouse]. What I miss from it that SQLite doesn't provide is the ability to run it on the browser: even though you could compile it to WebAssembly, it assumes a POSIX filesystem that would have to be emulated. [sqlite]: https://sqlite.org/index.html [sqlite-whentouse]: https://sqlite.org/whentouse.html [**PouchDB**][pouchdb] is another great example: it's a full reimplementation of [CouchDB][couchdb] that targets JavaScript environments, mainly the browser and Node.js. However I want a tool that can be deployed anywhere, and not limit it's applications to places that already have a JavaScript runtime environment, or force the developer to bundle a JavaScript runtime environment with their application. This is true for GTK+ applications, command line programs, Android apps, etc. [pouchdb]: https://pouchdb.com/ [couchdb]: https://couchdb.apache.org/ [**Mentat**][mentat] was an interesting project, but it's reliance on SQLite makes it inherit most of the downsides (and benefits too) of SQLite itself. [mentat]: https://github.com/mozilla/mentat Having such a requirement imposes a different approach to storage: we have to decouple the knowledge about the intricansies of storage from the usage of storage itself, so that a module (say query processing) can access storage through an API without needing to know about it's implementation. This allows the database to target a POSIX filesystems storage API and an IndexedDB storage API, and make the rest of the code agnostic about storage. PouchDB has such mechanism (called [adapters][pouchdb-adapters]) and Datomic has them too (called [storage services][datomic-storage-services]). [pouchdb-adapters]: https://pouchdb.com/adapters.html [datomic-storage-services]: https://docs.datomic.com/on-prem/storage.html This would allow the database to adapt to where it is embedded: when targetting the browser the IndexedDB storage API would provide the persistence layer that the database requires, and similarly the POSIX filesystem storage API would provide the persistence layer when targetting POSIX systems (like desktops, mobile, etc.). But there's also an extra restriction that comes from by being embedded: it needs to provide and embeddable artifact, most likely a binary library object that exposes a C compatible FFI, similar to [how SQLite does][sqlite-amalgamation]. Bundling a full runtime environment is possible, but doesn't make it a compelling solution for embedding. This rules out most languages, and leaves us with C, Rust, Zig, and similar options that can target POSIX systems and WebAssembly. [sqlite-amalgamation]: https://www.sqlite.org/amalgamation.html ### Immutable Being immutable means that only new information is added, no in-place update ever happens, and nothing is ever deleted. Having an immutable database presents us with similar trade-offs found in persistent data structures, like lack of coordination when doing reads, caches being always coherent, and more usage of space. [**Datomic**][datomic] is the go to database example of this: it will only add information (datoms) and allows you to query them in a multitude of ways. Stuart Halloway calls it "accumulate-only" over "append-only"[^accumulate-only]: > It's accumulate-only, it is not append-only. So append-only, most people when > they say that they're impliying something physical about what happens. [datomic]: https://www.datomic.com/ [^accumulate-only]: Video "[Day of Datomic Part 2](https://vimeo.com/116315075)" on Datomic's infomation model, at time 12:28. Also a database can be append-only and overwrite existing information with new information, by doing clean-ups of "stale" data. I prefer to adopt the "accumulate-only" naming and approach. [**Git**][git] is another example of this: new commits are always added on top of the previous data, and it grows by adding commits instead of replacing existing ones. [git]: https://git-scm.com/ Git repositories can only grow in size, and that is not only an acceptable condition, but also one of the reasons to use it. All this means that no in-place updates happens on data, and the database will be much more concerned about how compact and efficiently it stores data than how fast it does writes to disk. Being embedded, the storage limitation is either a) how much storage the device has or b) how much storage was designed for the application to consume. So even though the database could theoretically operate with hundreds of TBs, a browser page or mobile application wouldn't have access to this amount of storage. SQLite even [says][sqlite-limits] that it does support approximately 280 TBs of data, but those limits are untested. The upside of keeping everything is that you can have historical views of your data, which is very powerful. This also means that applications should turn this off when not relevant[^no-history]. [sqlite-limits]: https://sqlite.org/limits.html [^no-history]: Similar to [Datomic's `:db/noHistory`](https://docs.datomic.com/cloud/best.html#nohistory-for-high-churn). ### Syncable This is a frequent topic when talking about offline-first solutions. When building applications that: - can fully work offline, - stores data, - propagates that data to other application instances, then you'll need a conflict resolution strategy to handle all the situations where different application instances disagree. Those application instances could be a desktop and a browser version of the same application, or the same mobile app in different devices. A three-way merge seems to be the best approach, on top of which you could add application specific conflict resolution functions, like: - pick the change with higher timestamp; - if one change is a delete, pick it; - present the diff on the screen and allow the user to merge them. Some databases try to make this "easy", by choosing a strategy for you, but I've found that different applications require different conflict resolution strategies. Instead, the database should leave this up to the user to decide, and provide tool for them to do it. [**Three-way merges in version control**][3-way-merge] are the best example, performing automatic merges when possible and asking the user to resolve conflicts when they appear. The unit of conflict for a version control system is a line of text. The database equivalent would probably be a single attribute, not a full entity or a full row. Making all the conflict resolution logic be local should allow the database to have encrypted remotes similar to how [git-remote-gcrypt][git-remote-gcrypt] adds this functionality to Git. This would enable users to sync the application data across devices using an untrusted intermediary. [3-way-merge]: https://en.wikipedia.org/wiki/Merge_(version_control) [git-remote-gcrypt]: https://spwhitton.name/tech/code/git-remote-gcrypt/ ### Relational I want the power of relational queries on the client applications. Most of the arguments against traditional table-oriented relational databases are related to write performance, but those don't apply here. The bottlenecks for client applications usually aren't write throughput. Nobody is interested in differentiating between 1 MB/s or 10 MB/s when you're limited to 500 MB total. The relational model of the database could either be based on SQL and tables like in SQLite, or maybe [datalog][datalog] and [datoms][datoms] like in Datomic. [datalog]: https://docs.datomic.com/on-prem/query.html [datoms]: https://docs.datomic.com/cloud/whatis/data-model.html#datoms ## From aspects to values Now let's try to translate the aspects above into values, as suggested by Bryan Cantrill. ### Portability Being able to target so many differente platforms is a bold goal, and the embedded nature of the database demands portability to be a core value. ### Integrity When the local database becomes the source of truth of the application, it must provide consistency guarantees that enables applications to rely on it. ### Expressiveness The database should empower applications to slice and dice the data in any way it wants to. ## Next steps Since I can't find any database that fits these requirements, I've finally come to terms with doing it myself. It's probably going to take me a few years to do it, and making it portable between POSIX and IndexedDB will probably be the biggest challenge. I got myself a few books on databases to start. I wonder if I'll ever be able to get this done.