aboutsummaryrefslogtreecommitdiff
path: root/_posts/2020-08-31-the-database-i-wish-i-had.md
diff options
context:
space:
mode:
Diffstat (limited to '_posts/2020-08-31-the-database-i-wish-i-had.md')
-rw-r--r--_posts/2020-08-31-the-database-i-wish-i-had.md295
1 files changed, 0 insertions, 295 deletions
diff --git a/_posts/2020-08-31-the-database-i-wish-i-had.md b/_posts/2020-08-31-the-database-i-wish-i-had.md
deleted file mode 100644
index 004a558..0000000
--- a/_posts/2020-08-31-the-database-i-wish-i-had.md
+++ /dev/null
@@ -1,295 +0,0 @@
----
-title: The database I wish I had
-date: 2020-08-31
-updated_at: 2020-09-03
-layout: post
-lang: en
-ref: the-database-i-wish-i-had
-category: mediator
----
-
-I watched the talk
-"[Platform as a Reflection of Values: Joyent, Node.js and beyond][platform-values]"
-by Bryan Cantrill, and I think he was able to put into words something I already
-felt for some time: if there's no piece of software out there that reflects your
-values, it's time for you to build that software[^talk-time].
-
-[platform-values]: https://vimeo.com/230142234
-[^talk-time]: At the very end, at time 29:49. When talking about the draft of
- this article with a friend, he noted that Bryan O'Sullivan (a different
- Bryan) says a similar thing on his talk
- "[Running a startup on Haskell](https://www.youtube.com/watch?v=ZR3Jirqk6W8)",
- at time 4:15.
-
-I kind of agree with what he said, because this is already happening to me. I
-long for a database with a certain set of values, and for a few years I was just
-waiting for someone to finally write it. After watching his talk, Bryan is
-saying to me: "time to stop waiting, and start writing it yourself".
-
-So let me try to give an overview of such database, and go over its values.
-
-## Overview
-
-I want a database that allows me to create decentralized client-side
-applications that can sync data.
-
-The best one-line description I can give right now is:
-
-> It's sort of like PouchDB, Git, Datomic, SQLite and Mentat.
-
-A more descriptive version could be:
-
-> An embedded, immutable, syncable relational database.
-
-Let's go over what I mean by each of those aspects one by one.
-
-### Embedded
-
-I think the server-side database landscape is diverse and mature enough for
-my needs (even though I end up choosing SQLite most of the time), and what I'm
-after is a database to be embedded on client-side applications itself, be it
-desktop, browser, mobile, etc.
-
-The purpose of such database is not to keep some local cache of data in case of
-lost connectivity: we have good solutions for that already. It should serve as
-the source of truth, and allow the application to work on top of it.
-
-[**SQLite**][sqlite] is a great example of that: it is a very powerful
-relational database that runs [almost anywhere][sqlite-whentouse]. What I miss
-from it that SQLite doesn't provide is the ability to run it on the browser:
-even though you could compile it to WebAssembly, ~~it assumes a POSIX filesystem
-that would have to be emulated~~[^posix-sqlite].
-
-[sqlite]: https://sqlite.org/index.html
-[sqlite-whentouse]: https://sqlite.org/whentouse.html
-[^posix-sqlite]: It was [pointed out to me](https://news.ycombinator.com/item?id=24338881)
- that SQLite doesn't assume the existence of a POSIX filesystem, as I wrongly
- stated. Thanks for the correction.
-
- This makes me consider it as a storage backend all by itself. I
- initially considered having an SQLite storage backend as one implementation
- of the POSIX filesystem storage API that I mentioned. My goal was to rely on
- it so I could validate the correctness of the actual implementation, given
- SQLite's robustness.
-
- However it may even better to just use SQLite, and get an ACID backend
- without recreating a big part of SQLite from scratch. In fact, both Datomic
- and PouchDB didn't create an storage backend for themselves, they just
- plugged on what already existed and already worked. I'm beginning to think
- that it would be wiser to just do the same, and drop entirely the from
- scratch implementation that I mentioned.
-
- That's not to say that adding an IndexedDB compatibility layer to SQLite
- would be enough to make it fit the other requirements I mention on this
- page. SQLite still is an implementation of a update-in-place, SQL,
- table-oriented database. It is probably true that cherry-picking the
- relevant parts of SQLite (like storage access, consistency, crash recovery,
- parser generator, etc.) and leaving out the unwanted parts (SQL, tables,
- threading, etc.) would be better than including the full SQLite stack, but
- that's simply an optimization. Both could even coexist, if desired.
-
- SQLite would have to be treated similarly to how Datomic treats SQL
- databases: instead of having a table for each entities, spread attributes
- over the tables, etc., it treats SQL databases as a key-value storage so it
- doesn't have to re-implement interacting with the disk that other databases
- do well.
-
- The tables would contain blocks of binary data, so there isn't a difference
- on how the SQLite storage backend behaves and how the IndexedDB storage
- backend behaves, much like how Datomic works the same regardless of the
- storage backend, same for PouchDB.
-
- I welcome corrections on what I said above, too.
-
-[**PouchDB**][pouchdb] is another great example: it's a full reimplementation of
-[CouchDB][couchdb] that targets JavaScript environments, mainly the browser and
-Node.js. However I want a tool that can be deployed anywhere, and not limit its
-applications to places that already have a JavaScript runtime environment, or
-force the developer to bundle a JavaScript runtime environment with their
-application. This is true for GTK+ applications, command line programs, Android
-apps, etc.
-
-[pouchdb]: https://pouchdb.com/
-[couchdb]: https://couchdb.apache.org/
-
-[**Mentat**][mentat] was an interesting project, but its reliance on SQLite
-makes it inherit most of the downsides (and benefits too) of SQLite itself.
-
-[mentat]: https://github.com/mozilla/mentat
-
-Having such a requirement imposes a different approach to storage: we have to
-decouple the knowledge about the intricacies of storage from the usage of
-storage itself, so that a module (say query processing) can access storage
-through an API without needing to know about its implementation. This allows
-the database to target a POSIX filesystems storage API and an IndexedDB storage
-API, and make the rest of the code agnostic about storage. PouchDB has such
-mechanism (called [adapters][pouchdb-adapters]) and Datomic has them too (called
-[storage services][datomic-storage-services]).
-
-[pouchdb-adapters]: https://pouchdb.com/adapters.html
-[datomic-storage-services]: https://docs.datomic.com/on-prem/storage.html
-
-This would allow the database to adapt to where it is embedded: when targeting
-the browser the IndexedDB storage API would provide the persistence layer
-that the database requires, and similarly the POSIX filesystem storage API would
-provide the persistence layer when targeting POSIX systems (like desktops,
-mobile, etc.).
-
-But there's also an extra restriction that comes from by being embedded: it
-needs to provide and embeddable artifact, most likely a binary library object
-that exposes a C compatible FFI, similar to
-[how SQLite does][sqlite-amalgamation]. Bundling a full runtime environment is
-possible, but doesn't make it a compelling solution for embedding. This rules
-out most languages, and leaves us with C, Rust, Zig, and similar options that
-can target POSIX systems and WebAssembly.
-
-[sqlite-amalgamation]: https://www.sqlite.org/amalgamation.html
-
-### Immutable
-
-Being immutable means that only new information is added, no in-place update
-ever happens, and nothing is ever deleted.
-
-Having an immutable database presents us with similar trade-offs found in
-persistent data structures, like lack of coordination when doing reads, caches
-being always coherent, and more usage of space.
-
-[**Datomic**][datomic] is the go to database example of this: it will only add
-information (datoms) and allows you to query them in a multitude of ways. Stuart
-Halloway calls it "accumulate-only" over "append-only"[^accumulate-only]:
-
-> It's accumulate-only, it is not append-only. So append-only, most people when
-> they say that they're implying something physical about what happens.
-
-[datomic]: https://www.datomic.com/
-[^accumulate-only]: Video "[Day of Datomic Part 2](https://vimeo.com/116315075)"
- on Datomic's information model, at time 12:28.
-
-Also a database can be append-only and overwrite existing information with new
-information, by doing clean-ups of "stale" data. I prefer to adopt the
-"accumulate-only" naming and approach.
-
-[**Git**][git] is another example of this: new commits are always added on top
-of the previous data, and it grows by adding commits instead of replacing
-existing ones.
-
-[git]: https://git-scm.com/
-
-Git repositories can only grow in size, and that is not only an acceptable
-condition, but also one of the reasons to use it.
-
-All this means that no in-place updates happens on data, and the database will
-be much more concerned about how compact and efficiently it stores data than how
-fast it does writes to disk. Being embedded, the storage limitation is either a)
-how much storage the device has or b) how much storage was designed for the
-application to consume. So even though the database could theoretically operate
-with hundreds of TBs, a browser page or mobile application wouldn't have access
-to this amount of storage. SQLite even [says][sqlite-limits] that it does
-support approximately 280 TBs of data, but those limits are untested.
-
-The upside of keeping everything is that you can have historical views of your
-data, which is very powerful. This also means that applications should turn this
-off when not relevant[^no-history].
-
-[sqlite-limits]: https://sqlite.org/limits.html
-[^no-history]: Similar to
- [Datomic's `:db/noHistory`](https://docs.datomic.com/cloud/best.html#nohistory-for-high-churn).
-
-### Syncable
-
-This is a frequent topic when talking about offline-first solutions. When
-building applications that:
-
-- can fully work offline,
-- stores data,
-- propagates that data to other application instances,
-
-then you'll need a conflict resolution strategy to handle all the situations
-where different application instances disagree. Those application instances
-could be a desktop and a browser version of the same application, or the same
-mobile app in different devices.
-
-A three-way merge seems to be the best approach, on top of which you could add
-application specific conflict resolution functions, like:
-
-- pick the change with higher timestamp;
-- if one change is a delete, pick it;
-- present the diff on the screen and allow the user to merge them.
-
-Some databases try to make this "easy", by choosing a strategy for you, but I've
-found that different applications require different conflict resolution
-strategies. Instead, the database should leave this up to the user to decide,
-and provide tools for them to do it.
-
-[**Three-way merges in version control**][3-way-merge] are the best example,
-performing automatic merges when possible and asking the user to resolve
-conflicts when they appear.
-
-The unit of conflict for a version control system is a line of text. The
-database equivalent would probably be a single attribute, not a full entity or a
-full row.
-
-Making all the conflict resolution logic be local should allow the database to
-have encrypted remotes similar to how [git-remote-gcrypt][git-remote-gcrypt]
-adds this functionality to Git. This would enable users to sync the application
-data across devices using an untrusted intermediary.
-
-[3-way-merge]: https://en.wikipedia.org/wiki/Merge_(version_control)
-[git-remote-gcrypt]: https://spwhitton.name/tech/code/git-remote-gcrypt/
-
-### Relational
-
-I want the power of relational queries on the client applications.
-
-Most of the arguments against traditional table-oriented relational databases
-are related to write performance, but those don't apply here. The bottlenecks
-for client applications usually aren't write throughput. Nobody is interested in
-differentiating between 1 MB/s or 10 MB/s when you're limited to 500 MB total.
-
-The relational model of the database could either be based on SQL and tables
-like in SQLite, or maybe [datalog][datalog] and [datoms][datoms] like in
-Datomic.
-
-[datalog]: https://docs.datomic.com/on-prem/query.html
-[datoms]: https://docs.datomic.com/cloud/whatis/data-model.html#datoms
-
-## From aspects to values
-
-Now let's try to translate the aspects above into values, as suggested by Bryan
-Cantrill.
-
-### Portability
-
-Being able to target so many different platforms is a bold goal, and the
-embedded nature of the database demands portability to be a core value.
-
-### Integrity
-
-When the local database becomes the source of truth of the application, it must
-provide consistency guarantees that enables applications to rely on it.
-
-### Expressiveness
-
-The database should empower applications to slice and dice the data in any way
-it wants to.
-
-## Next steps
-
-Since I can't find any database that fits these requirements, I've finally come
-to terms with doing it myself.
-
-It's probably going to take me a few years to do it, and making it portable
-between POSIX and IndexedDB will probably be the biggest challenge. I got myself
-a few books on databases to start.
-
-I wonder if I'll ever be able to get this done.
-
-## External links
-
-See discussions on [Reddit][reddit], [lobsters][lobsters], [HN][hn] and
-[a lengthy email exchange][lengthy-email].
-
-[reddit]: https://www.reddit.com/r/programming/comments/ijwz5b/the_database_i_wish_i_had/
-[lobsters]: https://lobste.rs/s/m9vkg4/database_i_wish_i_had
-[hn]: https://news.ycombinator.com/item?id=24337244
-[lengthy-email]: https://lists.sr.ht/~euandreh/public-inbox/%3C010101744a592b75-1dce9281-f0b8-4226-9d50-fd2c7901fa72-000000%40us-west-2.amazonses.com%3E