summaryrefslogtreecommitdiff
path: root/src/content/blog/2020/08/31/database-i-wish-i-had.adoc
diff options
context:
space:
mode:
Diffstat (limited to 'src/content/blog/2020/08/31/database-i-wish-i-had.adoc')
-rw-r--r--src/content/blog/2020/08/31/database-i-wish-i-had.adoc257
1 files changed, 257 insertions, 0 deletions
diff --git a/src/content/blog/2020/08/31/database-i-wish-i-had.adoc b/src/content/blog/2020/08/31/database-i-wish-i-had.adoc
new file mode 100644
index 0000000..443a54e
--- /dev/null
+++ b/src/content/blog/2020/08/31/database-i-wish-i-had.adoc
@@ -0,0 +1,257 @@
+= The database I wish I had
+
+:empty:
+:values-talk: https://vimeo.com/230142234
+:haskell-startup: https://www.youtube.com/watch?v=ZR3Jirqk6W8
+
+I watched the talk "{values-talk}[Platform as a Reflection of Values: Joyent,
+Node.js and beyond]" by Bryan Cantrill, and I think he was able to put into
+words something I already felt for some time: if there's no piece of software
+out there that reflects your values, it's time for you to build that
+software{empty}footnote:talk-time[
+ At the very end, at time 29:49. When talking about the draft of this article
+ with a friend, he noted that Bryan O'Sullivan (a different Bryan) says a
+ similar thing on his talk "{haskell-startup}[Running a startup on Haskell]",
+ at time 4:15.
+].
+
+I kind of agree with what he said, because this is already happening to me. I
+long for a database with a certain set of values, and for a few years I was just
+waiting for someone to finally write it. After watching his talk, Bryan is
+saying to me: "time to stop waiting, and start writing it yourself".
+
+So let me try to give an overview of such database, and go over its values.
+
+== Overview
+
+I want a database that allows me to create decentralized client-side
+applications that can sync data.
+
+The best one-line description I can give right now is:
+
+____
+It's sort of like PouchDB, Git, Datomic, SQLite and Mentat.
+____
+
+A more descriptive version could be:
+
+____
+An embedded, immutable, syncable relational database.
+____
+
+Let's go over what I mean by each of those aspects one by one.
+
+=== Embedded
+
+:sqlite: https://sqlite.org/index.html
+:sqlite-whentouse: https://sqlite.org/whentouse.html
+:pouchdb: https://pouchdb.com/
+:couchdb: https://couchdb.apache.org/
+:mentat: https://github.com/mozilla/mentat
+:pouchdb-adapters: https://pouchdb.com/adapters.html
+:datomic-storage-services: https://docs.datomic.com/on-prem/storage.html
+:sqlite-amalgamation: https://www.sqlite.org/amalgamation.html
+
+I think the server-side database landscape is diverse and mature enough for my
+needs (even though I end up choosing SQLite most of the time), and what I'm
+after is a database to be embedded on client-side applications itself, be it
+desktop, browser, mobile, _etc._
+
+The purpose of such database is not to keep some local cache of data in case of
+lost connectivity: we have good solutions for that already. It should serve as
+the source of truth, and allow the application to work on top of it.
+
+{sqlite}[*SQLite*] is a great example of that: it is a very powerful relational
+database that runs {sqlite-whentouse}[almost anywhere]. What I miss from it
+that SQLite doesn't provide is the ability to run it on the browser: even though
+you could compile it to WebAssembly, [line-through]#it assumes a POSIX
+filesystem that would have to be emulated#[multiblock footnote omitted FIXME].
+
+{pouchdb}[*PouchDB*] is another great example: it's a full reimplementation of
+{couchdb}[CouchDB] that targets JavaScript environments, mainly the browser and
+Node.js. However I want a tool that can be deployed anywhere, and not limit its
+applications to places that already have a JavaScript runtime environment, or
+force the developer to bundle a JavaScript runtime environment with their
+application. This is true for GTK+ applications, command line programs, Android
+apps, _etc._
+
+{mentat}[*Mentat*] was an interesting project, but its reliance on SQLite makes
+it inherit most of the downsides (and benefits too) of SQLite itself.
+
+Having such a requirement imposes a different approach to storage: we have to
+decouple the knowledge about the intricacies of storage from the usage of
+storage itself, so that a module (say query processing) can access storage
+through an API without needing to know about its implementation. This allows
+the database to target a POSIX filesystems storage API and an IndexedDB storage
+API, and make the rest of the code agnostic about storage. PouchDB has such
+mechanism (called {pouchdb-adapters}[adapters]) and Datomic has them too (called
+{datomic-storage-services}[storage services]).
+
+This would allow the database to adapt to where it is embedded: when targeting
+the browser the IndexedDB storage API would provide the persistence layer that
+the database requires, and similarly the POSIX filesystem storage API would
+provide the persistence layer when targeting POSIX systems (like desktops,
+mobile, _etc._).
+
+But there's also an extra restriction that comes from by being embedded: it
+needs to provide and embeddable artifact, most likely a binary library object
+that exposes a C compatible FFI, similar to {sqlite-amalgamation}[how SQLite
+does]. Bundling a full runtime environment is possible, but doesn't make it a
+compelling solution for embedding. This rules out most languages, and leaves
+us with C, Rust, Zig, and similar options that can target POSIX systems and
+WebAssembly.
+
+=== Immutable
+
+:datomic: https://www.datomic.com/
+:day-of-datomic: https://vimeo.com/116315075
+:git: https://git-scm.com/
+:sqlite-limits: https://sqlite.org/limits.html
+:datomic-no-history: https://docs.datomic.com/cloud/best.html#nohistory-for-high-churn
+
+Being immutable means that only new information is added, no in-place update
+ever happens, and nothing is ever deleted.
+
+Having an immutable database presents us with similar trade-offs found in
+persistent data structures, like lack of coordination when doing reads, caches
+being always coherent, and more usage of space.
+
+{datomic}[*Datomic*] is the go to database example of this: it will only add
+information (datoms) and allows you to query them in a multitude of ways.
+Stuart Halloway calls it "accumulate-only" over
+"append-only"{empty}footnote:accumulate-only[
+ Video "{day-of-datomic}[Day of Datomic Part 2]" on Datomic's information
+ model, at time 12:28.
+]:
+
+____
+It's accumulate-only, it is not append-only. So append-only, most people when
+they say that they're implying something physical about what happens.
+____
+
+Also a database can be append-only and overwrite existing information with new
+information, by doing clean-ups of "stale" data. I prefer to adopt the
+"accumulate-only" naming and approach.
+
+{git}[*Git*] is another example of this: new commits are always added on top of
+the previous data, and it grows by adding commits instead of replacing existing
+ones.
+
+Git repositories can only grow in size, and that is not only an acceptable
+condition, but also one of the reasons to use it.
+
+All this means that no in-place updates happens on data, and the database will
+be much more concerned about how compact and efficiently it stores data than how
+fast it does writes to disk. Being embedded, the storage limitation is either
+a) how much storage the device has or b) how much storage was designed for the
+application to consume. So even though the database could theoretically operate
+with hundreds of TBs, a browser page or mobile application wouldn't have access
+to this amount of storage. SQLite even {sqlite-limits}[says] that it does
+support approximately 280 TBs of data, but those limits are untested.
+
+The upside of keeping everything is that you can have historical views of your
+data, which is very powerful. This also means that applications should turn
+this off when not
+relevant{empty}footnote:no-history[
+ Similar to {datomic-no-history}[Datomic's `:db/noHistory`].
+].
+
+=== Syncable
+
+:3-way-merge: https://en.wikipedia.org/wiki/Merge_(version_control)
+:git-remote-gcrypt: https://spwhitton.name/tech/code/git-remote-gcrypt/
+
+This is a frequent topic when talking about offline-first solutions. When
+building applications that:
+
+* can fully work offline,
+* stores data,
+* propagates that data to other application instances,
+
+then you'll need a conflict resolution strategy to handle all the situations
+where different application instances disagree. Those application instances
+could be a desktop and a browser version of the same application, or the same
+mobile app in different devices.
+
+A three-way merge seems to be the best approach, on top of which you could add
+application specific conflict resolution functions, like:
+
+* pick the change with higher timestamp;
+* if one change is a delete, pick it;
+* present the diff on the screen and allow the user to merge them.
+
+Some databases try to make this "easy", by choosing a strategy for you, but I've
+found that different applications require different conflict resolution
+strategies. Instead, the database should leave this up to the user to decide,
+and provide tools for them to do it.
+
+{3-way-merge}[*Three-way merges in version control*] are the best example,
+performing automatic merges when possible and asking the user to resolve
+conflicts when they appear.
+
+The unit of conflict for a version control system is a line of text. The
+database equivalent would probably be a single attribute, not a full entity or a
+full row.
+
+Making all the conflict resolution logic be local should allow the database to
+have encrypted remotes similar to how {git-remote-gcrypt}[git-remote-gcrypt]
+adds this functionality to Git. This would enable users to sync the application
+data across devices using an untrusted intermediary.
+
+=== Relational
+
+:datomic-datalog: https://docs.datomic.com/on-prem/query.html
+:datomic-model: https://docs.datomic.com/cloud/whatis/data-model.html#datoms
+
+I want the power of relational queries on the client applications.
+
+Most of the arguments against traditional table-oriented relational databases
+are related to write performance, but those don't apply here. The bottlenecks
+for client applications usually aren't write throughput. Nobody is interested
+in differentiating between 1 MB/s or 10 MB/s when you're limited to 500 MB
+total.
+
+The relational model of the database could either be based on SQL and tables
+like in SQLite, or maybe {datomic-datalog}[datalog] and {datomic-model}[datoms]
+like in Datomic.
+
+== From aspects to values
+
+Now let's try to translate the aspects above into values, as suggested by Bryan
+Cantrill.
+
+=== Portability
+
+Being able to target so many different platforms is a bold goal, and the
+embedded nature of the database demands portability to be a core value.
+
+=== Integrity
+
+When the local database becomes the source of truth of the application, it must
+provide consistency guarantees that enables applications to rely on it.
+
+=== Expressiveness
+
+The database should empower applications to slice and dice the data in any way
+it wants to.
+
+== Next steps
+
+Since I can't find any database that fits these requirements, I've finally come
+to terms with doing it myself.
+
+It's probably going to take me a few years to do it, and making it portable
+between POSIX and IndexedDB will probably be the biggest challenge. I got
+myself a few books on databases to start.
+
+I wonder if I'll ever be able to get this done.
+
+== External links
+
+:reddit: https://old.reddit.com/r/programming/comments/ijwz5b/the_database_i_wish_i_had/
+:lobsters: https://lobste.rs/s/m9vkg4/database_i_wish_i_had
+:hn: https://news.ycombinator.com/item?id=24337244
+:list: https://lists.sr.ht/~euandreh/public-inbox/%3C010101744a592b75-1dce9281-f0b8-4226-9d50-fd2c7901fa72-000000%40us-west-2.amazonses.com%3E
+
+See discussions on {reddit}[Reddit], {lobsters}[lobsters], {hn}[HN] and {list}[a
+lengthy email exchange].