aboutsummaryrefslogtreecommitdiff
path: root/_posts/2020-08-31-the-database-i-wish-i-had.md
blob: 30a3f9b26a4b885aea86a9d85ccf9e5ddba65200 (about) (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
---
title: The database I wish I had
date: 2020-08-31
layout: post
lang: en
ref: the-database-i-wish-i-had
---

I watched the talk
"[Platform as a Reflection of Values: Joyent, Node.js and beyond][platform-values]"
by Bryan Cantrill, and I think he was able to put into words something I already
felt for some time: if there's no piece of software out there that reflects your
values, it's time for you to build that software[^talk-time].

[platform-values]: https://vimeo.com/230142234
[^talk-time]: At the very end, at time 29:49. When talking about the draft of
    this article with a friend, he noted that Bryan O'Sullivan (a different
    Bryan) says a similar thing on his talk
    "[Running a startup on Haskell](https://www.youtube.com/watch?v=ZR3Jirqk6W8)",
    at time FIXME.

I kind of agree with what he said, because this is already happening to me. I
long for a database with a certain set of values, and for a few years I was just
waiting for someone to finally write it. After watching his talk, Bryan is
saying to me: "time to stop waiting, and start writing it yourself".

So let me try to give an overview of such database, and go over it's values.

## Overview

I want a database that allows me to create decentralized client-side
applications that can sync data.

The best one-line description I can give right now is:

> It's sort of like PouchDB, Git, Datomic, SQLite and Mentat.

A more descriptive version could be:

> An embedded, immutable, syncable relational database.

Let's go over what I mean by each of those aspects one by one.

### Embedded

I think the server-side database landscape is diverse and mature enough for
my needs (even though I end up choosing SQLite most of the time), and what I'm
after is a database to be embedded on client-side applications itself, be it
desktop, browser, mobile, etc.

The purpose of such database is not to keep some local cache of data in case of
lost connectivity: we have good solutions for that already. It should serve as
the source of truth, and allow the application to work on top of it.

[**SQLite**][sqlite] is a great example of that: it is a very powerful
relational database that runs [almost anywhere][sqlite-whentouse]. What I miss
from it that SQLite doesn't provide is the ability to run it on the browser:
even though you could compile it to WebAssembly, it assumes a POSIX filesystem
that would have to be emulated.

[sqlite]: https://sqlite.org/index.html
[sqlite-whentouse]: https://sqlite.org/whentouse.html

[**PouchDB**][pouchdb] is another great example: it's a full reimplementation of
[CouchDB][couchdb] that targets JavaScript environments, mainly the browser and
Node.js. However I want a tool that can be deployed anywhere, and not limit it's
applications to places that already have a JavaScript runtime environment, or
force the developer to bundle a JavaScript runtime environment with their
application. This is true for GTK+ applications, command line programs, Android
apps, etc.

[pouchdb]: https://pouchdb.com/
[couchdb]: https://couchdb.apache.org/

[**Mentat**][mentat] was an interesting project, but it's reliance on SQLite
makes it inherit most of the downsides (and benefits too) of SQLite itself.

[mentat]: https://github.com/mozilla/mentat

Having such a requirement imposes a different approach to storage: we have to
decouple the knowledge about the intricansies of storage from the usage of
storage itself, so that a module (say query processing) can access storage
through an API without needing to know about it's implementation. This allows
the database to target a POSIX filesystems storage API and an IndexedDB storage
API, and make the rest of the code agnostic about storage. PouchDB has such
mechanism (called [adapters][pouchdb-adapters]) and Datomic has them too (called
[storage services][datomic-storage-services]).

[pouchdb-adapters]: https://pouchdb.com/adapters.html
[datomic-storage-services]: https://docs.datomic.com/on-prem/storage.html

This would allow the database to adapt to where it is embedded: when targetting
the browser the IndexedDB storage API would provide the persistence layer
that the database requires, and similarly the POSIX filesystem storage API would
provide the persistence layer when targetting POSIX systems (like desktops,
mobile, etc.).

But there's also an extra restriction that comes from by being embedded: it
needs to provide and embeddable artifact, most likely a binary library object
that exposes a C compatible FFI, similar to
[how SQLite does][sqlite-amalgamation]. Bundling a full runtime environment is
possible, but doesn't make it a compelling solution for embedding. This rules
out most languages, and leaves us with C, Rust, Zig, and similar options that
can target POSIX systems and WebAssembly.

[sqlite-amalgamation]: https://www.sqlite.org/amalgamation.html

### Immutable

Being immutable means that only new information is added, no in-place update
ever happens, and nothing is ever deleted.

Having an immutable database presents us with similar trade-offs found in
persistent data structures, like lack of coordination when doing reads, caches
being always coherent, and more usage of space.

[**Datomic**][datomic] is the go to database example of this: it will only add
information (datoms) and allows you to query them in a multitude of ways. Stuart
Halloway calls it "accumulate-only" over "append-only"[^accumulate-only]:

> It's accumulate-only, it is not append-only. So append-only, most people when
> they say that they're impliying something physical about what happens.

[datomic]: https://www.datomic.com/
[^accumulate-only]: Video "[Day of Datomic Part 2](https://vimeo.com/116315075)"
    on Datomic's infomation model, at time 12:28.

Also a database can be append-only and overwrite existing information with new
information, by doing clean-ups of "stale" data. I prefer to adopt the
"accumulate-only" naming and approach.

[**Git**][git] is another example of this: new commits are always added on top
of the previous data, and it grows by adding commits instead of replacing
existing ones.

[git]: https://git-scm.com/

Git repositories can only grow in size, and that is not only an acceptable
condition, but also one of the reasons to use it.

All this means that no in-place updates happens on data, and the database will
be much more concerned about how compact and efficiently it stores data than how
fast it does writes to disk. Being embedded, the storage limitation is either a)
how much storage the device has or b) how much storage was designed for the
application to consume. So even though the database could theoretically operate
with hundreds of TBs, a browser page or mobile application wouldn't have access
to this amount of storage. SQLite even [says][sqlite-limits] that it does
support approximately 280 TBs of data, but those limits are untested.

The upside of keeping everything is that you can have historical views of your
data, which is very powerful. This also means that applications should turn this
off when not relevant[^no-history].

[sqlite-limits]: https://sqlite.org/limits.html
[^no-history]: Similar to
    [Datomic's `:db/noHistory`](https://docs.datomic.com/cloud/best.html#nohistory-for-high-churn).

### Syncable

This is a frequent topic when talking about offline-first solutions. When
building applications that:

- can fully work offline,
- stores data,
- propagates that data to other application instances,

then you'll need a conflict resolution strategy to handle all the situations
where different application instances disagree. Those application instances
could be a desktop and a browser version of the same application, or the same
mobile app in different devices.

A three-way merge seems to be the best approach, on top of which you could add
application specific conflict resolution functions, like:

- pick the change with higher timestamp;
- if one change is a delete, pick it;
- present the diff on the screen and allow the user to merge them.

Some databases try to make this "easy", by choosing a strategy for you, but I've
found that different applications require different conflict resolution
strategies. Instead, the database should leave this up to the user to decide,
and provide tool for them to do it.

[**Three-way merges in version control**][3-way-merge] are the best example,
performing automatic merges when possible and asking the user to resolve
conflicts when they appear.

The unit of conflict for a version control system is a line of text. The
database equivalent would probably be a single attribute, not a full entity or a
full row.

Making all the conflict resolution logic be local should allow the database to
have encrypted remotes similar to how [git-remote-gcrypt][git-remote-gcrypt]
adds this functionality to Git. This would enable users to sync the application
data across devices using an untrusted intermediary.

[3-way-merge]: https://en.wikipedia.org/wiki/Merge_(version_control)
[git-remote-gcrypt]: https://spwhitton.name/tech/code/git-remote-gcrypt/

### Relational

I want the power of relational queries on the client applications.

Most of the arguments against traditional table-oriented relational databases
are related to write performance, but those don't apply here. The bottlenecks
for client applications usually aren't write throughput. Nobody is interested in
differentiating between 1 MB/s or 10 MB/s when you're limited to 500 MB total.

The relational model of the database could either be based on SQL and tables
like in SQLite, or maybe [datalog][datalog] and [datoms][datoms] like in
Datomic.

[datalog]: https://docs.datomic.com/on-prem/query.html
[datoms]: https://docs.datomic.com/cloud/whatis/data-model.html#datoms

## From aspects to values

Now let's try to translate the aspects above into values, as suggested by Bryan
Cantrill.

### Portability

Being able to target so many differente platforms is a bold goal, and the
embedded nature of the database demands portability to be a core value.

### Integrity

When the local database becomes the source of truth of the application, it must
provide consistency guarantees that enables applications to rely on it.

### Expressiveness

The database should empower applications to slice and dice the data in any way
it wants to.

## Next steps

Since I can't find any database that fits these requirements, I've finally come
to terms with doing it myself.

It's probably going to take me a few years to do it, and making it portable
between POSIX and IndexedDB will probably be the biggest challenge. I got myself
a few books on databases to start.

I wonder if I'll ever be able to get this done.