1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
|
---
title: The database I wish I had
date: 2020-08-31
updated_at: 2020-09-03
layout: post
lang: en
ref: the-database-i-wish-i-had
category: mediator
---
I watched the talk
"[Platform as a Reflection of Values: Joyent, Node.js and beyond][platform-values]"
by Bryan Cantrill, and I think he was able to put into words something I already
felt for some time: if there's no piece of software out there that reflects your
values, it's time for you to build that software[^talk-time].
[platform-values]: https://vimeo.com/230142234
[^talk-time]: At the very end, at time 29:49. When talking about the draft of
this article with a friend, he noted that Bryan O'Sullivan (a different
Bryan) says a similar thing on his talk
"[Running a startup on Haskell](https://www.youtube.com/watch?v=ZR3Jirqk6W8)",
at time 4:15.
I kind of agree with what he said, because this is already happening to me. I
long for a database with a certain set of values, and for a few years I was just
waiting for someone to finally write it. After watching his talk, Bryan is
saying to me: "time to stop waiting, and start writing it yourself".
So let me try to give an overview of such database, and go over its values.
## Overview
I want a database that allows me to create decentralized client-side
applications that can sync data.
The best one-line description I can give right now is:
> It's sort of like PouchDB, Git, Datomic, SQLite and Mentat.
A more descriptive version could be:
> An embedded, immutable, syncable relational database.
Let's go over what I mean by each of those aspects one by one.
### Embedded
I think the server-side database landscape is diverse and mature enough for
my needs (even though I end up choosing SQLite most of the time), and what I'm
after is a database to be embedded on client-side applications itself, be it
desktop, browser, mobile, *etc.*
The purpose of such database is not to keep some local cache of data in case of
lost connectivity: we have good solutions for that already. It should serve as
the source of truth, and allow the application to work on top of it.
[**SQLite**][sqlite] is a great example of that: it is a very powerful
relational database that runs [almost anywhere][sqlite-whentouse]. What I miss
from it that SQLite doesn't provide is the ability to run it on the browser:
even though you could compile it to WebAssembly, ~~it assumes a POSIX filesystem
that would have to be emulated~~[^posix-sqlite].
[sqlite]: https://sqlite.org/index.html
[sqlite-whentouse]: https://sqlite.org/whentouse.html
[^posix-sqlite]: It was [pointed out to me](https://news.ycombinator.com/item?id=24338881)
that SQLite doesn't assume the existence of a POSIX filesystem, as I wrongly
stated. Thanks for the correction.
This makes me consider it as a storage backend all by itself. I
initially considered having an SQLite storage backend as one implementation
of the POSIX filesystem storage API that I mentioned. My goal was to rely on
it so I could validate the correctness of the actual implementation, given
SQLite's robustness.
However it may even better to just use SQLite, and get an ACID backend
without recreating a big part of SQLite from scratch. In fact, both Datomic
and PouchDB didn't create an storage backend for themselves, they just
plugged on what already existed and already worked. I'm beginning to think
that it would be wiser to just do the same, and drop entirely the from
scratch implementation that I mentioned.
That's not to say that adding an IndexedDB compatibility layer to SQLite
would be enough to make it fit the other requirements I mention on this
page. SQLite still is an implementation of a update-in-place, SQL,
table-oriented database. It is probably true that cherry-picking the
relevant parts of SQLite (like storage access, consistency, crash recovery,
parser generator, *etc.*) and leaving out the unwanted parts (SQL, tables,
threading, *etc.*) would be better than including the full SQLite stack, but
that's simply an optimization. Both could even coexist, if desired.
SQLite would have to be treated similarly to how Datomic treats SQL
databases: instead of having a table for each entities, spread attributes
over the tables, *etc.*, it treats SQL databases as a key-value storage so it
doesn't have to re-implement interacting with the disk that other databases
do well.
The tables would contain blocks of binary data, so there isn't a difference
on how the SQLite storage backend behaves and how the IndexedDB storage
backend behaves, much like how Datomic works the same regardless of the
storage backend, same for PouchDB.
I welcome corrections on what I said above, too.
[**PouchDB**][pouchdb] is another great example: it's a full reimplementation of
[CouchDB][couchdb] that targets JavaScript environments, mainly the browser and
Node.js. However I want a tool that can be deployed anywhere, and not limit its
applications to places that already have a JavaScript runtime environment, or
force the developer to bundle a JavaScript runtime environment with their
application. This is true for GTK+ applications, command line programs, Android
apps, *etc.*
[pouchdb]: https://pouchdb.com/
[couchdb]: https://couchdb.apache.org/
[**Mentat**][mentat] was an interesting project, but its reliance on SQLite
makes it inherit most of the downsides (and benefits too) of SQLite itself.
[mentat]: https://github.com/mozilla/mentat
Having such a requirement imposes a different approach to storage: we have to
decouple the knowledge about the intricacies of storage from the usage of
storage itself, so that a module (say query processing) can access storage
through an API without needing to know about its implementation. This allows
the database to target a POSIX filesystems storage API and an IndexedDB storage
API, and make the rest of the code agnostic about storage. PouchDB has such
mechanism (called [adapters][pouchdb-adapters]) and Datomic has them too (called
[storage services][datomic-storage-services]).
[pouchdb-adapters]: https://pouchdb.com/adapters.html
[datomic-storage-services]: https://docs.datomic.com/on-prem/storage.html
This would allow the database to adapt to where it is embedded: when targeting
the browser the IndexedDB storage API would provide the persistence layer
that the database requires, and similarly the POSIX filesystem storage API would
provide the persistence layer when targeting POSIX systems (like desktops,
mobile, *etc.*).
But there's also an extra restriction that comes from by being embedded: it
needs to provide and embeddable artifact, most likely a binary library object
that exposes a C compatible FFI, similar to
[how SQLite does][sqlite-amalgamation]. Bundling a full runtime environment is
possible, but doesn't make it a compelling solution for embedding. This rules
out most languages, and leaves us with C, Rust, Zig, and similar options that
can target POSIX systems and WebAssembly.
[sqlite-amalgamation]: https://www.sqlite.org/amalgamation.html
### Immutable
Being immutable means that only new information is added, no in-place update
ever happens, and nothing is ever deleted.
Having an immutable database presents us with similar trade-offs found in
persistent data structures, like lack of coordination when doing reads, caches
being always coherent, and more usage of space.
[**Datomic**][datomic] is the go to database example of this: it will only add
information (datoms) and allows you to query them in a multitude of ways. Stuart
Halloway calls it "accumulate-only" over "append-only"[^accumulate-only]:
> It's accumulate-only, it is not append-only. So append-only, most people when
> they say that they're implying something physical about what happens.
[datomic]: https://www.datomic.com/
[^accumulate-only]: Video "[Day of Datomic Part 2](https://vimeo.com/116315075)"
on Datomic's information model, at time 12:28.
Also a database can be append-only and overwrite existing information with new
information, by doing clean-ups of "stale" data. I prefer to adopt the
"accumulate-only" naming and approach.
[**Git**][git] is another example of this: new commits are always added on top
of the previous data, and it grows by adding commits instead of replacing
existing ones.
[git]: https://git-scm.com/
Git repositories can only grow in size, and that is not only an acceptable
condition, but also one of the reasons to use it.
All this means that no in-place updates happens on data, and the database will
be much more concerned about how compact and efficiently it stores data than how
fast it does writes to disk. Being embedded, the storage limitation is either a)
how much storage the device has or b) how much storage was designed for the
application to consume. So even though the database could theoretically operate
with hundreds of TBs, a browser page or mobile application wouldn't have access
to this amount of storage. SQLite even [says][sqlite-limits] that it does
support approximately 280 TBs of data, but those limits are untested.
The upside of keeping everything is that you can have historical views of your
data, which is very powerful. This also means that applications should turn this
off when not relevant[^no-history].
[sqlite-limits]: https://sqlite.org/limits.html
[^no-history]: Similar to
[Datomic's `:db/noHistory`](https://docs.datomic.com/cloud/best.html#nohistory-for-high-churn).
### Syncable
This is a frequent topic when talking about offline-first solutions. When
building applications that:
- can fully work offline,
- stores data,
- propagates that data to other application instances,
then you'll need a conflict resolution strategy to handle all the situations
where different application instances disagree. Those application instances
could be a desktop and a browser version of the same application, or the same
mobile app in different devices.
A three-way merge seems to be the best approach, on top of which you could add
application specific conflict resolution functions, like:
- pick the change with higher timestamp;
- if one change is a delete, pick it;
- present the diff on the screen and allow the user to merge them.
Some databases try to make this "easy", by choosing a strategy for you, but I've
found that different applications require different conflict resolution
strategies. Instead, the database should leave this up to the user to decide,
and provide tools for them to do it.
[**Three-way merges in version control**][3-way-merge] are the best example,
performing automatic merges when possible and asking the user to resolve
conflicts when they appear.
The unit of conflict for a version control system is a line of text. The
database equivalent would probably be a single attribute, not a full entity or a
full row.
Making all the conflict resolution logic be local should allow the database to
have encrypted remotes similar to how [git-remote-gcrypt][git-remote-gcrypt]
adds this functionality to Git. This would enable users to sync the application
data across devices using an untrusted intermediary.
[3-way-merge]: https://en.wikipedia.org/wiki/Merge_(version_control)
[git-remote-gcrypt]: https://spwhitton.name/tech/code/git-remote-gcrypt/
### Relational
I want the power of relational queries on the client applications.
Most of the arguments against traditional table-oriented relational databases
are related to write performance, but those don't apply here. The bottlenecks
for client applications usually aren't write throughput. Nobody is interested in
differentiating between 1 MB/s or 10 MB/s when you're limited to 500 MB total.
The relational model of the database could either be based on SQL and tables
like in SQLite, or maybe [datalog][datalog] and [datoms][datoms] like in
Datomic.
[datalog]: https://docs.datomic.com/on-prem/query.html
[datoms]: https://docs.datomic.com/cloud/whatis/data-model.html#datoms
## From aspects to values
Now let's try to translate the aspects above into values, as suggested by Bryan
Cantrill.
### Portability
Being able to target so many different platforms is a bold goal, and the
embedded nature of the database demands portability to be a core value.
### Integrity
When the local database becomes the source of truth of the application, it must
provide consistency guarantees that enables applications to rely on it.
### Expressiveness
The database should empower applications to slice and dice the data in any way
it wants to.
## Next steps
Since I can't find any database that fits these requirements, I've finally come
to terms with doing it myself.
It's probably going to take me a few years to do it, and making it portable
between POSIX and IndexedDB will probably be the biggest challenge. I got myself
a few books on databases to start.
I wonder if I'll ever be able to get this done.
## External links
See discussions on [Reddit][reddit], [lobsters][lobsters], [HN][hn] and
[a lengthy email exchange][lengthy-email].
[reddit]: https://www.reddit.com/r/programming/comments/ijwz5b/the_database_i_wish_i_had/
[lobsters]: https://lobste.rs/s/m9vkg4/database_i_wish_i_had
[hn]: https://news.ycombinator.com/item?id=24337244
[lengthy-email]: https://lists.sr.ht/~euandreh/public-inbox/%3C010101744a592b75-1dce9281-f0b8-4226-9d50-fd2c7901fa72-000000%40us-west-2.amazonses.com%3E
|