aboutsummaryrefslogtreecommitdiff
path: root/locale/eo/LC_MESSAGES/_articles/2020-11-12-durable-persistent-trees-and-parser-combinators-building-a-database.po
blob: 262cf16432839913b2950e3aa9aca210878d6032 (about) (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
#
msgid ""
msgstr ""

msgid ""
"title: Durable persistent trees and parser combinators - building a database"
msgstr ""

msgid "date: 2020-11-12"
msgstr ""

msgid "layout: post"
msgstr ""

msgid "lang: en"
msgstr ""

msgid ""
"ref: durable-persistent-trees-and-parser-combinators-building-a-database"
msgstr ""

msgid ""
"I've received with certain frequency messages from people wanting to know if"
" I've made any progress on the database project [I've written about]({% link"
" _articles/2020-08-31-the-database-i-wish-i-had.md %})."
msgstr ""

msgid ""
"There are a few areas where I've made progress, and here's a public post on "
"it."
msgstr ""

msgid "Proof-of-concept: DAG log"
msgstr ""

msgid ""
"The main thing I wanted to validate with a concrete implementation was the "
"concept of modeling a DAG on a sequence of datoms."
msgstr ""

msgid ""
"The notion of a *datom* is a rip-off from Datomic, which models data with "
"time aware *facts*, which come from RDF. RDF's fact is a triple of subject-"
"predicate-object, and Datomic's datoms add a time component to it: subject-"
"predicate-object-time, A.K.A. entity-attribute-value-transaction:"
msgstr ""

msgid ""
"[[person :likes \"pizza\" 0 true]\n"
" [person :likes \"bread\" 1 true]\n"
" [person :likes \"pizza\" 1 false]]\n"
msgstr ""

msgid "The above datoms say:"
msgstr ""

msgid "at time 0, `person` like pizza;"
msgstr ""

msgid "at time 1, `person` stopped liking pizza, and started to like bread."
msgstr ""

msgid ""
"Datomic ensures total consistency of this ever growing log by having a "
"single writer, the transactor, that will enforce it when writing."
msgstr ""

msgid ""
"In order to support disconnected clients, I needed a way to allow multiple "
"writers, and I chose to do it by making the log not a list, but a directed "
"acyclic graph (DAG):"
msgstr ""

msgid ""
"The extra datoms above add more information to build the directionality to "
"the log, and instead of a single consistent log, the DAG could have multiple"
" leaves that coexist, much like how different Git branches can have "
"different \"latest\" commits."
msgstr ""

msgid ""
"In order to validate this idea, I started with a Clojure implementation. The"
" goal was not to write the actual final code, but to make a proof-of-concept"
" that would allow me to test and stretch the idea itself."
msgstr ""

msgid ""
"This code [already "
"exists](https://git.euandreh.xyz/mediator/tree/src/core/clojure/src/mediator.clj?id=db4a727bc24b54b50158827b34502de21dbf8948#n1),"
" but is yet fairly incomplete:"
msgstr ""

msgid ""
"the building of the index isn't done yet (with some [commented "
"code](https://git.euandreh.xyz/mediator/tree/src/core/clojure/src/mediator.clj?id=db4a727bc24b54b50158827b34502de21dbf8948#n295)"
" on the next step to be implemented)"
msgstr ""

msgid ""
"the indexing is extremely inefficient, with "
"[more](https://git.euandreh.xyz/mediator/tree/src/core/clojure/src/mediator.clj?id=db4a727bc24b54b50158827b34502de21dbf8948#n130)"
" "
"[than](https://git.euandreh.xyz/mediator/tree/src/core/clojure/src/mediator.clj?id=db4a727bc24b54b50158827b34502de21dbf8948#n146)"
" "
"[one](https://git.euandreh.xyz/mediator/tree/src/core/clojure/src/mediator.clj?id=db4a727bc24b54b50158827b34502de21dbf8948#n253)"
" occurrence of `O²` functions;"
msgstr ""

msgid "no query support yet."
msgstr ""

msgid "Top-down *and* bottom-up"
msgstr ""

msgid ""
"However, as time passed and I started looking at what the final "
"implementation would look like, I started to consider keeping the PoC "
"around."
msgstr ""

msgid ""
"The top-down approach (Clojure PoC) was in fact helping guide me with the "
"bottom-up, and I now have \"promoted\" the Clojure PoC into a \"reference "
"implementation\". It should now be a finished implementation that says what "
"the expected behaviour is, and the actual code should match the behaviour."
msgstr ""

msgid ""
"The good thing about a reference implementation is that it has no "
"performance of resources boundary, so if it ends up being 1000x slower and "
"using 500× more memory, it should be find. The code can be also 10x or 100x "
"simpler, too."
msgstr ""

msgid "Top-down: durable persistent trees"
msgstr ""

msgid ""
"In promoting the PoC into a reference implementation, this top-down approach"
" now needs to go beyond doing everything in memory, and the index data "
"structure now needs to be disk-based."
msgstr ""

msgid ""
"Roughly speaking, most storage engines out there are based either on B-Trees"
" or LSM Trees, or some variations of those."
msgstr ""

msgid ""
"But when building an immutable database, update-in-place B-Trees aren't an "
"option, as it doesn't accommodate keeping historical views of the tree. LSM "
"Trees may seem a better alternative, but duplication on the files with "
"compaction are also ways to delete old data which is indeed useful for a "
"historical view."
msgstr ""

msgid ""
"I think the thing I'm after is a mix of a Copy-on-Write B-Tree, which would "
"keep historical versions with the write IO cost amortization of memtables of"
" LSM Trees. I don't know of any B-Tree variant out there that resembles "
"this, so I'll call it \"Flushing Copy-on-Write B-Tree\"."
msgstr ""

msgid ""
"I haven't written any code for this yet, so all I have is a high-level view "
"of what it will look like:"
msgstr ""

msgid ""
"like Copy-on-Write B-Trees, changing a leaf involves creating a new leaf and"
" building a new path from root to the leaf. The upside is that writes a lock"
" free, and no coordination is needed between readers and writers, ever;"
msgstr ""

msgid ""
"the downside is that a single leaf update means at least `H` new nodes that "
"will have to be flushed to disk, where `H` is the height of the tree. To "
"avoid that, the writer creates these nodes exclusively on the in-memory "
"memtable, to avoid flushing to disk on every leaf update;"
msgstr ""

msgid ""
"a background job will consolidate the memtable data every time it hits X MB,"
" and persist it to disk, amortizing the cost of the Copy-on-Write B-Tree;"
msgstr ""

msgid ""
"readers than will have the extra job of getting the latest relevant disk-"
"resident value and merge it with the memtable data."
msgstr ""

msgid ""
"The key difference to existing Copy-on-Write B-Trees is that the new trees "
"are only periodically written to disk, and the intermediate values are kept "
"in memory. Since no node is ever updated, the page utilization is maximum as"
" it doesn't need to keep space for future inserts and updates."
msgstr ""

msgid ""
"And the key difference to existing LSM Trees is that no compaction is run: "
"intermediate values are still relevant as the database grows. So this leaves"
" out tombstones and value duplication done for write performance."
msgstr ""

msgid ""
"One can delete intermediate index values to reclaim space, but no data is "
"lost on the process, only old B-Tree values. And if the database ever comes "
"back to that point (like when doing a historical query), the B-Tree will "
"have to be rebuilt from a previous value. After all, the database *is* a set"
" of datoms, and everything else is just derived data."
msgstr ""

msgid ""
"Right now I'm still reading about other data structures that storage engines"
" use, and I'll start implementing the \"Flushing Copy-on-Write B-Tree\" as I"
" learn more[^learn-more-db] and mature it more."
msgstr ""

msgid ""
"[^learn-more-db]: If you are interested in learning more about this too, the"
" very best two resources on this subject are Andy Pavlo's \"[Intro to "
"Database "
"Systems](https://www.youtube.com/playlist?list=PLSE8ODhjZXjbohkNBWQs_otTrBTrjyohi)\""
" course and Alex Petrov's \"[Database "
"Internals](https://www.databass.dev/)\" book."
msgstr ""

msgid "Bottom-up: parser combinators and FFI"
msgstr ""

msgid "I chose Rust as it has the best WebAssembly tooling support."
msgstr ""

msgid ""
"My goal is not to build a Rust database, but a database that happens to be "
"in Rust. In order to reach client platforms, the primary API is the FFI one."
msgstr ""

msgid ""
"I'm not very happy with current tools for exposing Rust code via FFI to the "
"external world: they either mix C with C++, which I don't want to do, or "
"provide no access to the intermediate representation of the FFI, which would"
" be useful for generating binding for any language that speaks FFI."
msgstr ""

msgid ""
"I like better the path that the author of "
"[cbindgen](https://github.com/eqrion/cbindgen) crate "
"[proposes](https://blog.eqrion.net/future-directions-for-cbindgen/): "
"emitting an data representation of the Rust C API (the author calls is a "
"`ffi.json` file), and than building transformers from the data "
"representation to the target language. This way you could generate a C API "
"*and* the node-ffi bindings for JavaScript automatically from the Rust code."
msgstr ""

msgid ""
"So the first thing to be done before moving on is an FFI exporter that "
"doesn't mix C and C++, and generates said `ffi.json`, and than build a few "
"transformers that take this `ffi.json` and generate the language bindings, "
"be it C, C++, JavaScript, TypeScript, Kotlin, Swift, Dart, *etc*[^ffi-"
"langs]."
msgstr ""

msgid ""
"[^ffi-langs]: Those are, specifically, the languages I'm more interested on."
" My goal is supporting client applications, and those languages are the most"
" relevant for doing so: C for GTK, C++ for Qt, JavaScript and TypeScript for"
" Node.js and browser, Kotlin for Android and Swing, Swift for iOS, and Dart "
"for Flutter."
msgstr ""

msgid ""
"I think the best way to get there is by taking the existing code for "
"cbindgen, which uses the [syn](https://github.com/dtolnay/syn) crate to "
"parse the Rust code[^rust-syn], and adapt it to emit the metadata."
msgstr ""

msgid ""
"[^rust-syn]: The fact that syn is an external crate to the Rust compiler "
"points to a big warning: procedural macros are not first class in Rust. They"
" are just like Babel plugins in JavaScript land, with the extra shortcoming "
"that there is no specification for the Rust syntax, unlike JavaScript."
msgstr ""

msgid ""
"As flawed as this may be, it seems to be generally acceptable and adopted,\n"
"which works against building a solid ecosystem for Rust.\n"
"\n"
"The alternative that rust-ffi implements relies on internals of the Rust\n"
"compiler, which isn't actually worst, just less common and less accepted.\n"
msgstr ""

msgid ""
"After \"finishing\" ParsecC I'll have a good notion of what a good C API is,"
" and I'll have a better direction towards how to expose code from libedn to "
"other languages, and work on x-bindgen then."
msgstr ""

msgid ""
"What both libedn and ParsecC are missing right now are proper error "
"reporting, and property-based testing for libedn."
msgstr ""

msgid "Conclusion"
msgstr ""

msgid ""
"I've learned a lot already, and I feel the journey I'm on is worth going "
"through."
msgstr ""

msgid ""
"If any of those topics interest you, message me to discuss more or "
"contribute! Patches welcome!"
msgstr ""

msgid "eu_categories: mediator"
msgstr ""

msgid ""
"[[person :likes  \"pizza\"  0 true]\n"
" [0      :parent :db/root 0 true]\n"
" [person :likes  \"bread\"  1 true]\n"
" [person :likes  \"pizza\"  1 false]\n"
" [1      :parent 0        1 true]]\n"
msgstr ""

msgid ""
"I've started a fork of cbindgen: "
"[x-bindgen](https://git.euandreh.xyz/x-bindgen/). Right now it is just a "
"copy of cbindgen verbatim, and I plan to remove all C and C++ emitting code "
"from it, and add a IR emitting code instead."
msgstr ""

msgid ""
"When starting working on x-bindgen, I realized I didn't know what to look "
"for in a header file, as I haven't written any C code in many years. So as I"
" was writing [libedn](https://git.euandreh.xyz/libedn/), I didn't know how "
"to build a good C API to expose. So I tried porting the code to C, and right"
" now I'm working on building a *good* C API for a JSON parser using parser "
"combinators: ~~ParsecC~~ *EDIT*: now archived, the experimentation was fun."
msgstr ""

msgid "updated_at: 2021-02-09"
msgstr ""

#~ msgid "updated_at: 2020-11-14"
#~ msgstr ""

#~ msgid ""
#~ "When starting working on x-bindgen, I realized I didn't know what to look "
#~ "for in a header file, as I haven't written any C code in many years. So as I"
#~ " was writing [libedn](https://git.euandreh.xyz/libedn/), I didn't know how "
#~ "to build a good C API to expose. So I tried porting the code to C, and right"
#~ " now I'm working on building a *good* C API for a JSON parser using parser "
#~ "combinators: [ParsecC](https://git.euandreh.xyz/parsecc/)."
#~ msgstr ""

#~ msgid ""
#~ "I've started a fork of cbindgen: "
#~ "[x-bindgen](https://git.euandreh.xyz/x-bindgen/about). Right now it is just "
#~ "a copy of cbindgen verbatim, and I plan to remove all C and C++ emitting "
#~ "code from it, and add a IR emitting code instead."
#~ msgstr ""

#~ msgid ""
#~ "When starting working on x-bindgen, I realized I didn't know what to look "
#~ "for in a header file, as I haven't written any C code in many years. So as I"
#~ " was writing [libedn](https://git.euandreh.xyz/libedn/about), I didn't know "
#~ "how to build a good C API to expose. So I tried porting the code to C, and "
#~ "right now I'm working on building a *good* C API for a JSON parser using "
#~ "parser combinators: [ParsecC](https://git.euandreh.xyz/parsecc/about)."
#~ msgstr ""

#~ msgid ""
#~ "When starting working on x-bindgen, I realized I didn't know what to look "
#~ "for in a header file, as I haven't written any C code in many years. So as I"
#~ " was writing [libedn](https://git.euandreh.xyz/libedn/), I didn't know how "
#~ "to build a good C API to expose. So I tried porting the code to C, and right"
#~ " now I'm working on building a *good* C API for a JSON parser using parser "
#~ "combinators: [ParsecC](https://git.euandreh.xyz/parsecc/)."
#~ msgstr ""

#~ msgid ""
#~ "[[person :likes  \"pizza\" 0 true]\n"
#~ " [0      :parent null    0 true]\n"
#~ " [person :likes  \"bread\" 1 true]\n"
#~ " [person :likes  \"pizza\" 1 false]\n"
#~ " [1      :parent 0       1 true]]\n"
#~ msgstr ""

#~ msgid "category: mediator"
#~ msgstr ""