aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--_articles/2021-04-29-a-relational-model-of-data-for-large-shared-data-banks-article-review.md130
1 files changed, 130 insertions, 0 deletions
diff --git a/_articles/2021-04-29-a-relational-model-of-data-for-large-shared-data-banks-article-review.md b/_articles/2021-04-29-a-relational-model-of-data-for-large-shared-data-banks-article-review.md
new file mode 100644
index 0000000..e15b478
--- /dev/null
+++ b/_articles/2021-04-29-a-relational-model-of-data-for-large-shared-data-banks-article-review.md
@@ -0,0 +1,130 @@
+---
+
+title: A Relational Model of Data for Large Shared Data Banks - article-review
+
+date: 2021-04-29
+
+layout: post
+
+lang: en
+
+ref: a-relational-model-of-data-for-large-shared-data-banks-article-review
+
+---
+
+This is a review of the article "[A Relational Model of Data for Large Shared Data Banks][codd-article]", by E. F. Codd.
+
+[codd-article]: https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
+
+## Data Independence
+
+Codd brings the idea of *data independence* as a better approach to use on databases.
+This is contrast with the existing approaches, namely hierarquical (tree-based) and network-based.
+
+His main argument is that queries in applications shouldn't depende and be coupled with how the data is represented internally by the database system.
+This key idea is very powerful, and something that we strive for in many other places: decoupling the interface from the implementation.
+
+If the database system has this separation, it can kep the querying interface stable, while having the freedom to change its internal representation at will, for better performance, less storage, etc.
+
+This is true for most modern database systems.
+They can change from B-Trees with leafs containing pointers to data, to B-Trees with leafs containing the raw data , to hash tables.
+All that without changing the query interface, only its performance.
+
+Codd mentions that, from an information representation standpoint, any index is a duplication, but useful for perfomance.
+
+This data independence also impacts ordering (a *relation* doesn't rely on the insertion order).
+
+## Duplicates
+
+His definition of relational data is a bit differente from most modern database systems, namely **no duplicate rows**.
+
+I couldn't find a reason behind this restriction, though.
+For practical purposes, I find it useful to have it.
+
+## Relational Data
+
+In the article, Codd doesn't try to define a language, and today's most popular one is SQL.
+
+However, there is no restriction that says that "SQL database" and "relational database" are synonyms.
+One could have a relational database without using SQL at all, and it would still be a relational one.
+
+The main one that I have in mind, and the reason that led me to reading this paper in the first place, is Datomic.
+
+Is uses an [edn]-based representation for datalog queries[^edn-queries], and a particular schema used to represent data.
+
+Even though it looks very weird when coming from SQL, I'd argue that it ticks all the boxes (except for "no duplicates") that defines a relational database, since building relations and applying operations on them is possible.
+
+Compare and contrast a contrived example of possible representations of SQL and datalog of the same data:
+
+```sql
+-- create schema
+CREATE TABLE people (
+ id UUID PRIMARY KEY,
+ name TEXT NOT NULL,
+ manager_id UUID,
+ FOREIGN KEY (manager_id) REFERENCES people (id)
+);
+
+-- insert data
+INSERT INTO people (id, name, manager_id) VALUES
+ ("d3f29960-ccf0-44e4-be66-1a1544677441", "Foo", "076356f4-1a0e-451c-b9c6-a6f56feec941"),
+ ("076356f4-1a0e-451c-b9c6-a6f56feec941", "Bar");
+
+-- query data, make a relation
+
+SELECT employees.name AS 'employee-name',
+ managers.name AS 'manager-name'
+FROM people employees
+INNER JOIN people managers ON employees.manager_id = managers.id;
+```
+
+{% raw %}
+```
+;; create schema
+#{ {:db/ident :person/id
+ :db/valueType :db.type/uuid
+ :db/cardinality :db.cardinality/one
+ :db/unique :db.unique/value}
+ {:db/ident :person/name
+ :db/valueType :db.type/string
+ :db/cardinality :db.cardinality/one}
+ {:db/ident :person/manager
+ :db/valueType :db.type/ref
+ :db/cardinality :db.cardinality/one}}
+
+;; insert data
+#{ {:person/id #uuid "d3f29960-ccf0-44e4-be66-1a1544677441"
+ :person/name "Foo"
+ :person/manager [:person/id #uuid "076356f4-1a0e-451c-b9c6-a6f56feec941"]}
+ {:person/id #uuid "076356f4-1a0e-451c-b9c6-a6f56feec941"
+ :person/name "Bar"}}
+
+;; query data, make a relation
+{:find [?employee-name ?manager-name]
+ :where [[?person :person/name ?employee-name]
+ [?person :person/manager ?manager]
+ [?manager :person/name ?manager-name]]}
+```
+{% endraw %}
+
+(forgive any errors on the above SQL and datalog code, I didn't run them to check. Patches welcome!)
+
+This employee example comes from the paper, and both SQL and datalog representations match the paper definition of "relational".
+
+Both "Foo" and "Bar" are employees, and the data is normalized.
+SQL represents data as tables, and Datomic as datoms, but relations could be derived from both, which we could view as:
+
+```
+employee_name | manager_name
+----------------------------
+"Foo" | "Bar"
+```
+
+[^edn-queries]: You can think of it as JSON, but with a Clojure taste.
+[edn]: https://github.com/edn-format/edn
+
+## Conclusion
+
+The article also talks about operators, consistency and normalization, which are now so widespread and well-known that it feels a bit weird seeing someone advocating for it.
+
+I also stablish that `relational != SQL`, and other databases such as Datomic are also relational, following Codd's original definition.