From 4547b50a74c66955ec89032742aa04e9fd0ea0cb Mon Sep 17 00:00:00 2001 From: EuAndreh Date: Thu, 29 Apr 2021 09:24:47 -0300 Subject: Add article on Codd's paper --- ...a-for-large-shared-data-banks-article-review.md | 130 +++++++++++++++++++++ 1 file changed, 130 insertions(+) create mode 100644 _articles/2021-04-29-a-relational-model-of-data-for-large-shared-data-banks-article-review.md (limited to '_articles') diff --git a/_articles/2021-04-29-a-relational-model-of-data-for-large-shared-data-banks-article-review.md b/_articles/2021-04-29-a-relational-model-of-data-for-large-shared-data-banks-article-review.md new file mode 100644 index 0000000..e15b478 --- /dev/null +++ b/_articles/2021-04-29-a-relational-model-of-data-for-large-shared-data-banks-article-review.md @@ -0,0 +1,130 @@ +--- + +title: A Relational Model of Data for Large Shared Data Banks - article-review + +date: 2021-04-29 + +layout: post + +lang: en + +ref: a-relational-model-of-data-for-large-shared-data-banks-article-review + +--- + +This is a review of the article "[A Relational Model of Data for Large Shared Data Banks][codd-article]", by E. F. Codd. + +[codd-article]: https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf + +## Data Independence + +Codd brings the idea of *data independence* as a better approach to use on databases. +This is contrast with the existing approaches, namely hierarquical (tree-based) and network-based. + +His main argument is that queries in applications shouldn't depende and be coupled with how the data is represented internally by the database system. +This key idea is very powerful, and something that we strive for in many other places: decoupling the interface from the implementation. + +If the database system has this separation, it can kep the querying interface stable, while having the freedom to change its internal representation at will, for better performance, less storage, etc. + +This is true for most modern database systems. +They can change from B-Trees with leafs containing pointers to data, to B-Trees with leafs containing the raw data , to hash tables. +All that without changing the query interface, only its performance. + +Codd mentions that, from an information representation standpoint, any index is a duplication, but useful for perfomance. + +This data independence also impacts ordering (a *relation* doesn't rely on the insertion order). + +## Duplicates + +His definition of relational data is a bit differente from most modern database systems, namely **no duplicate rows**. + +I couldn't find a reason behind this restriction, though. +For practical purposes, I find it useful to have it. + +## Relational Data + +In the article, Codd doesn't try to define a language, and today's most popular one is SQL. + +However, there is no restriction that says that "SQL database" and "relational database" are synonyms. +One could have a relational database without using SQL at all, and it would still be a relational one. + +The main one that I have in mind, and the reason that led me to reading this paper in the first place, is Datomic. + +Is uses an [edn]-based representation for datalog queries[^edn-queries], and a particular schema used to represent data. + +Even though it looks very weird when coming from SQL, I'd argue that it ticks all the boxes (except for "no duplicates") that defines a relational database, since building relations and applying operations on them is possible. + +Compare and contrast a contrived example of possible representations of SQL and datalog of the same data: + +```sql +-- create schema +CREATE TABLE people ( + id UUID PRIMARY KEY, + name TEXT NOT NULL, + manager_id UUID, + FOREIGN KEY (manager_id) REFERENCES people (id) +); + +-- insert data +INSERT INTO people (id, name, manager_id) VALUES + ("d3f29960-ccf0-44e4-be66-1a1544677441", "Foo", "076356f4-1a0e-451c-b9c6-a6f56feec941"), + ("076356f4-1a0e-451c-b9c6-a6f56feec941", "Bar"); + +-- query data, make a relation + +SELECT employees.name AS 'employee-name', + managers.name AS 'manager-name' +FROM people employees +INNER JOIN people managers ON employees.manager_id = managers.id; +``` + +{% raw %} +``` +;; create schema +#{ {:db/ident :person/id + :db/valueType :db.type/uuid + :db/cardinality :db.cardinality/one + :db/unique :db.unique/value} + {:db/ident :person/name + :db/valueType :db.type/string + :db/cardinality :db.cardinality/one} + {:db/ident :person/manager + :db/valueType :db.type/ref + :db/cardinality :db.cardinality/one}} + +;; insert data +#{ {:person/id #uuid "d3f29960-ccf0-44e4-be66-1a1544677441" + :person/name "Foo" + :person/manager [:person/id #uuid "076356f4-1a0e-451c-b9c6-a6f56feec941"]} + {:person/id #uuid "076356f4-1a0e-451c-b9c6-a6f56feec941" + :person/name "Bar"}} + +;; query data, make a relation +{:find [?employee-name ?manager-name] + :where [[?person :person/name ?employee-name] + [?person :person/manager ?manager] + [?manager :person/name ?manager-name]]} +``` +{% endraw %} + +(forgive any errors on the above SQL and datalog code, I didn't run them to check. Patches welcome!) + +This employee example comes from the paper, and both SQL and datalog representations match the paper definition of "relational". + +Both "Foo" and "Bar" are employees, and the data is normalized. +SQL represents data as tables, and Datomic as datoms, but relations could be derived from both, which we could view as: + +``` +employee_name | manager_name +---------------------------- +"Foo" | "Bar" +``` + +[^edn-queries]: You can think of it as JSON, but with a Clojure taste. +[edn]: https://github.com/edn-format/edn + +## Conclusion + +The article also talks about operators, consistency and normalization, which are now so widespread and well-known that it feels a bit weird seeing someone advocating for it. + +I also stablish that `relational != SQL`, and other databases such as Datomic are also relational, following Codd's original definition. -- cgit v1.2.3