summaryrefslogtreecommitdiff
path: root/src/content/blog/2021/04/29/relational-review.adoc
blob: 4b53737c71a01bcf9b895fca182d6f42d04bc684 (about) (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
= A Relational Model of Data for Large Shared Data Banks - article-review

:empty:
:reviewed-article: https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

This is a review of the article "{reviewed-article}[A Relational Model of Data
for Large Shared Data Banks]", by E. F. Codd.

== Data Independence

Codd brings the idea of _data independence_ as a better approach to use on
databases.  This is contrast with the existing approaches, namely hierarquical
(tree-based) and network-based.

His main argument is that queries in applications shouldn't depende and be
coupled with how the data is represented internally by the database system.
This key idea is very powerful, and something that we strive for in many other
places: decoupling the interface from the implementation.

If the database system has this separation, it can kep the querying interface
stable, while having the freedom to change its internal representation at will,
for better performance, less storage, etc.

This is true for most modern database systems.  They can change from B-Trees
with leafs containing pointers to data, to B-Trees with leafs containing the raw
data , to hash tables.  All that without changing the query interface, only its
performance.

Codd mentions that, from an information representation standpoint, any index is
a duplication, but useful for perfomance.

This data independence also impacts ordering (a _relation_ doesn't rely on the
insertion order).

== Duplicates

His definition of relational data is a bit differente from most modern database
systems, namely *no duplicate rows*.

I couldn't find a reason behind this restriction, though.  For practical
purposes, I find it useful to have it.

== Relational Data

:edn: https://github.com/edn-format/edn

In the article, Codd doesn't try to define a language, and today's most popular
one is SQL.

However, there is no restriction that says that "SQL database" and "relational
database" are synonyms.  One could have a relational database without using SQL
at all, and it would still be a relational one.

The main one that I have in mind, and the reason that led me to reading this
paper in the first place, is Datomic.

Is uses an {edn}[edn]-based representation for datalog
queries{empty}footnote:edn-queries[
  You can think of it as JSON, but with a Clojure taste.
], and a particular schema used to represent data.

Even though it looks very weird when coming from SQL, I'd argue that it ticks
all the boxes (except for "no duplicates") that defines a relational database,
since building relations and applying operations on them is possible.

Compare and contrast a contrived example of possible representations of SQL and
datalog of the same data:

[source,sql]
----
-- create schema
CREATE TABLE people (
  id UUID PRIMARY KEY,
  name TEXT NOT NULL,
  manager_id UUID,
  FOREIGN KEY (manager_id) REFERENCES people (id)
);

-- insert data
INSERT INTO people (id, name, manager_id) VALUES
  ("d3f29960-ccf0-44e4-be66-1a1544677441", "Foo", "076356f4-1a0e-451c-b9c6-a6f56feec941"),
  ("076356f4-1a0e-451c-b9c6-a6f56feec941", "Bar");

-- query data, make a relation

SELECT employees.name AS 'employee-name',
       managers.name AS 'manager-name'
FROM people employees
INNER JOIN people managers ON employees.manager_id = managers.id;
----

[source,clojure]
----
;; create schema
#{{:db/ident       :person/id
   :db/valueType   :db.type/uuid
   :db/cardinality :db.cardinality/one
   :db/unique      :db.unique/value}
  {:db/ident       :person/name
   :db/valueType   :db.type/string
   :db/cardinality :db.cardinality/one}
  {:db/ident       :person/manager
   :db/valueType   :db.type/ref
   :db/cardinality :db.cardinality/one}}

;; insert data
#{{:person/id      #uuid "d3f29960-ccf0-44e4-be66-1a1544677441"
   :person/name    "Foo"
   :person/manager [:person/id #uuid "076356f4-1a0e-451c-b9c6-a6f56feec941"]}
  {:person/id      #uuid "076356f4-1a0e-451c-b9c6-a6f56feec941"
   :person/name    "Bar"}}

;; query data, make a relation
{:find [?employee-name ?manager-name]
 :where [[?person  :person/name    ?employee-name]
         [?person  :person/manager ?manager]
         [?manager :person/name    ?manager-name]]}
----

(forgive any errors on the above SQL and datalog code, I didn't run them to
check.  Patches welcome!)

This employee example comes from the paper, and both SQL and datalog
representations match the paper definition of "relational".

Both "Foo" and "Bar" are employees, and the data is normalized.  SQL represents
data as tables, and Datomic as datoms, but relations could be derived from both,
which we could view as:

[source,sql]
----
employee_name | manager_name
----------------------------
"Foo"         | "Bar"
----

== Conclusion

The article also talks about operators, consistency and normalization, which are
now so widespread and well-known that it feels a bit weird seeing someone
advocating for it.

I also stablish that `relational != SQL`, and other databases such as Datomic
are also relational, following Codd's original definition.