Representing relations
In mathematics, a relation is a set of tuples.
[A tuple is an ordered collection of values, $(v_1,\ldots,v_n)$.]
For example, $\{({\tt Bob}, 31), ({\tt Judy}, 32)\}$ is a relation.
In database theory, a relation is a set of tuples with an assigned name for each column; that is, a relation is a table.
For the earlier relation, we could define a header tuple $({\tt name}, {\tt age})$ that names each column.
We could then represent the relation explicitly as a table:
name | age |
---|---|
Bob | 31 |
Judy | 32 |
Relational algebra is a theory for manipulating relations whose power is equivalent to SQL and relational calculus.
Remarkably, relational algebra has only six primitive operations.
I define the six primitives below, but if you're looking for a comprehensive work on relational theory, particularly as it relates to modern databases, I recommend Date's SQL and Relational Theory:
Relations in Unix
Many Unix commands interpret files like relations: each line is a tuple.
CSV or "comma-separated value" files are a crude, generic way to encode a relation. For example, see the MaxMind GeoIP database:
"1.0.0.0","1.0.0.255","16777216","16777471","AU","Australia" "1.0.1.0","1.0.3.255","16777472","16778239","CN","China"
The password and group files for a Unix system (/etc/passwd
and /etc/group
), which store account and login information, are
relations with elements of individual tuples separated by
colons (:
):
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false root:*:0:0:System Administrator:/var/root:/bin/sh daemon:*:1:1:System Services:/var/root:/usr/bin/false
The /etc/hosts
file,
which can fix a hostname to an IP address,
is a space-separated relation:
127.0.0.1 localhost 255.255.255.255 broadcasthost ::1 localhost fe80::1%lo0 localhost
The /etc/resolv.conf
file,
which specifies domain name information,
is a space-separated relation:
domain hsd1.ut.comcast.net. nameserver 208.67.220.220 nameserver 208.67.222.222
The command netstat -nr
produces tables
of routing information.
Destination Gateway Flags Refs Use Netif 127 127.0.0.1 UCS 0 0 lo0 127.0.0.1 127.0.0.1 UH 5 151051 lo0 224.0.0/4 lo0 UmCS 1 0 lo0 224.0.0.251 lo0 UHmW3I 0 0 lo0
The command traceroute
produces
relation-like data:
5 69.139.247.14 16.560 ms 26.933 ms 11.521 ms 6 68.86.90.125 74.433 ms 40.330 ms 57.565 ms 7 68.86.87.30 54.590 ms 52.155 ms 55.322 ms
Many log files are, in effect, giant relations as well.
Union
Mathematically, union $(\cup)$ combines the contents of two sets:
\[ A \cup B = \{ x \mathrel{|} x \in A \text{ or } x \in B \} \]
In the shell, union is straightforward--it's cat
:
$ cat relation-1 relation-2 > unioned-relation
After any operation in which duplicate entries are a possibility,
compose
with sort
and uniq
:
$ cat relation-1 relation-2 | sort | uniq > unioned-relation
Since a relation is a set, duplicate entries are not allowed.
Many other programs accept multiple files as arguments
and print out their contents:
sed
,
grep
,
cut
,
awk
.
Thus, these commands can combine selection (and projection) with union.
To "union" the output of one command with another, use a sequencing semicolon with an append:
$ command-1 > relation; command-2 >> relation
Selection
Mathematically, selection $(\sigma_\phi)$ is a filtering operation that applies a predicate $\phi$ to each member of a set, retaining only those for which the predicate is true:
\[ \sigma_\phi(A) = \{ x \mathrel{|} x \in A \text{ and } \phi(x) \text{ is true } \} \]Many Unix commands in serve as selectors (and simultaneously as union); that is, many programs can filter a file, line by line.
The most common selectors are
sed
,
grep
and
awk
.
For example, awk
can select entries
from /etc/passwd
where the user id and the
primary group id are not equal:
awk -F ":" '{ if ( $3 != $4 ) print }' /etc/passwd
Projection
Mathematically, projection $(\pi_{c_1,\ldots,c_n})$ is a reduction operator that retains the specified columns $(c_1,\ldots,c_n)$ in a relation and discards the rest.
If a relation is table, then $c_1,\ldots,c_n$ can be the names of the columns (and the order in which to preserve them).
Otherwise, $c_1,\ldots,c_n$ can be column numbers so that:
\[ \pi_{i_1,\ldots,i_n}(R) = \{ (x_{i_1}, \ldots, x_{i_n}) \mathrel{|} (x_1,\ldots,x_n) \in R \} \]
The Unix command cut
is a projection-like command.
For example, cut
can project just the user names and shells
out of /etc/passwd
:
$ cut -d ":" -f 1,7 /etc/passwd
The Unix commands sed
and awk
also behave like projectors.
For example, cut
cannot
reorder colums, but awk
can:
$ awk -F":" '{ print $7 ":" $1 }' /etc/passwd
If the syntactic structure of each tuple
is more complex,
then sed
can pull it apart
with a regular expression.
Rename
For tables, the rename operation $(\rho_{c/c'})$ changes the name of column $c'$ to $c$.
Since most Unix tables don't specify headers, or the structure is implicit, there is no equivalent of or need for rename.
Columns are implicitly named positionally.
awk
can easily reorder columns.
Cartesian product
Cartesian product is the source of much power for relational algebra, since it is the key ingredient in joins--the primitive operation that combines two tables together.
Relational Cartesian product $(\times)$ fully combines two relations:
\[ A \times B = \{ (a_1, \ldots, a_n, b_1, \ldots, b_m) \mathrel{|} (a_1,\ldots,a_n) \in A \text{ and } (b_1,\ldots,b_m) \in B \} \]Surprisingly, there is no shell equivalent of Cartesian product.
Fortunately, it's not hard to write a cartesian
script:
#!/bin/bash delim="," # CSV by default # Parse flagged arguments: while getopts "td:" flag do case $flag in d) delim=$OPTARG;; t) delim="\t";; ?) exit;; esac done # Delete the flagged arguments: shift $(($OPTIND -1)) # Remaining args now in $* # Now join $1 and $2 while read a; do while read b; do echo "$a${delim}$b"; done < $2; done < $1
Update: Travis Johnson wrote in
to say that the Unix command paste
seems to provide functionality
of Cartesian product, but it doesn't quite do that.
The paste
joins corresponding lines, but does not compute a product:
% cat f1 a b c % cat f2 d e f % paste -d "," f1 f2 a,d b,e c,f % cartesian f1 f2 a,d a,e a,f b,d b,e b,f c,d c,e c,f
Difference
Mathematically, set difference $(-)$ removes the elements of a set that are present in another set:
\[ A - B = \{ a \mathrel{|} a \in A \text{ and } a \not\in B \} \]
The command diff
reports the difference between two files,
but it does not perform relational set difference.
We need to implement relational set difference.
It is simpler to implement difference
if we assume the existence of a membership tester, memberp
.
The command memberp file line
exits successfully
if the file file contains the line line, and it fails
otherwise.
With access to memberp
, difference
becomes:
#!/bin/bash while read a; do if ! memberp $2 "$a"; then echo "$a" fi done < $1
Membership
A membership script, memberp
, searches line by line:
#!/bin/bash # Usage: memberp file line file=$1 shift while read a; do if [ "$a" = "$*" ]; then exit 0 fi done < $file exit -1
An alternative: comm
The command comm
can also
compute the difference
between two files A and B:
comm -23 <(sort A) <(sort B)
Join
A join is a synthetic operation composed of selection, projection and product.
We could construct it from
A common join is equijoin, where the product of two relations is filtered based on the equality of two columns.
To peform an equijoin, we need two relations and columns to compare.
The command equijoin [ -d <delim> | -t ] rel1 rel2 col1
col2
uses awk
to combine and compare two tables:
#!/bin/bash delim="," # CSV by default # Parse flagged arguments: while getopts "td:" flag do case $flag in d) delim=$OPTARG;; t) delim="\t";; ?) exit;; esac done # Delete the flagged arguments: shift $(($OPTIND -1)) f1=$1 f2=$2 col1=$3 col2=$4 cols=`awk -F "${delim}" '{ print NF ; exit }' $f1` while read a do while read b do echo "$a${delim}$b" done < $f2 done < $f1 | awk -F "${delim}" "{ if ( \$${col1} == \$$(( col2 + cols )) ) print }"
Example: Deleting a list of bad users
We're given a list of bad users to remove from
/etc/passwd
.
They're in the file bad.db
:
matt bob
Let's use a simplified /etc/passwd
for this example:
root:*:0:0:The Admin:/root:/bin/sh matt:*:500:500:Matt:/home/matt:/bin/bash john:*:501:501:John:/home/john:/bin/bash bob:*:502:502:Bob:/home/bob:/bin/bash
How can we delete them relationally?
Start with a product, cartesian -d ":" bad.db /etc/passwd > badpasswd.db
:
matt:root:*:0:0:The Admin:/root:/bin/sh matt:matt:*:500:500:Matt:/home/matt:/bin/bash matt:john:*:501:501:John:/home/john:/bin/bash matt:bob:*:502:502:Bob:/home/bob:/bin/bash bob:root:*:0:0:The Admin:/root:/bin/sh bob:matt:*:500:500:Matt:/home/matt:/bin/bash bob:john:*:501:501:John:/home/john:/bin/bash bob:bob:*:502:502:Bob:/home/bob:/bin/bash
Then, select rows where the bad user matches the account name,
with awk -F: '{ if ( $1 == $2 ) print }' badpasswd.db > offenders.db
:
matt:matt:*:500:500:Matt:/home/matt:/bin/bash bob:bob:*:502:502:Bob:/home/bob:/bin/bash
Now, project out the columns that make the result compatible with
the format of /etc/passwd
with cut -d ":" -f2- offenders.db > kill.db
:
matt:*:500:500:Matt:/home/matt:/bin/bash bob:*:502:502:Bob:/home/bob:/bin/bash
Finally, use difference /etc/passwd kill.db
to
generate a new password file sans malefactors:
root:*:0:0:The Admin:/root:/bin/sh john:*:501:501:John:/home/john:/bin/bash
Exercise
Some of these scripts have quadratic time complexity.
Add a -s
flag to equijoin
and difference
that assumes the inputs have been sorted by sort
and produces a sorted output.
Show that -s
can achieve better time complexity.
Then, rewrite the account-deletion example to eliminate Cartesian product and use
fast equijoin
instead.
Good resources
Sed and awk have become a lost art. They're the only languages I know that frequently beat perl in semantic density. If you haven't learned them yet, you'll be impressed with what they can do for you.
I recommend the O'Reilly classic, sed & awk:
I also recommend spending an hour or two going through the advanced bash scripting guide and bash pitfalls. It's worth the time invested.
If you want a solid theoretical understanding of databases, I recommend Date's SQL and Relational Theory: