Relational shell programming

[article index] [] [@mattmight] [rss]

No one would mistake the average shell script for principled software.

Yet, if we look at how scripts are used, patterns emerge.

Unix is a bestiary of ad hoc databases: comma-, colon-, tab- and space-separated tables. Think of /etc/* or /var/log/*, or of columnar commands.

Shell scripts commonly, if unknowingly, compose five (of six) primitive relational-algebraic operations on these tables: union, difference, projection, selection and renaming:

  • cat acts like union;
  • sed and grep act like selection;
  • cut acts like projection;
  • awk can perform renaming; and
  • diff acts (almost) like difference.

Relational algebra (whose sixth primitive operation is Cartesian product) is equivalent to both relational calculus and SQL.

Cartesian product (and equijoin) are not difficult to create in bash.

If you find yourself stumbling into a relational design pattern in a shell script, consider making that relationality rigid and explicit.

Read on to learn a little more about databases, shell scripts or both.


Update: A few readers have pointed out the flexible join command. Others have pointed out that the classic text The AWK Programming Language contains a section on how to implement a relational DB with awk.

Representing relations

In mathematics, a relation is a set of tuples.

[A tuple is an ordered collection of values, $(v_1,\ldots,v_n)$.]

For example, $\{({\tt Bob}, 31), ({\tt Judy}, 32)\}$ is a relation.

In database theory, a relation is a set of tuples with an assigned name for each column; that is, a relation is a table.

For the earlier relation, we could define a header tuple $({\tt name}, {\tt age})$ that names each column.

We could then represent the relation explicitly as a table:

name age
Bob 31
Judy 32

Relational algebra is a theory for manipulating relations whose power is equivalent to SQL and relational calculus.

Remarkably, relational algebra has only six primitive operations.

I define the six primitives below, but if you're looking for a comprehensive work on relational theory, particularly as it relates to modern databases, I recommend Date's SQL and Relational Theory:

Relations in Unix

Many Unix commands interpret files like relations: each line is a tuple.

CSV or "comma-separated value" files are a crude, generic way to encode a relation. For example, see the MaxMind GeoIP database:

  "1.0.0.0","1.0.0.255","16777216","16777471","AU","Australia"
  "1.0.1.0","1.0.3.255","16777472","16778239","CN","China"

The password and group files for a Unix system (/etc/passwd and /etc/group), which store account and login information, are relations with elements of individual tuples separated by colons (:):

  nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
  root:*:0:0:System Administrator:/var/root:/bin/sh
  daemon:*:1:1:System Services:/var/root:/usr/bin/false 

The /etc/hosts file, which can fix a hostname to an IP address, is a space-separated relation:

  127.0.0.1        localhost
  255.255.255.255  broadcasthost
  ::1              localhost 
  fe80::1%lo0      localhost

The /etc/resolv.conf file, which specifies domain name information, is a space-separated relation:

  domain hsd1.ut.comcast.net.
  nameserver 208.67.220.220
  nameserver 208.67.222.222

The command netstat -nr produces tables of routing information.

  Destination     Gateway      Flags      Refs      Use   Netif 
  127             127.0.0.1    UCS           0        0     lo0
  127.0.0.1       127.0.0.1    UH            5   151051     lo0
  224.0.0/4       lo0          UmCS          1        0     lo0
  224.0.0.251     lo0          UHmW3I        0        0     lo0

The command traceroute produces relation-like data:

 5  69.139.247.14  16.560 ms  26.933 ms  11.521 ms
 6  68.86.90.125  74.433 ms  40.330 ms  57.565 ms
 7  68.86.87.30  54.590 ms  52.155 ms  55.322 ms

Many log files are, in effect, giant relations as well.

Union

Mathematically, union $(\cup)$ combines the contents of two sets:

\[ A \cup B = \{ x \mathrel{|} x \in A \text{ or } x \in B \} \]

In the shell, union is straightforward--it's cat:

 $ cat relation-1 relation-2 > unioned-relation

After any operation in which duplicate entries are a possibility, compose with sort and uniq:

 $ cat relation-1 relation-2 | sort | uniq > unioned-relation

Since a relation is a set, duplicate entries are not allowed.

Many other programs accept multiple files as arguments and print out their contents: sed, grep, cut, awk.

Thus, these commands can combine selection (and projection) with union.

To "union" the output of one command with another, use a sequencing semicolon with an append:

  $ command-1 > relation; command-2 >> relation

Selection

Mathematically, selection $(\sigma_\phi)$ is a filtering operation that applies a predicate $\phi$ to each member of a set, retaining only those for which the predicate is true:

\[ \sigma_\phi(A) = \{ x \mathrel{|} x \in A \text{ and } \phi(x) \text{ is true } \} \]

Many Unix commands in serve as selectors (and simultaneously as union); that is, many programs can filter a file, line by line.

The most common selectors are sed, grep and awk.

For example, awk can select entries from /etc/passwd where the user id and the primary group id are not equal:

 awk -F ":" '{ if ( $3 != $4 ) print }' /etc/passwd

Projection

Mathematically, projection $(\pi_{c_1,\ldots,c_n})$ is a reduction operator that retains the specified columns $(c_1,\ldots,c_n)$ in a relation and discards the rest.

If a relation is table, then $c_1,\ldots,c_n$ can be the names of the columns (and the order in which to preserve them).

Otherwise, $c_1,\ldots,c_n$ can be column numbers so that:

\[ \pi_{i_1,\ldots,i_n}(R) = \{ (x_{i_1}, \ldots, x_{i_n}) \mathrel{|} (x_1,\ldots,x_n) \in R \} \]

The Unix command cut is a projection-like command.

For example, cut can project just the user names and shells out of /etc/passwd:

 $ cut -d ":" -f 1,7 /etc/passwd

The Unix commands sed and awk also behave like projectors.

For example, cut cannot reorder colums, but awk can:

 $ awk -F":" '{ print $7 ":" $1 }' /etc/passwd

If the syntactic structure of each tuple is more complex, then sed can pull it apart with a regular expression.

Rename

For tables, the rename operation $(\rho_{c/c'})$ changes the name of column $c'$ to $c$.

Since most Unix tables don't specify headers, or the structure is implicit, there is no equivalent of or need for rename.

Columns are implicitly named positionally.

awk can easily reorder columns.

Cartesian product

Cartesian product is the source of much power for relational algebra, since it is the key ingredient in joins--the primitive operation that combines two tables together.

Relational Cartesian product $(\times)$ fully combines two relations:

\[ A \times B = \{ (a_1, \ldots, a_n, b_1, \ldots, b_m) \mathrel{|} (a_1,\ldots,a_n) \in A \text{ and } (b_1,\ldots,b_m) \in B \} \]

Surprisingly, there is no shell equivalent of Cartesian product.

Fortunately, it's not hard to write a cartesian script:

#!/bin/bash

delim="," # CSV by default

# Parse flagged arguments:
while getopts "td:" flag
do
  case $flag in
    d) delim=$OPTARG;;
    t) delim="\t";;
    ?) exit;;
  esac
done

# Delete the flagged arguments:
shift $(($OPTIND -1))

# Remaining args now in $*

# Now join $1 and $2
while read a; 
 do while read b; 
      do echo "$a${delim}$b"; 
      done < $2;
 done < $1 

Update: Travis Johnson wrote in to say that the Unix command paste seems to provide functionality of Cartesian product, but it doesn't quite do that. The paste joins corresponding lines, but does not compute a product:

% cat f1
a
b
c

% cat f2
d
e
f

% paste -d "," f1 f2
a,d
b,e
c,f

% cartesian f1 f2
a,d
a,e
a,f
b,d
b,e
b,f
c,d
c,e
c,f

Difference

Mathematically, set difference $(-)$ removes the elements of a set that are present in another set:

\[ A - B = \{ a \mathrel{|} a \in A \text{ and } a \not\in B \} \]

The command diff reports the difference between two files, but it does not perform relational set difference.

We need to implement relational set difference.

It is simpler to implement difference if we assume the existence of a membership tester, memberp.

The command memberp file line exits successfully if the file file contains the line line, and it fails otherwise.

With access to memberp, difference becomes:

#!/bin/bash

while read a; 
 do 
   if ! memberp $2 "$a"; then
     echo "$a"
   fi
 done < $1

Membership

A membership script, memberp, searches line by line:

#!/bin/bash

# Usage: memberp file line

file=$1

shift

while read a; 
 do 
   if [ "$a" = "$*" ]; then
     exit 0
   fi
 done < $file

exit -1

An alternative: comm

The command comm can also compute the difference between two files A and B:

comm -23 <(sort A) <(sort B)

Join

A join is a synthetic operation composed of selection, projection and product.

We could construct it from

A common join is equijoin, where the product of two relations is filtered based on the equality of two columns.

To peform an equijoin, we need two relations and columns to compare.

The command equijoin [ -d <delim> | -t ] rel1 rel2 col1 col2 uses awk to combine and compare two tables:

#!/bin/bash

delim="," # CSV by default

# Parse flagged arguments:
while getopts "td:" flag
do
  case $flag in
    d) delim=$OPTARG;;
    t) delim="\t";;
    ?) exit;;
  esac
done

# Delete the flagged arguments:
shift $(($OPTIND -1))

f1=$1
f2=$2
col1=$3
col2=$4

cols=`awk -F "${delim}" '{ print NF ; exit }' $f1`

while read a
 do while read b
      do 
        echo "$a${delim}$b"
      done < $f2
 done < $f1 |
 awk -F "${delim}" "{ if ( \$${col1} == \$$(( col2 + cols )) ) print }"

Example: Deleting a list of bad users

We're given a list of bad users to remove from /etc/passwd.

They're in the file bad.db:

matt
bob

Let's use a simplified /etc/passwd for this example:

root:*:0:0:The Admin:/root:/bin/sh
matt:*:500:500:Matt:/home/matt:/bin/bash
john:*:501:501:John:/home/john:/bin/bash
bob:*:502:502:Bob:/home/bob:/bin/bash

How can we delete them relationally?

Start with a product, cartesian -d ":" bad.db /etc/passwd > badpasswd.db:

matt:root:*:0:0:The Admin:/root:/bin/sh
matt:matt:*:500:500:Matt:/home/matt:/bin/bash
matt:john:*:501:501:John:/home/john:/bin/bash
matt:bob:*:502:502:Bob:/home/bob:/bin/bash
bob:root:*:0:0:The Admin:/root:/bin/sh
bob:matt:*:500:500:Matt:/home/matt:/bin/bash
bob:john:*:501:501:John:/home/john:/bin/bash
bob:bob:*:502:502:Bob:/home/bob:/bin/bash

Then, select rows where the bad user matches the account name, with awk -F: '{ if ( $1 == $2 ) print }' badpasswd.db > offenders.db :

matt:matt:*:500:500:Matt:/home/matt:/bin/bash
bob:bob:*:502:502:Bob:/home/bob:/bin/bash

Now, project out the columns that make the result compatible with the format of /etc/passwd with cut -d ":" -f2- offenders.db > kill.db:

matt:*:500:500:Matt:/home/matt:/bin/bash
bob:*:502:502:Bob:/home/bob:/bin/bash

Finally, use difference /etc/passwd kill.db to generate a new password file sans malefactors:

root:*:0:0:The Admin:/root:/bin/sh
john:*:501:501:John:/home/john:/bin/bash

Exercise

Some of these scripts have quadratic time complexity.

Add a -s flag to equijoin and difference that assumes the inputs have been sorted by sort and produces a sorted output.

Show that -s can achieve better time complexity.

Then, rewrite the account-deletion example to eliminate Cartesian product and use fast equijoin instead.

Good resources

Sed and awk have become a lost art. They're the only languages I know that frequently beat perl in semantic density. If you haven't learned them yet, you'll be impressed with what they can do for you.

I recommend the O'Reilly classic, sed & awk:

I also recommend spending an hour or two going through the advanced bash scripting guide and bash pitfalls. It's worth the time invested.

If you want a solid theoretical understanding of databases, I recommend Date's SQL and Relational Theory:

Related pages