3 shell scripts to improve your writing, or "My Ph.D. advisor rewrote himself in bash."

The hardest part of advising Ph.D. students is teaching them how to write.

Fortunately, I've seen patterns emerge over the past couple years.

So, I've decided to replace myself with a shell script.

In particular, I've created shell scripts for catching three problems:

abuse of the passive voice,
weasel words, and
lexical illusions.

And, I've integrated these into the build system of our LaTeX documents.

The point of these scripts is not to ban all use of constructs like the passive voice. (When it comes to writing, there are exceptions to every "rule.")

The point of these scripts is to make sure that my students and I make a conscious choice to use these constructs.

When these scripts highlight a sentence, my students should ask themselves, "Is there a better way to say what I said--a way to make the text read with more clarity and precision?" Often enough, the answer is "yes."

The meta-point of this article is that writers should learn their individual weaknesses. And, when writers are programmers, we should enlist automation to combat these weaknesses.¹

More resources

There are four books at arm's length in my office:

Strunk and White's The Elements of Style is still a good, if not perfect, reference on style. Young writers should calibrate their reading of Elements in light of criticism from linguistic experts. Experts claim that the good parts of Strunk and White are common sense. I take issue only with their application of the modifier common. My counterparts in linguistics and the humanities must have never seen the aggressive absence of style that permeates writing in science, engineering and mathematics.
After reading this post, my colleague Zachary Tatlock recommended Style: The Basics of Clarity and Grace as a better reference than Strunk and White. Having had a chance to read it, I agree with Zachary. If you can get only one book, this is the one to get. (I'd still recommend a thoughtful reading of both.)
When I have a dispute, The Chicago Manual of Style ends it.
A Manual for Writers of Research Papers, Theses, and Dissertations is the Chicago manual for technical writing.

Precision and clarity

My Ph.D. advisor, Olin Shivers, taught me that technical writing is a balancing act between precision, clarity and marketing.

After a recent round of paper submissions with my own Ph.D. students, I've identified mechanically recognizable ways that precision and clarity leak out of a paper: weasel words and abuse of the passive voice.

So, I've written shell scripts to detect these leaks.

(I don't think I'll ever be able to write a shell script that detects bad marketing for a scientific idea.)

Weasel words

Weasel words--phrases or words that sound good without conveying information--obscure precision.

I notice three kinds of weasel words in my students' writing: (1) salt and pepper words, (2) beholder words and (3) lazy words.

Salt and pepper words

New grad students sprinkle in salt and pepper words for seasoning. These words look and feel like technical words, but convey nothing.

My favorite salt and pepper words/phrases are various, a number of, fairly, and quite. Sentences that cut these words out become stronger.

 Bad:    It is quite difficult to find untainted samples.
 Better: It is difficult to find untainted samples.

 Bad:    We used various methods to isolate four samples.
 Better: We isolated four samples.

Beholder words

Beholder words are those whose meaning is a function of the reader; for example: interestingly, surprisingly, remarkably, or clearly.

Peer reviewers don't like judgments drawn for them.

Bad:    False positives were surprisingly low.
Better: To our surprise, false positives were low.
Good:   To our surprise, false positives were low (3%).

Lazy words

Students insert lazy words in order to avoid making a quantitative characterization. They give the impression that the author has not yet conducted said characterization.

These words make the science feel unfirm and unfinished.

The two worst offenders in this category are the words very and extremely. These two adverbs are never excusable in technical writing. Never.

Other offenders include several, exceedingly, many, most, few, vast.

 Bad:    There is very close match between the two semantics.
 Better: There is a close match between the two semantics.

Adverbs

In technical writing, adverbs tend to come off as weasel words.

I'd even go so far as to say that the removal of all adverbs from any technical writing would be a net positive for my newest graduate students. (That is, new graduate students weaken a sentence when they insert adverbs more frequently than they strengthen it.)

 Bad:    We offer a completely different formulation of CFA.
 Better: We offer a different formulation of CFA.

A script to find weasel words

With this script, you can supply an alternate list of weasel words in a file if you don't like the default:

#!/bin/bash

weasels="many|various|very|fairly|several|extremely\
|exceedingly|quite|remarkably|few|surprisingly\
|mostly|largely|huge|tiny|((are|is) a number)\
|excellent|interestingly|significantly\
|substantially|clearly|vast|relatively|completely"

wordfile=""

# Check for an alternate weasel file
if [ -f $HOME/etc/words/weasels ]; then
    wordfile="$HOME/etc/words/weasels"
fi

if [ -f $WORDSDIR/weasels ]; then
    wordfile="$WORDSDIR/weasels"
fi

if [ -f words/weasels ]; then
    wordfile="words/weasels"
fi

if [ ! "$wordfile" = "" ]; then
    weasels="xyzabc123";
    for w in `cat $wordfile`; do
        weasels="$weasels|$w"
    done
fi


if [ "$1" = "" ]; then
 echo "usage: `basename $0` <file> ..."
 exit
fi

egrep -i -n --color "\\b($weasels)\\b" $*

exit $?

Passive voice

There are times when the passive voice is acceptable in technical writing.

I also believe, as with adverbs, that removal of the passive voice would have been a net improvement for over half the technical writing I've edited. (That is, students abuse the passive voice more often than they use it well.)

Of course, I do not advocate dogmatic removal of the passive voice.

The passive voice is tough to shake. Even while writing this article, I caught myself defaulting to the passive in situations where the active was better.

The passive voice is bad when it hides relevant or explanatory information:

 Bad:    Termination is guaranteed on any input.
 Better: Termination is guaranteed on any input by a finite state-space.
 OK:     A finite state-space guarantees termination on any input.

In the first sentence, the passive hides relevant information.

The second sentence includes the relevant information, but the passive misplaces the emphasis.

The third sentence contains all the relevant information, and it feels crisp.

There's one case where I think the passive is preferrable in technical writing--when the subject is truly irrelevant:

 OK: 4 mL HCl were added to the solution.

Even in this example, I personally don't believe it's egregious to use we:

 OK (to me): We added 4 mL HCl to the solution.

In summary, for each use of the passive highlighted by my script, ask the following questions:

Is the agent relevant yet unclear?
Does the text read better with the sentence in the active?

If the answer to both questions is "yes," then change to the active.

If only the answer to the first question is "yes," then specify the agent.

A script to find passive voice

#!/bin/bash

irregulars="awoken|\
been|born|beat|\
become|begun|bent|\
beset|bet|bid|\
bidden|bound|bitten|\
bled|blown|broken|\
bred|brought|broadcast|\
built|burnt|burst|\
bought|cast|caught|\
chosen|clung|come|\
cost|crept|cut|\
dealt|dug|dived|\
done|drawn|dreamt|\
driven|drunk|eaten|fallen|\
fed|felt|fought|found|\
fit|fled|flung|flown|\
forbidden|forgotten|\
foregone|forgiven|\
forsaken|frozen|\
gotten|given|gone|\
ground|grown|hung|\
heard|hidden|hit|\
held|hurt|kept|knelt|\
knit|known|laid|led|\
leapt|learnt|left|\
lent|let|lain|lighted|\
lost|made|meant|met|\
misspelt|mistaken|mown|\
overcome|overdone|overtaken|\
overthrown|paid|pled|proven|\
put|quit|read|rid|ridden|\
rung|risen|run|sawn|said|\
seen|sought|sold|sent|\
set|sewn|shaken|shaven|\
shorn|shed|shone|shod|\
shot|shown|shrunk|shut|\
sung|sunk|sat|slept|\
slain|slid|slung|slit|\
smitten|sown|spoken|sped|\
spent|spilt|spun|spit|\
split|spread|sprung|stood|\
stolen|stuck|stung|stunk|\
stridden|struck|strung|\
striven|sworn|swept|\
swollen|swum|swung|taken|\
taught|torn|told|thought|\
thrived|thrown|thrust|\
trodden|understood|upheld|\
upset|woken|worn|woven|\
wed|wept|wound|won|\
withheld|withstood|wrung|\
written"

if [ "$1" = "" ]; then
 echo "usage: `basename $0` <file> ..."
 exit
fi

egrep -n -i --color \
 "\\b(am|are|were|being|is|been|was|be)\
\\b[ ]*(\w+ed|($irregulars))\\b" $*

exit $?

A script to find lexical illusions

Read the following text:

 Many readers are not aware that the
 the brain will automatically ignore
 a second instance of the word "the"
 when it starts a new line.

Read that same text again, but with different line breaks:

 Many readers are not aware that the the
 brain will automatically ignore a second
 instance of the word "the" when it starts
 a new line.

Duplicating words is a phenomenon of electronic composition.

They seem to happen as cut and paste accidents, and most frequently it's with the word the.

Unfortunately, it can be difficult to proofread away duplicate words, because this lexical illusion prevents us from finding them.

No reviewer will shoot down a submission solely because it contains duplicate words, but when small mistakes like spelling errors and duplicate words pile up, they convey a lack of proofreading.

Reviewers will (rightfully) interpret inadequate proofreading as a lack of respect for their time and attention.

Fortunately, a short perl script hunts these bugs down:

#!/usr/bin/env perl

# Finds duplicate adjacent words.

use strict ;

my $DupCount = 0 ;

if (!@ARGV) {
  print "usage: dups <file> ...\n" ;
  exit ;
}

while (1) {
  my $FileName = shift @ARGV ;

  # Exit code = number of duplicates found.  
  exit $DupCount if (!$FileName) ;

  open FILE, $FileName or die $!; 
  
  my $LastWord = "" ;
  my $LineNum = 0 ;
  
  while (<FILE>) {
    chomp ;

    $LineNum ++ ;
    
    my @words = split (/(\W+)/) ;
    
    foreach my $word (@words) {
      # Skip spaces:
      next if $word =~ /^\s*$/ ;

      # Skip punctuation:
      if ($word =~ /^\W+$/) {
        $LastWord = "" ;
        next ;
      }
      
      # Found a dup? 
      if (lc($word) eq lc($LastWord)) {
        print "$FileName:$LineNum $word\n" ;
        $DupCount ++ ;
      } # Thanks to Sean Cronin for tip on case.

      # Mark this as the last word:
      $LastWord = $word ;
    }
  }
  
  close FILE ;
}

Makefile integration

I keep a local copy of the scripts in the bin/ directory of each paper's repository. Then, I add a make proof rule to Makefile:

# Check style:
proof:
        echo "weasel words: "
        sh bin/weasel *.tex
        echo
        echo "passive voice: "
        sh bin/passive *.tex
        echo
        echo "duplicates: "
        perl bin/dups *.tex

A few words on marketing

A grad student's first impulse when she starts grad school is to assume that, as long as she tells the whole truth and nothing but the truth, everything she writes has to be accepted for publication.

But, there are a lot of true things.

Given the volume of submissions to top peer-reviewed venues, there will always be more than enough technically correct papers to fill the venue.

The function of peer review has become to decide which true things are worth knowing.

In that sense, peer reviewers are the guardians of the scientific community's most limited resource: our collective attention span.

To market a paper, the author must make a compelling case for why her idea deserves access to that resource.

Resources

Benjamin Beckwith has contributed a "writegood" mode for emacs inspired by these scripts.

[1] For example, my colleague John Regehr, suggested simple scripts to catch students' use of superfluous phrases like, "Note that," and "Notice that." Others have suggested scripts for using the future tense in technical writing.