The same logic that drives matching birthdays also drives the probability that one can find collisions with a hash function.

In other words, if a uniform hash function outputs a value between 1 and 365 for any input, the probability that at least two hashes collide in a set of 23 values is also about 50%.
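As a quick check of that figure, here is a minimal Python sketch. The function name `collision_probability` is illustrative, not something from the post. It multiplies out the chance that each successive value lands in a bucket nobody has used yet, then takes the complement:

```python
def collision_probability(n, D):
    """Probability that at least two of n uniform values over D buckets coincide."""
    p_all_distinct = 1.0
    for i in range(n):
        # The (i+1)-th value must avoid the i buckets already taken.
        p_all_distinct *= (D - i) / D
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    print(collision_probability(23, 365))  # just over one half
    print(collision_probability(22, 365))  # just under one half
```

With 365 buckets, 23 is the smallest set size for which the probability crosses one half.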

One useful calculation is the expected number of collisions for a sequence of \(n\) values when the range of the hash function contains \(D\) hashes.

The closed-form solution is:

\[ n - D + D \left( \frac{D-1}{D} \right)^n \]

(My present interest in this calculation comes from the number of matches that will happen in a patient-matching network that attempts to match patients having the same disease, assuming there are \(D\) total diseases possible and \(n\) patients in the network.)
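To sanity-check the closed form, the sketch below compares it with a small Monte Carlo simulation. It assumes that "number of collisions" means \(n\) minus the number of distinct hash values observed, which is the reading that matches the formula. The function names, trial count and seed are all illustrative choices.

```python
import random


def expected_collisions(n, D):
    """Closed form: n - D + D * ((D - 1) / D) ** n."""
    return n - D + D * ((D - 1) / D) ** n


def simulate_collisions(n, D, trials=5000, seed=0):
    """Average (n - number of distinct values) over random uniform draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        distinct = len({rng.randrange(D) for _ in range(n)})
        total += n - distinct
    return total / trials


if __name__ == "__main__":
    print(expected_collisions(23, 365))   # roughly 0.68
    print(simulate_collisions(23, 365))   # should land close to the value above
```

For two values the closed form reduces to \(1/D\), which is the probability that the second value hits the first.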

Read below for the derivation in terms of birthdays.

]]>It takes much of the tedium out of the process of writing and formatting bibliographies.

At the heart of BibTeX is a plaintext `.bib` file with entries like:

```
@InProceedings{might2006dcfa,
author = {Matthew Might and Olin Shivers},
title = {Environment Analysis via {$\Delta$CFA}},
booktitle = {Proceedings of the 33rd Annual {ACM} Symposium
on the Principles of Programming Languages
({POPL} 2006)},
pages = {127--140},
year = {2006},
address = {Charleston, South Carolina},
month = {January},
abstract = {
We describe a new program-analysis framework, based on CPS
and procedure-string abstractions, that can handle critical
analyses which the k-CFA framework cannot. We present the main
theorems concerning correctness, show an application analysis,
and describe a running implementation.
},
annote = {
My first publication.
TODO: Explore whether this works better with pushdown systems.
}
}
```

Researchers also take notes in their `.bib` files, so that they serve as a searchable record of one’s journey through the literature.

To make it easier to search, transform and extract from BibTeX files, I created a simple command-line Racket script for parsing BibTeX.

The script supports:

- inlining all @string declarations;
- transforming into S-Expressions;
- transforming into JSON;
- transforming into XML; and
- transforming into a canonical BibTeX format.
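To give a feel for the JSON transformation, here is a rough Python sketch. It is not the Racket script described here, and it handles far less. It assumes a single entry whose field values are all brace-delimited, so there are no quoted strings, bare numbers, `@string` declarations or `#` concatenation. The function name `parse_entry` is hypothetical.

```python
import json


def parse_entry(text):
    """Parse one '@type{key, name = {value}, ...}' entry into a dict.

    Assumption: every field value is wrapped in (possibly nested) braces.
    """
    text = text.strip()
    if not text.startswith("@"):
        raise ValueError("expected an entry starting with '@'")
    entry_type, body = text[1:].split("{", 1)
    key, rest = body.split(",", 1)
    entry = {"type": entry_type.strip().lower(), "key": key.strip(), "fields": {}}

    i = 0
    while True:
        eq = rest.find("=", i)
        if eq == -1:
            break  # no more fields
        name = rest[i:eq].strip(" ,\n\t").lower()
        start = rest.index("{", eq)
        # Scan forward to the brace that matches the opening one.
        depth, j = 0, start
        while True:
            if rest[j] == "{":
                depth += 1
            elif rest[j] == "}":
                depth -= 1
                if depth == 0:
                    break
            j += 1
        # Collapse internal whitespace so multi-line values become one line.
        entry["fields"][name] = " ".join(rest[start + 1:j].split())
        i = j + 1
    return entry


if __name__ == "__main__":
    sample = r"""
    @InProceedings{might2006dcfa,
      author = {Matthew Might and Olin Shivers},
      title  = {Environment Analysis via {$\Delta$CFA}},
      year   = {2006},
    }
    """
    print(json.dumps(parse_entry(sample), indent=2))
```

A real parser needs a proper tokenizer to cover the rest of the BibTeX syntax; this only illustrates the shape of the output.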

The ability to transform `.bib` files into a wide variety of accessible formats unlocks this richly structured data for more than simply managing citations: it becomes a platform for auto-generating CVs, research reports, web sites and more.

Read on for a short article on how to parse BibTeX.

]]>This post covers the basics for using the low-level layer of that framework, including:

- using low-level responses
- using X-Expression responses
- enabling SSL
- enabling basic authentication
- handling form content
- serving static files

I’ve combined all of these concepts together to create a “minimum viable” academic wiki for a small research lab.

It’s about 500 lines, and it supports:

- LaTeX formatting for math
- markdown syntax
- version control for each page
- remote/offline page editing
- syntax-highlighting for code
- user authentication
- SSL-encrypted connections

Read below for the intro and the wiki.

]]>When sequencing identifies a genotype already associated with human disease, it can short-circuit years of costly and painful one-off disease tests.

But, if sequencing turns up “variants/mutations of uncertain clinical significance,” then a new kind of diagnostic odyssey unfolds.

Narrowing down which variant is responsible for a disorder may require “functional studies”: going to the lab to study cells or genetically modifying organisms in an attempt to link the mutations to the presentation of the disorder.

(Functional studies are not and are unlikely to ever be covered by insurance.)

Alternatively – and preferably – you can find a second patient to confirm discovery of the disorder.

This article describes how to use the internet to find a second case for a previously unknown genetic disorder.

If you find success with this approach, please email me to let me know how it worked out for you.

]]>But, most don’t simplify it as much as they could.

Tools like yacc don’t accept regular operations (like option and repetition) in their context-free grammar rules.

And, there are no special operations to handle common patterns like comma-separated (or anything-separated) lists.

This need not be the case.

Derivative parsers already easily handle regular operations, and compiling regular operations to yacc- or BNF-style grammars can bring these operations to other tools.

In this post, I’ll describe how to compile away regular (and other) operations in context-free grammars, and I’ll provide a tool – derp2 – to perform this compilation.

Derp2 targets and is ready to use directly with Racket’s `parser-tools` package.
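To illustrate the general idea of compiling away regular operations, here is a small Python sketch. It is not derp2's algorithm or its input format. The grammar representation (a dict from nonterminal to lists of alternatives), the operator tags `"opt"`, `"star"` and `"sep"`, and the generated rule names are all assumptions made for this example. Each operator is replaced by a fresh nonterminal with plain BNF rules, written left-recursively since that tends to suit yacc-style tools.

```python
def desugar(grammar):
    """Rewrite ("opt", X), ("star", X) and ("sep", X, S) symbols into plain rules.

    grammar maps a nonterminal to a list of alternatives; each alternative is
    a list of symbols. A symbol is a string or one of the operator tuples.
    """
    out = {}
    counter = [0]

    def fresh(prefix):
        counter[0] += 1
        return f"{prefix}_{counter[0]}"

    def lower(sym):
        if isinstance(sym, str):
            return sym
        op = sym[0]
        if op == "opt":        # X?  =>  N ::= <empty> | X
            x = lower(sym[1])
            nt = fresh("opt")
            out[nt] = [[], [x]]
        elif op == "star":     # X*  =>  N ::= <empty> | N X
            x = lower(sym[1])
            nt = fresh("star")
            out[nt] = [[], [nt, x]]
        elif op == "sep":      # non-empty S-separated list  =>  N ::= X | N S X
            x = lower(sym[1])
            s = lower(sym[2])
            nt = fresh("sep")
            out[nt] = [[x], [nt, s, x]]
        else:
            raise ValueError(f"unknown operator: {op!r}")
        return nt

    for head, alternatives in grammar.items():
        out[head] = [[lower(s) for s in alt] for alt in alternatives]
    return out


if __name__ == "__main__":
    # args ::= "(" (expr ("," expr)*)? ")"  written with the sugar above
    sugared = {"args": [["(", ("opt", ("sep", "expr", ",")), ")"]]}
    for head, alts in desugar(sugared).items():
        print(head, "::=", " | ".join(" ".join(a) or "<empty>" for a in alts))
```

The comma-separated-list case shows why a dedicated operator is handy: one tagged symbol stands in for the two helper rules that would otherwise be written by hand.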


There’s a general agreement that this cohort is going to be different, and, in fact, must be different from its predecessors.

As a rare disease parent, I’m
excited about the possibility of using this cohort to find what NIH Director
Francis Collins called “resilient” individuals: participants who, according to
their genome, *should* exhibit a disease phenotype, yet do not.

These individuals will point the way to therapeutic options for many diseases, rare and common.

As a computer scientist, I’m excited about opportunities for making advances in data science, human-computer interaction, mobile computing, security and privacy in the service of this initiative.

I see a trend emerging in the discussion and in my own reflections: to operate at the desired scale at the desired cost per participant, the large national cohort must switch from a “pull” model of medical research to a “push” model of medical research, and it must reimagine participant engagement.

Read below for my reflections on “push versus pull” and other topics of discussion after day one. [I’ll update this after day two.]

]]>I no longer sign NDAs.

Each time I’m offered an NDA, I have to conduct a clause-by-clause review of its strictures to make sure it’s not accidentally overly broad.

This involves a few round trips with the legal department.

Sometimes, it takes me longer to sign the NDA than to evaluate a technical artifact or idea.

Image Credit: J. Patrick Fischer

I’m tired of signing NDAs.

Sometimes (though rarely), that means I can’t work with a particular company or organization, and that’s something I’m willing to accept.

To avoid confusion, I’ve drafted a short standard reply on why I don’t sign NDAs and what I’m willing to do instead: the Professional Academic Alternative to Non-Disclosure Agreements (PAANDA).

If you’re offered an NDA, you’re welcome to offer the PAANDA.

]]>Wasting time is not something I have been able to afford in years.

Read below for a write-up of the time-saving tips and tricks I've accumulated over the last few years.

If you have tips of your own, please send them my way!

]]>Every time I explain them, I feel like I’m using sorcery.

I’ve written posts on memoizing recursive functions with the Y combinator in JavaScript and on the Church encodings in Scheme and in JavaScript.

When I spoke at Hacker School, I used *Python* as the setting in which to derive
Church encodings and the Y combinator for the first time.

In the process, Python seemed to hit a sweet spot for the explanation: it’s a popular language, and the syntax for lambda is concise and close to the original mathematics.

I’m distilling the technical parts of that lecture into this post, and in contrast to prior posts, I’m taking a purely equational reasoning route to Church encodings and the Y combinator – all within Python.

In the end, we’ll have constructed a programming language out of the lambda calculus, and we’ll arrive at the factorial of 5 in the lambda calculus, as embedded in Python:

```
(((lambda f: (((f)((lambda f: ((lambda z: (((f)(((f)(((f)(((f)(((f)
(z)))))))))))))))))))((((((lambda y: ((lambda F: (((F)((lambda x:
(((((((y)(y)))(F)))(x)))))))))))((lambda y: ((lambda F: (((F)((lambda x:
(((((((y)(y)))(F)))(x)))))))))))))((lambda f: ((lambda n: ((((((((((((
lambda n: (((((n)((lambda _: ((lambda t: ((lambda f: (((f)((lambda void:
(void)))))))))))))((lambda t: ((lambda f: (((t)((lambda void: (void)))))
))))))))((((((lambda n: ((lambda m: (((((m)((lambda n: ((lambda f:
((lambda z: (((((((n) ((lambda g: ((lambda h: (((h)(((g)(f)))))))))))
((lambda u: (z)))))((lambda u: (u)))))))))))))(n))))))) (n)))((lambda f:
((lambda z: (z)))))))))((lambda _: ((((lambda n: (((((n) ((lambda _: ((
lambda t: ((lambda f: (((f)((lambda void: (void))))))))))))) ((lambda t:
((lambda f: (((t)((lambda void: (void))))))))))))) ((((((lambda n:
((lambda m: (((((m)((lambda n: ((lambda f: ((lambda z: (((((((n) ((lambda
g: ((lambda h: (((h)(((g)(f)))))))))))((lambda u: (z)))))((lambda u:
(u)))))))))))))(n)))))))((lambda f: ((lambda z: (z)))))))(n)))))))))
((lambda _: ((lambda t: ((lambda f: (((f)((lambda void: (void)))))))))))
))((lambda _: ((lambda f: ((lambda z: (((f)(z)))))))))))((lambda _: (((
(((lambda n: ((lambda m: ((lambda f: ((lambda z: (((((m)(((n)(f)))))(z)
))))))))))(n)))(((f) ((((((lambda n: ((lambda m: (((((m)((lambda n:
((lambda f: ((lambda z: (((((((n) ((lambda g: ((lambda h: (((h)(((g)(f)
))))))))))((lambda u: (z)))))((lambda u: (u)))))))))))))(n)))))))(n)))
((lambda f: ((lambda z: (((f) (z))))))))))))))))))))))))(lambda x:x+1)(0)
```

Run the above in your Python interpreter. It’s equal to 120.

As a bonus, this post is a proof that the indentation-sensitive constructs in Python are strictly optional.
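For a more readable taste of the ingredients, here is a small sketch that is not the post's own derivation. It defines a few Church numerals and a strict-evaluation variant of the Y combinator (often called the Z combinator). For brevity the factorial here runs over native Python integers rather than Church numerals, and the names `ZERO`, `SUCC`, `MUL`, `to_int`, `Y` and `fact` are my own choices for illustration.

```python
# Church numerals: the numeral n means "apply f to z, n times".
ZERO = lambda f: lambda z: z
SUCC = lambda n: lambda f: lambda z: f(n(f)(z))
MUL = lambda n: lambda m: lambda f: n(m(f))

TWO = SUCC(SUCC(ZERO))
THREE = SUCC(TWO)


def to_int(n):
    """Convert a Church numeral to a Python int by counting applications."""
    return n(lambda x: x + 1)(0)


# Strict fixed-point combinator: the inner lambda v delays x(x) so that
# Python's eager evaluation does not loop forever.
Y = lambda F: (lambda x: F(lambda v: x(x)(v)))(lambda x: F(lambda v: x(x)(v)))

# Factorial without naming itself: recursion comes entirely from Y.
fact = Y(lambda f: lambda n: 1 if n == 0 else n * f(n - 1))

if __name__ == "__main__":
    print(to_int(MUL(TWO)(THREE)))  # 6
    print(fact(5))                  # 120
```

The fully encoded version replaces the conditional, the arithmetic and the integers themselves with lambdas as well.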

Read below for more.

]]>Justice prevailed in the end for my family, but it was a harrowing introduction to a facet of society with which I had little experience.

Recently, I found myself at a neighborhood security meeting after a spate of break-ins, and I ended up recounting how my old neighborhood banded together, protected itself and cleaned up the block.

To avoid repeating myself, I’m writing down what I learned.

**Disclaimer**: This is based on our personal experience in my city, and with
the U.S. legal system. Please take caution when applying this advice to other
crimes and jurisdictions. Consulting a lawyer is advised.