The value of Test Driven Development in data analysis

August 10, 2020 • PD Schloss • 6 min read

In last Monday’s episode of Code Club I did something that I do regularly: I screwed up. Rasmus Kirkegaard noticed the problem and was kind enough to point it out in the comments to that episode. Part of my motivation for making these videos, and of my general approach to teaching, is to normalize mistakes. There are a lot of YouTube tutorials on how to use grep, sed, and every other bash command. The problem is that they’re presented in a way that is too abstract. Those tutorials don’t show how those commands interact with the output from other commands. They’re highly edited and don’t show typos, goofs, or how to identify and solve problems.

Not mine! If you’ve watched along, you’ve seen numerous typos and redos. These are not part of the act. They’re the reality of how I, and everyone else, analyze data. What I have found is that experience brings the ability to diagnose and solve problems. This process is also missing from most tutorials. In today’s episode of Code Club, I’m going to show you how I investigated the problem Rasmus pointed out and resolved it. I’m sure I would have found the problem eventually, but by doing these tutorials publicly, we were able to figure it out much faster. Thanks, Rasmus!

As we work through the steps of a data analysis, we can get a little cocky. We think things are going to work the way we expect. You may recall in last Monday’s episode that there were a few sequences that started and ended with periods to indicate missing data. I wrote a sed command to convert those periods to hyphens to represent gaps in the alignment. As we were going through my solutions to the exercises, I found a bug in that sed command and thought I fixed it. I even checked the output. Unfortunately, I was flustered and rushing. Instead of checking the output for a region that had those leading periods, I checked the output of the full-length sequences, which did not have the problem.

Today I’m going to present the approach that I should have taken. It’s related to a concept that is commonly used in programming called Test Driven Development (TDD). It isn’t as widely used in data analysis, but there are ideas in Test Driven Development that we can draw from to make our analysis more robust. The idea is that we start with a set of tests that fail. We then write code to generate output that passes the test. If we later find a situation that produces the wrong output, then we add that situation to our set of tests and modify the code so the test passes. Because modifying the code can cause other tests to fail, every time the code is updated the tests are rerun. This sounds a bit like make, right? Programming languages including R and Python have frameworks that make Test Driven Development much easier to execute. I’m not aware of such a framework for bash. In today’s episode we’re going to figure out where the problem is, create a set of test sequences that trigger the problem, and then modify our code to resolve the problem. Along the way we’ll learn more about sed and grep.
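To make the idea concrete, here is a minimal sketch of what one such test could look like in plain bash. It is not the code from the episode: the function name periods_to_gaps, the sequence names, and the test sequences are all made up for illustration. The test feeds in one sequence with leading and trailing periods and one full-length sequence, then compares the output to what we expect by hand.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not from the episode): replace periods with hyphens
# on sequence lines only, leaving header lines (starting with ">") alone.
periods_to_gaps() {
    sed '/^[^>]/ s/\./-/g'
}

# Made-up test input: one sequence with leading and trailing periods,
# one full-length sequence, and headers that contain a period.
input=$(printf '>seq1.1\n..AC-GT..\n>seq2.1\nACGTACGTA\n')

# What we expect by hand: periods in sequences become hyphens,
# headers stay exactly as they were.
expected=$(printf '>seq1.1\n--AC-GT--\n>seq2.1\nACGTACGTA\n')

observed=$(printf '%s\n' "$input" | periods_to_gaps)

if [ "$observed" = "$expected" ]; then
    echo "PASS: periods converted to hyphens"
else
    echo "FAIL: output did not match expectation"
fi
```

The important design choice is that the test input includes the awkward case (leading periods) as well as the easy case (a full-length sequence). If the function were changed later, rerunning this check would show right away whether either case broke.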

Even if you’re only watching this video to learn more about bash commands and don’t know what a 16S rRNA gene is, I’m sure you’ll get a lot out of today’s video. Please take the time to follow along on your own computer and attempt the exercises. Don’t worry if you aren’t sure how to solve the exercises; I will provide solutions at the end of the video. If you haven’t been following along but would like to, please check out the notes below where you’ll find instructions on catching up, reference notes, and links to supplemental material. You can find my version of the project on GitHub.

Important things to remember

grep

sed

Installations

If you haven’t been following along, you can get caught up by doing the following:

Exercises

1. Write a pipeline to generate a fasta file that contains the 21 copies of the V4 region of the 16S rRNA gene from Photobacterium damselae (GCF_003130755.1)

grep -A 1 "GCF_003130755.1" data/v4/rrnDB.align | grep -v "^--$" > photobacterium_damselae.fasta

2. Write a sed statement to unalign the sequences in data/v4/rrnDB.align. How would you test that it works?

sed "/^[^>]/ s/[.-]//g" test.fasta
sed "/^[^>]/ s/[.-]//g" data/v4/rrnDB.align
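One way to answer the "how would you test it" part is to build a tiny file where the right answer is obvious by eye. The sketch below is an assumption about what such a test file might contain: the sequence names and sequences are invented, and a temporary file stands in for test.fasta. It runs the same sed statement and compares the result to the expected unaligned output.

```bash
#!/usr/bin/env bash

# Temporary stand-in for test.fasta with made-up sequences:
# one with periods and hyphens, one with no gaps at all.
test_file=$(mktemp)
printf '>GCF_000000000.1_test\n..AC-GT--\n>GCF_000000000.1_full\nACGTACGT\n' > "$test_file"

# Expected result: gap characters removed from sequence lines,
# while the period in each header line is left alone.
expected=$(printf '>GCF_000000000.1_test\nACGT\n>GCF_000000000.1_full\nACGTACGT\n')

# Same sed statement as in the solution above
observed=$(sed "/^[^>]/ s/[.-]//g" "$test_file")

if [ "$observed" = "$expected" ]; then
    echo "PASS: sequences unaligned, headers untouched"
else
    echo "FAIL: unexpected output"
    printf '%s\n' "$observed"
fi

rm -f "$test_file"
```

Including a header with a period in it matters here, because a sed statement without the /^[^>]/ address would strip the period from the accession number as well.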

3. Add a test statement to code/extract_region.sh to make sure that none of the sequences have periods in them.
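A possible shape for such a test is sketched below. I'm assuming nothing about what code/extract_region.sh looks like, so the function name check_no_periods and the demo files are hypothetical. The check skips the header lines, counts the sequence lines that still contain a period, and fails if that count is not zero.

```bash
#!/usr/bin/env bash

# Hypothetical check (names are made up): succeed only if no sequence
# line in the fasta file given as $1 contains a period.
check_no_periods() {
    local n
    # Drop header lines, then count the remaining lines with a period
    n=$(grep -v "^>" "$1" | grep -c "[.]")
    if [ "$n" -ne 0 ]; then
        echo "FAIL: $n sequence line(s) contain periods" >&2
        return 1
    fi
    echo "PASS: no periods in sequences"
}

# Demo with two made-up files: one clean, one with leading periods
good=$(mktemp)
bad=$(mktemp)
printf '>seq1.1\n--ACGT--\n' > "$good"
printf '>seq1.1\n..ACGT--\n' > "$bad"

check_no_periods "$good"
check_no_periods "$bad" || echo "caught the problem file"
```

In a real script, the call to the check would go after the step that writes the output file, followed by something like `|| exit 1` so that a failure stops the script instead of being silently ignored.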