A computational perspective on the exploration and analysis of genomic and genome-scale information. Provides an integrated introduction to genome biology, algorithm design and analysis, and probabilistic and statistical modeling. Topics include genome sequencing, genome sequence assembly, local and global sequence alignment, sequence database search, gene and motif finding, phylogenetic tree building, and gene expression analysis. Methods include dynamic programming, indexing and hashing, hidden Markov models, and elementary supervised and unsupervised machine learning. Development of practical experience with handling, analyzing, and visualizing genomic data using the computer language Python.
The course will require students to program often in Python. Students coming into the course must already know how to program in some computer language, but it need not be Python. Additionally, students should be comfortable with mathematical formulas and should have had some exposure to basic probability and molecular or cellular biology; however, the course has no formal course prerequisites, and much relevant background will be provided. Please speak to the instructor if you are unsure about your background. This course is a valid elective in both biology and computer science.
Professor Alex Hartemink
Alex Hartemink, Instructor
Jianling Zhong, TA
Kaixuan (Kevin) Luo, TA
Abigail Lin, UTA
David Kornberg, UTA
Joseph Kleinhenz, UTA
Kyle Moran, UTA
If these office hours do not work for you, please post questions via Piazza, or send any of us an email to schedule an alternate time.
The class meets on Tuesdays and Thursdays 10:05–11:20AM in 116 Old Chemistry (on the Main West Quad, near Bostock Library).
Note: The course schedule may change slightly from time to time. Always check the web page for the most up-to-date schedule.
(out/due on Fridays)
|1||Tue 26 Aug||AH||Course introduction; SARS genome introduction|
|2||Thu 28 Aug||AH||Molecular biology primer: DNA, RNA, and protein||PS1 out|
|3||Tue 02 Sep||AH||Gene/genome organization; SARS genome revisited|
|4||Thu 04 Sep||AH||Algorithms and their analysis|
|5||Tue 09 Sep||AH||Algorithm design; Divide-and-conquer introduction|
|6||Thu 11 Sep||AH||Divide-and-conquer exploration||PS1 due; PS2 out|
|7||Tue 16 Sep||AH||Divide-and-conquer fails; Memoization|
|8||Thu 18 Sep||AH||Dynamic programming; Greedy algorithms|
|9||Tue 23 Sep||AH||Sequence variation and the global alignment problem|
|10||Thu 25 Sep||AH||Traceback; Aligning sequences with affine gap scores||PS2 due; PS3 out|
|11||Tue 30 Sep||AH||Affine gap alignment; Local alignment|
|12||Thu 02 Oct||AH||Database similarity searching; FASTA and BLAST heuristics|
|13||Tue 07 Oct||AH||DNA sequencing and the genome assembly problem|
|14||Thu 09 Oct||AH||Human Genome Project and Celera; Next-generation sequencing||PS3 due; PS4 out|
|Tue 14 Oct||FALL BREAK — enjoy!|
|15||Thu 16 Oct||AH||Short-read mapping; Suffix trees|
|16||Tue 21 Oct||AH||Probability; Discrete and continuous random variables; Infinity|
|17||Thu 23 Oct||AH||Joint, marginal, conditional; Bayes rule; Models||PS4 due; PS5 out|
|18||Tue 28 Oct||AH||Parameter estimation; ML, MAP, PME|
|19||Thu 30 Oct||AH||Factoring; Graphical models; Markov models|
|20||Tue 04 Nov||AH||Hidden Markov models; Viterbi decoding|
|21||Thu 06 Nov||AH||Viterbi traceback; Posterior decoding||PS5 due; PS6 out|
|22||Tue 11 Nov||AH||Estimating HMM parameters; Baum-Welch|
|23||Thu 13 Nov||AH||HMMs for finding spliced genes; Profile HMMs for protein families|
|24||Tue 18 Nov||AH||Tree of life and phylogenomics|
|25||Thu 20 Nov||AH||Building phylogenetic trees (UPGMA and NJ)||PS6 due; PS7 out|
|26||Tue 25 Nov||AH||Supervised machine learning: classification|
|Thu 27 Nov||THANKSGIVING BREAK — give thanks!|
|27||Tue 02 Dec||AH||Unsupervised machine learning: clustering|
|28||Thu 04 Dec||AH||Course summary and evaluations||PS7 due|
AH: Alex Hartemink
This is an optional challenge for students interested in applying what we have learned in class to a real computational genomics research problem; practicing the skills of using Python or R (or any other tool you wish) to visualize, analyze, model, and interpret real genomic data; and exploring the science linking chromatin structure and transcriptional regulation. Since this problem represents an open challenge for the genomics community, you are free to choose the approaches you use to analyze the data, as well as the questions you explore. Creative projects are highly encouraged. You may work in small teams (2–3 is ideal). For all submissions we receive by the deadline of 15 January 2015, we will provide feedback, and will also designate a best project as well as a most creative project. There will be (simple) prizes!
In this data expedition challenge, we will explore next-generation sequencing reads from MNase-seq experiments in yeast. The data were generated to detect genome-wide binding locations of various kinds of DNA-binding proteins. The MNase-seq data sets were collected at Duke as part of our ongoing computational genomics research collaboration with the lab of Prof. David MacAlpine in the Department of Pharmacology and Cancer Biology.
DNA-binding proteins, including nucleosomes and transcription factors (TFs), play essential roles in gene regulation, and their locations along the genome help give us clues about how genes are regulated. Recently, a new MNase-seq protocol was developed by the MacAlpine group at Duke in conjunction with the Henikoff group at the University of Washington.  The basic idea is that genomic locations not bound by proteins are accessible to micrococcal nuclease (MNase) and are therefore more sensitive to MNase digestion. Conversely, genomic locations bound by proteins are less sensitive to MNase digestion.
Consequently, if we sequence the ends of the fragments that remain after MNase digestion, and map the paired sequencing reads that arise, we should be able to see where MNase was able to digest/cut the genome, revealing something about the binding locations of DNA-binding proteins along the genome. It is important to note that the genome of each individual cell in a population may be in a slightly different occupancy/protection state. We collect data from a population of cells so this experiment is sampling the different protection states present in the cell population.
Complicating the issue further, MNase is also known to have a nucleotide-specific bias as it digests DNA, meaning that it tends to cleave/digest certain sequences more than others. For example, it prefers to digest A/T nucleotides compared to G/C (its bias is actually a bit more subtle/complex than that, which makes for a nice model selection challenge you can explore: what is the simplest model that captures this bias well?). To give you further information about this sequence bias, we are also providing MNase digestion data of naked (deproteinized) DNA in vitro, which will allow for the development of models to quantify such bias (because with these data, the variation in cutting you see results only from MNase interacting with the naked DNA and is not influenced by protein protection).
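As a first, very simple way to probe this bias, one might compare the dinucleotides straddling observed cleavage sites against genome-wide background frequencies. The sketch below is only a starting point, and it assumes cleavage positions are given as 0-based indices of the base just after each cut; richer models (longer k-mers, strand-specific counts) are left for you to explore.

```python
from collections import Counter

def cleavage_dinucleotide_bias(genome, cut_positions):
    """Return enrichment ratios of dinucleotides straddling cleavage
    sites, relative to genome-wide background frequencies."""
    cut_counts = Counter()
    for pos in cut_positions:
        if 1 <= pos <= len(genome) - 1:
            # dinucleotide with one base on each side of the cut
            cut_counts[genome[pos - 1:pos + 1]] += 1
    # genome-wide background dinucleotide frequencies
    bg_counts = Counter(genome[i:i + 2] for i in range(len(genome) - 1))
    total_cut, total_bg = sum(cut_counts.values()), sum(bg_counts.values())
    return {d: (cut_counts[d] / total_cut) / (bg_counts[d] / total_bg)
            for d in cut_counts}
```

Ratios well above 1 for a dinucleotide would suggest MNase prefers to cut there; running this on the naked-DNA data is one way to start quantifying the bias before looking at the in vivo data.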
Usually, sequencing reads are stored in files in fastq format. In this case, we downloaded two large yeast MNase-seq read fastq files: in vivo yeast MNase-seq reads generated by Henikoff et al., and in vitro yeast MNase-seq reads generated by Deniz et al. for use in quantifying MNase digestion bias. Both files contain short sequencing reads, of lengths 25 and 54 base pairs, respectively. The total number of reads in each file is on the order of 100 million. For reference, the yeast genome contains 16 chromosomes whose total size is approximately 12.5 million base pairs.
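If you want to inspect raw fastq files yourself, a minimal parser can be sketched as follows; it assumes the simple four-lines-per-record layout (header, sequence, '+' separator, qualities) with no line wrapping, which holds for most modern short-read files.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from an iterable of
    FASTQ-formatted lines, assuming four lines per record."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                # '+' separator line (ignored)
        qual = next(it).strip()
        yield header.strip().lstrip('@'), seq, qual
```

In practice you would pass this an open file handle rather than a list, and for files with ~100 million reads you will want to process records as they stream by rather than storing them all in memory.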
To analyze those sequencing reads, you would typically first need to map the reads to a reference genome, using a tool like Bowtie. However, to simplify this challenge, we have already performed this mapping step for you. We are thus providing you one tab-delimited text file for each of the first 12 yeast chromosomes, named ChrI through ChrXII (yeast geneticists like Roman numerals); we will reserve the remaining 4 yeast chromosomes to evaluate your submitted results. Each file contains the start and end genome coordinates of all the reads mapped to that chromosome, one read per line. You may notice that the distances between the start and end coordinates are larger than 25 or 54 base pairs. That is because the MNase-seq experiments produce paired-end reads, and we are indicating the coordinates of the spanned fragment from which the two reads come. So the provided coordinates are the start coordinate of one read along with the start coordinate of its mated read on the opposite strand; or, put another way, the first and last nucleotides of the fragment.
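As a starting point, a fragment-length histogram (often revealing, for example, a peak around nucleosome-sized fragments) can be tallied from these files with a few lines of Python. This sketch assumes the start and end coordinates are the first two tab-separated columns and that both endpoints are inclusive, as described above; check the files themselves to confirm.

```python
from collections import Counter

def fragment_length_histogram(lines):
    """Tally fragment lengths from tab-delimited lines of start/end
    genome coordinates (one fragment per line, inclusive endpoints)."""
    hist = Counter()
    for line in lines:
        start, end = map(int, line.split('\t')[:2])
        hist[end - start + 1] += 1
    return hist
```

Plotting this histogram (e.g., with matplotlib) is a natural first look at the data before building any models.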
It is reasonable to think of the start and end coordinates as nucleotides just beyond which MNase cleaved the DNA, while the sequence between the start and end coordinates was not digested by MNase. We also provide the whole yeast genome sequence (sacCer2 2008 version, in separate fasta files) if you wish to extract the actual sequence around the cleavage sites based on the provided coordinates.
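A sketch of pulling out the sequence around a cleavage coordinate might look like the following; it assumes the fasta file contains a single record per chromosome and that the provided coordinates are 1-based, so verify both assumptions against the provided files.

```python
def read_fasta(lines):
    """Concatenate the sequence lines of a single-record FASTA file,
    skipping the '>' header line."""
    return ''.join(l.strip() for l in lines if not l.startswith('>'))

def flanking_sequence(chrom_seq, cut_coord, flank=3):
    """Return `flank` bases on each side of a cleavage coordinate
    (assumed 1-based), plus the base at the coordinate itself."""
    i = cut_coord - 1                                  # convert to 0-based
    lo, hi = max(0, i - flank), min(len(chrom_seq), i + flank + 1)
    return chrom_seq[lo:hi]
```

Extracting such windows for every start and end coordinate in the mapped-read files, for both the in vivo and naked-DNA data, gives you the raw material for modeling MNase's sequence preference.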
All data files for this challenge are available from:
You will need to do some independent exploration to figure out what to do next. You may want to read more about the MNase enzyme, how it works, and what is known about it. You will probably also want more information about the MNase-seq protocol, as described in the original paper. Then you can start exploring one or more of the following, depending on what suits your fancy, or you may have other ideas of your own:
Good luck, and have fun on this expedition!
An overview of the DFS and BFS algorithms for visiting the nodes of a graph.
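For reference, both traversals can be sketched in a few lines of Python, with the graph represented as an adjacency-list dictionary mapping each node to a list of its neighbors.

```python
from collections import deque

def bfs_order(graph, start):
    """Visit nodes reachable from `start` in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order

def dfs_order(graph, start, seen=None):
    """Visit nodes reachable from `start` in depth-first order
    (recursive formulation)."""
    if seen is None:
        seen = set()
    seen.add(start)
    order = [start]
    for nbr in graph[start]:
        if nbr not in seen:
            order.extend(dfs_order(graph, nbr, seen))
    return order
```

The only structural difference is the frontier data structure: BFS uses a queue (FIFO) and DFS uses a stack, here implicit in the recursion.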
A careful description of the algorithm for finding the closest pair of points in O(n log n) time. This is from the 2nd edition of "Introduction to Algorithms" (fondly known as CLRS).
Here is the SARS genome handout from class. Here is a text file containing the SARS genome (Tor2 isolate). Have fun parsing it! You can also find it, and a lot more information about it, in GenBank: visit the Genbank entry and see what else you can learn.
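If you'd like a head start on the parsing, here is one sketch; it assumes the file is FASTA-like (header lines beginning with '>'), and it also skips digits and whitespace so it would tolerate a GenBank-style ORIGIN block. It reports the genome's length and base composition.

```python
def genome_stats(text):
    """Extract the nucleotide sequence from FASTA- or GenBank-style
    text and return (length, per-base counts)."""
    seq = ''.join(c for line in text.splitlines()
                  if not line.startswith('>')
                  for c in line.upper() if c in 'ACGT')
    counts = {b: seq.count(b) for b in 'ACGT'}
    return len(seq), counts
```

Compare what this reports against the sequence length listed in the GenBank entry as a sanity check on your parsing.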
If you want a little more clarity about how Python creates variables, populates them, and passes them around, or if you want to visualize your code in action for debugging purposes, check out Python Tutor. You can study their code examples as they execute, or paste in your own code.
Here are a few different kinds of resources for those with less biology background, ranging from the comprehensive to a basic overview:
The various books mentioned in class are summarized here; each is linked to Amazon where you can read more (these are not affiliate links). Note that none of these books is compulsory for the class, though you may benefit from one or more. As for the books on Python, many resources are now available free online, even complete textbooks downloadable as PDFs (you'll save trees (unless you print them)).
Let's try snarfing and running your first Python program in Eclipse.
First, ensure that you have the right perspective in Eclipse. The Python perspective will give you a less cluttered set of windows with the smaller PyDev Package Explorer on the left and the main editor window on the right.
The Ambient plug-in allows you to browse for and download code online using a tool called Snarf. For each problem set, we will provide you with some code as a framework and possibly some data files, and Snarf will allow you to import these files into your local copy of Eclipse. To snarf your first program, follow the directions below.
You can repeat step 2 every time you edit and save the program.
For each assignment, we will provide a codebase for you to work from, which you will always be able to import by snarfing it into Eclipse.
NOTE: simple Python documentation is available from within Eclipse—just hover over a Python keyword and a tooltip will pop up with a short description.
Now modify the program and then submit the code from within Eclipse.
You can submit as many times as you like, and everything will be stored on the server each time. Thus, if you realize that you did something wrong at the last minute, you can simply resubmit and we'll have both. In general, we will only look at your last submission, so when you resubmit a project, please resubmit all the relevant files, not just the ones you modified.
All students are expected to abide by generally accepted standards of academic integrity. This includes all the various aspects of Duke's Community Standard. Violations of academic integrity will be taken very seriously. In particular, be reminded that it is not acceptable to take the ideas/work of another and pass it off as one's own, even if paraphrased. Ideas/work taken from others, whether peers in the class or not, must always be appropriately cited.
Unless collaboration is expressly permitted in the problem set, all problems should be completed individually. However, if you have worked for a while on a particular problem and have encountered a mental wall, and if you have banged your head against said wall for a while, we provide mechanisms where you can consult others to make progress, rather than giving up entirely. Your first course of action is to post a question on Piazza, or to speak to the instructor or TAs.
If for any reason you consult your peers outside of Piazza, it should remain understood that such an interaction must be one of consultation and not collaboration: hints to help overcome a small obstacle rather than answers—after consultation, it is expected that you should still have plenty of thinking to do. In addition, if you do happen to consult with another student, both of you must cite this.
Students generally have two weeks to work on problem sets—not because two weeks are generally required to finish, but 1) to allow students who start early sufficient time to reflect/ruminate on problems where an impasse has been reached (the thought process through which students go while solving a problem often includes some gestation period before things become clear) and 2) to provide flexibility as to when students complete their work while they juggle other requirements and commitments during the semester.
Given this latter point, students should not request extensions for turning in their work beyond the two weeks already allotted. However, this rule has two exceptions:
If you turn in work after the deadline but have already used your free extension, or if you are using your free extension but still turn in work after the extension deadline itself, we will take off 10 points for every 12 hours late (rounded up). So if you are 0–12 hours late, that would be –10; if you are 12–24 hours late, that would be –20, etc. Also, because we send work out to be graded using automated scripts, you must email the instructor any time you turn in late work, and work will not be graded if it is turned in after 5pm on Monday (more than 72 hours late without an extension, or more than 24 hours late with an extension).
We have designed problem sets in the class to permit you to explore the material, and to develop deeper understanding of the material through that exploration. I ask you to focus on the ideas and the learning rather than on the points and the credit; put another way, consider adopting a perspective of how you can work to satisfy yourself rather than work to satisfy me.
All that said, when it comes to grading, we still need to assign points and credit: that is unfortunately unavoidable. However, we have designed our approach to assigning credit in an attempt to be consistent with the perspective of the previous paragraph, and the approach is perhaps a little different from what you may be familiar with in other classes. Specifically, I have asked the TAs to frame their grading in terms of ‘positive earning’ rather than ‘negative error’.
What do I mean by this? Well, a ‘negative error’ approach is one in which one assumes one's work will earn 100 points unless there are mistakes present. Under such an approach, graders are negatively tasked with finding mistakes and errors, and taking away points for any they find.
I have inverted this by choosing to adopt a ‘positive earning’ approach in which an empty problem set earns 0 points, and students earn more points as they demonstrate deeper levels of mastery of the material and challenge. Under such an approach, graders are instead positively tasked with finding ways to give students credit for engaging the material.
A corollary of the ‘negative error’ approach is that unless a student makes a mistake, they are entitled to earn the full 100 points. Conversely, a corollary of the ‘positive earning’ approach is that it is possible for a student to not make any mistakes yet still not earn the full 100 points. For example, this can happen if a student does the bare minimum to engage the material, and while not making any mistakes, never demonstrates mastery or depth of understanding. Our ‘positive earning’ approach not only focuses on the positive instead of the negative, but it also leaves room to grant more credit to students who engage the material more deeply.
I write all this because if you find that you did a problem without making a mistake, but got only +16 when some other student may have gotten +18, it doesn't necessarily mean that something is wrong (though it might be). It could mean that there were some interesting ways to engage the problem you didn't explore that the other student did. An analogy might be from a video game like Mario Brothers: you can successfully rescue the princess but still not end up with the highest score because someone can score higher if they take the time to flip more turtles or punch more coins along the way.
We use rubrics to apply these judgments consistently across the class, and we insist that each problem is graded by a single grader. Because this approach is not necessarily part of the norm on campus, we also tend to do this only in small doses; that is, students who do everything ‘correctly, making no mistakes’ will still score very highly. But earning the full 100 points usually requires more than just ‘no mistakes’; it also requires demonstration of mastery and engagement.
Grades for all work will be recorded and available to students via the course Sakai site. Posting grades will be our only use of Sakai.
This term we will be using Piazza for course announcements, communication, and discussion. The Piazza system is highly catered to getting you help quickly and efficiently from classmates, the TAs, and the instructor. Rather than emailing questions to the teaching staff, please post your questions on Piazza so everyone can benefit from the responses.
You can find our class posting page at: https://piazza.com/class#fall2014/compsci260.