COMPSCI 260: Introduction to Computational Genomics
A computational perspective on the exploration and analysis of genomic and genome-scale information. Provides an integrated introduction to genome biology, algorithm design and analysis, and probabilistic and statistical modeling. Topics include genome sequencing, genome sequence assembly, local and global sequence alignment, sequence database search, gene and motif finding, phylogenetic tree building, and gene expression analysis. Methods include dynamic programming, indexing and hashing, hidden Markov models, and elementary supervised and unsupervised machine learning. Development of practical experience with handling, analyzing, and visualizing genomic data using the computer language Python.
The course will require students to program often in Python. Students coming into the course should know how to program in some language, though it need not be Python. Students should be comfortable with mathematical formulas and should have had some exposure to basic probability and molecular or cellular biology; however, the course has no formal prerequisites, and significant background will be provided. Please speak to the instructor if you are unsure about your background. This course is a valid elective in both biology and computer science.
Professor Alex Hartemink
Alex Hartemink, Instructor
Email: amink at cs.duke.edu
Office Hours: Tue 11:20a-12:30p, walking back from class and then in LSRC D239; Thu 11:20a-11:50a, immediately after class; and also by appointment.
Jianling Zhong, TA
Email: zhong at cs.duke.edu
Office Hours: Fri 3:00p-4:00p, in LSRC D309
Michael Mayhew, TA
Email: michael.mayhew at duke.edu
Office Hours: Mon 5:00p-7:00p, in Twinnie's
Anna Liu, UTA
Email: yanjun.liu at duke.edu
Office Hours: Mon 7:00p-9:00p, in Keohane 4E common room (glass box)
Chinmay Patwardhan, UTA
Email: chinmay.patwardhan at duke.edu
Office Hours: Wed 11:00a-1:00p, in the Link (map)
Thanh-ha Nguyen, UTA
Email: thanhha.nguyen at duke.edu
Office Hours: Tue 7:00p-9:00p, in the Link (map)
If these office hours do not work for you, please post questions via Piazza, or send any of us an email to schedule an alternate time.
The class meets on Tuesdays and Thursdays 10:05–11:20AM in 116 Old Chemistry (on the Main West Quad, near Bostock Library).
Note: The course schedule may change subtly from time to time. Always check the web page for the most up-to-date schedule.
||Date||Topic||Problem sets||
||Tue 27 Aug||Course introduction; SARS genome introduction|| ||
||Thu 29 Aug||Molecular biology primer: DNA, RNA, and protein|| ||
||Tue 03 Sep||Gene/genome organization; SARS genome revisited|| ||
||Thu 05 Sep||Algorithms and their analysis|| ||
||Tue 10 Sep||Algorithm design; Divide-and-conquer introduction|| ||
||Thu 12 Sep|| ||PS1 due; PS2 out||
||Tue 17 Sep||Divide-and-conquer fails; Memoization|| ||
||Thu 19 Sep||Dynamic programming; Greedy algorithms|| ||
||Tue 24 Sep||Sequence variation and the global alignment problem|| ||
||Thu 26 Sep||Traceback; Aligning sequences with affine gap scores||PS2 due; PS3 out||
||Tue 01 Oct||Affine gap alignment; Local alignment|| ||
||Thu 03 Oct||Database similarity searching; FASTA and BLAST heuristics|| ||
||Tue 08 Oct||DNA sequencing; Genome assembly; Human Genome Project and Celera|| ||
||Thu 10 Oct||Next-generation sequencing; Short-read mapping||PS3 due; PS4 out||
||Tue 15 Oct||FALL BREAK: enjoy!|| ||
||Thu 17 Oct||Short-read mapping; Suffix trees|| ||
||Tue 22 Oct||Tree of life and phylogenomics|| ||
||Thu 24 Oct||Building phylogenetic trees (UPGMA and NJ)||PS4 due; PS5 out||
||Tue 29 Oct||Unsupervised machine learning: clustering|| ||
||Thu 31 Oct||Supervised machine learning: classification|| ||
||Tue 05 Nov||Probability; Discrete and continuous random variables; Infinity|| ||
||Thu 07 Nov||Joint, marginal, conditional; Bayes rule||PS5 due; PS6 out||
||Tue 12 Nov||Models; Parameter estimation; ML, MAP, PME|| ||
||Thu 14 Nov||Bayesian networks; Markov and hidden Markov models|| ||
||Tue 19 Nov|| || ||
||Thu 21 Nov|| ||PS6 due; PS7 out||
||Tue 26 Nov||Estimating HMM parameters; Baum-Welch; HMMs for finding spliced genes|| ||
||Thu 28 Nov||THANKSGIVING BREAK: give thanks!|| ||
||Tue 03 Dec||PSSMs; Profile HMMs|| ||
||Thu 05 Dec||Course summary and evaluations|| ||
AH: Alex Hartemink
- GENSCAN: Burge and Karlin 1997; as an aside, the senior author of the GENSCAN paper, Samuel Karlin, is the same fellow that helped develop the significance statistics for BLAST database searching
Tree of life papers
Genome sequencing technology papers
Papers debating the merits of shotgun sequencing the whole human genome
Papers reporting a newly sequenced genome
Seminal papers developing sequence alignment and database search
An overview of the DFS and BFS algorithms for visiting the nodes of a graph.
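For readers who prefer code to prose, here is a minimal sketch of both traversals in Python. It is not taken from the linked overview: the adjacency-list dictionary, the function names, and the toy graph are all assumptions made for illustration.

```python
from collections import deque


def dfs(graph, start):
    """Return nodes in the order an iterative depth-first search visits them."""
    visited, order = set(), []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push neighbors in reverse so the first-listed neighbor is explored first.
        for neighbor in reversed(graph.get(node, [])):
            if neighbor not in visited:
                stack.append(neighbor)
    return order


def bfs(graph, start):
    """Return nodes in the order a breadth-first search visits them."""
    visited, order = set([start]), [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical toy graph: each key maps to the list of its neighbors.
toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}
print(dfs(toy_graph, "A"))
print(bfs(toy_graph, "A"))
```

The only structural difference between the two is the container holding the frontier: a stack (last in, first out) gives depth-first order, while a queue (first in, first out) gives breadth-first order.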
Closest pair of points
A careful description of the algorithm for finding the closest pair of points in O(n log n) time. This is from the 2nd edition of "Introduction to Algorithms" (fondly known as CLRS).
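As a companion to the reading, the sketch below shows one way the divide-and-conquer idea might look in Python. It is a variant rather than a transcription of the textbook's presentation: instead of presorting by y and splitting that list, each recursive call returns its points sorted by y and the two halves are merged, which still gives O(n log n) overall. All function names are hypothetical, and the code returns only the distance, not the pair itself.

```python
import math


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def brute_force(points):
    """Check every pair; used for tiny inputs and as a reference answer."""
    best = float("inf")
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            best = min(best, dist(points[i], points[j]))
    return best


def merge_by_y(a, b):
    """Merge two lists that are each already sorted by y-coordinate."""
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][1] <= b[j][1]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


def _closest(px):
    """px is sorted by x. Return (best distance, the same points sorted by y)."""
    n = len(px)
    if n <= 3:
        return brute_force(px), sorted(px, key=lambda p: p[1])
    mid = n // 2
    mid_x = px[mid][0]
    d_left, left_y = _closest(px[:mid])
    d_right, right_y = _closest(px[mid:])
    best = min(d_left, d_right)
    py = merge_by_y(left_y, right_y)
    # Only points within `best` of the dividing line can form a closer pair.
    strip = [p for p in py if abs(p[0] - mid_x) < best]
    for i in range(len(strip)):
        # Each strip point needs comparing with at most the next 7 in y-order.
        for j in range(i + 1, min(i + 8, len(strip))):
            if strip[j][1] - strip[i][1] >= best:
                break
            best = min(best, dist(strip[i], strip[j]))
    return best, py


def closest_pair_distance(points):
    """Smallest distance between any two of the given (x, y) points."""
    best, _ = _closest(sorted(points))
    return best


print(closest_pair_distance([(0, 0), (3, 4), (10, 10), (3, 5)]))
```

Comparing the result against `brute_force` on random inputs is a cheap way to gain confidence in an implementation like this before trusting it on larger data.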
Severe acute respiratory syndrome (SARS)
Here is the SARS genome handout from class. Here is a text file containing the SARS genome (Tor2 isolate). Have fun parsing it! You can also find it, and a lot more information about it, in GenBank: visit the GenBank entry and see what else you can learn.
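If you are unsure where to start with parsing, the sketch below may help. It assumes the text file is in FASTA format (a ">" header line followed by lines of sequence), which this page does not actually state, so check the file first. The function names and the toy record are made up for illustration; the toy record is not real SARS sequence data.

```python
def parse_fasta(lines):
    """Return (header, sequence) from the lines of a single-record FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
        else:
            chunks.append(line.upper())
    return header, "".join(chunks)


def base_counts(sequence):
    """Count how many times each character occurs in the sequence."""
    counts = {}
    for base in sequence:
        counts[base] = counts.get(base, 0) + 1
    return counts


# Made-up toy record standing in for the real file (assumption: FASTA format).
toy_lines = [">toy_record hypothetical example", "ATGCGT", "acgtta"]
header, seq = parse_fasta(toy_lines)
print(header)
print(len(seq))
print(base_counts(seq))
```

Because `parse_fasta` accepts any iterable of lines, an open file handle can be passed in place of `toy_lines` once you have saved the genome file locally.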
Cool site for visualizing execution of Python code
If you want a little more clarity about how Python creates variables, populates them, and passes them around, or if you want to visualize your code in action for debugging purposes, check out this cool site. You can study their code examples, or paste in your own code.
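A small example of the kind of behavior worth stepping through in such a visualizer is sketched below. The function and variable names are hypothetical. The point it illustrates is that Python names refer to objects, so mutating a list inside a function is visible to the caller, while rebinding the local name is not.

```python
def append_base(sequence_list, base):
    # Mutates the list object the caller passed in.
    sequence_list.append(base)


def rebind(sequence_list):
    # Rebinds the local name only; the caller's list is untouched.
    sequence_list = ["N"]
    return sequence_list


bases = ["A", "C"]
alias = bases              # a second name for the same list object
append_base(bases, "G")    # the change is visible through both names
rebound = rebind(bases)    # bases itself is unchanged by this call

print(bases)
print(alias is bases)
print(rebound)
```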
Python tutorial slides
For shoring up your biology background
Here are a few different kinds of resources for those with less biology background, ranging from the comprehensive to a basic overview:
Textbooks mentioned in class
The various books mentioned in class are summarized here; each is linked to Amazon where you can read more (these are not affiliate links). Note that none of these books is compulsory for the class, though you may benefit from one or more. As for the books on Python, many resources are now available free online, even complete textbooks downloadable as PDFs (you'll save trees (unless you print them)).
- Introduction to Computational Genomics: A Case Studies Approach
- Hahn was a grad student at Duke in the lab of Greg Wray and took this very class once upon a time. I love the case study approach: it makes for very interesting reading. At the end of the day, I decided not to require the book because it's not a perfect match for the course, but in terms of overlap with the course content, it comes the closest of any book to date so it might be useful for some.
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
- This book covers a lot of the material that will be covered in the course and has a distinctly probabilistic focus. It is very well written but is at a fairly advanced level. It is an excellent reference for folks continuing on in this area.
- An Introduction to Bioinformatics Algorithms
- This book by Jones and Pevzner covers many of the topics that will be covered in class, but organizes the material around algorithmic methods rather than biological problems. For example, one chapter presents many problems from across bioinformatics that exploit string matching. While not complete for this course, it is a nice, readable, undergraduate textbook and is a bit more approachable than Pevzner's earlier book.
Directions for setting up Java, Eclipse, Python, and Eclipse plugins
- ensure you have Java installed (so you can run Eclipse)
- this may be the trickiest part, so read carefully for your specific operating system
- for Windows and Linux computers:
- you should always be running the latest version for security reasons; currently, this is Java 7, update 25 (sometimes called Java 1.7.0_25); you can check your current version from a terminal by typing "java -version"
- if you don't have the latest version already installed, download the latest JRE (Java Runtime Environment) by selecting the option "Free Java Download" from here (it should determine the proper one to download based on your browser)
- you only need the above JRE to run Eclipse, but if you want to write Java programs later, you can instead download the full JDK (Java SE Development Kit), which includes the JRE, from here (in this case, you will need to select the correct one for your operating system)
- in either case, after downloading, double-click the resulting file and follow the default options to install it
- for Macs, it's a bit more complicated; if you don't already have an updated, working Java installation, you have two choices, one being simpler and the other being newer:
- Simpler: use the latest version of Java provided by Apple (1.6.0_51); if you don't care about having version 7 of Java (and to be clear, there's no reason to for the purposes of this class), then you can download the latest version that Apple provides from here and everything should "just work"
- Newer: use the latest version of Java provided by Oracle (1.7.0_25); this is more complicated than the Apple Java solution because you need to create symlinks as root after installation; follow the directions above for Windows, except that you must 1) download the full JDK (Oracle's JRE does not seem to suffice on the Mac), and 2) add a symlink using the following commands: sudo mkdir /System/Library/Java/JavaVirtualMachines; sudo ln -s /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk /System/Library/Java/JavaVirtualMachines/1.6.0.jdk ; if you don't understand what those two commands are doing, then please choose the simpler Apple Java solution above
- as an aside, you do not need to allow Java to run in your browser to use it for Eclipse; feel free to disable Java in the browser for security
- to summarize, by the end of this step, you should have either a fully-updated version of Java 7 running (with a symlink on a Mac to make it think you're running Java 6), or a fully-updated version of Java 6 running on a Mac
- install Eclipse (an environment for writing and running your Python programs)
- the latest version is 4.3 (nicknamed Kepler)
- select the first option, "Eclipse Standard 4.3", for your operating system from here
- be sure to choose the 32- or 64-bit version to match the version of Java you have installed
- Eclipse is not packaged with an installer; simply unarchive it and move the resulting "eclipse" folder to C:\ (on Windows) or to Applications (on a Mac); on recent Macs, you may get a warning when you try to run "Eclipse.app" for the first time because it's unsigned; in this case, rather than double-clicking it, right-click and choose "Open" and you'll be asked if you really want to open the file; once you accept, you won't need to do this again: it will run by double-clicking from here on out
- install Python (so you can run the programs you write)
- IMPORTANT: though the latest stable version is 3.3.2, we will be using 2.7 in this class
- though Python 2.7 is already installed on recent Macs and most other Unix machines, we recommend everyone install a pre-compiled Python distribution for consistency; this further ensures that everyone has access to a similar set of Python packages
- download Enthought's "Canopy 1.1" package for your operating system from here; be sure the 32- or 64-bit version is selected to match the version of Java you have installed
- double-click the resulting file and follow the default options to install it; after installing it, you will need to run the Canopy application once to set up Python properly on your machine (you don't need to run it again; full details for Macs here; on recent Macs, you may get a warning when you do this for the first time because it's unsigned; in this case, rather than double-clicking it, right-click and choose "Open" and you'll be asked if you really want to open the file; once you accept, you won't need to do this again: it will run by double-clicking from here on out)
- install the PyDev plugin within Eclipse (so Eclipse can help you develop and run your Python programs)
- open Eclipse and access the Help menu
- select "Install New Software..."
- in the "Work with:" box, type "http://pydev.org/updates" and press Enter
- you may need to wait up to a minute until the "Pending..." is replaced by "PyDev" and "PyDev Mylyn Integration (Optional)"
- select only "PyDev" (by checking the box next to it) and click "Next >" down at the bottom
- follow the next steps to finish the installation using the defaults and agreeing to terms and conditions; IMPORTANT: you should see a window asking you to approve/trust a self-signed certificate from the PyDev publisher Brainwy; you will need to select the certificate's check-box and then approve it or Eclipse will not finish installing properly (though it might seem like it has)
- at the end, agree to restart Eclipse for changes to take effect
- install the Ambient plugin within Eclipse (so you can snarf and submit files for class)
- open Eclipse and access the Help menu
- select "Install New Software..."
- in the "Work with:" box, type "http://www.cs.duke.edu/csed/ambient/update" and press Enter
- you may need to wait a number of seconds until the "Pending..." is replaced by "Ambient"
- select "Ambient" (by checking the box next to it) and click "Next >" down at the bottom
- follow the next steps to finish the installation using the defaults and agreeing to terms and conditions; if you receive a warning about unsigned content, proceed anyway
- at the end, agree to restart Eclipse for changes to take effect
- connect Eclipse to your version of Python (so you can run Python programs within Eclipse)
- open Eclipse and access the Preferences Box (under Window>Preferences on Windows or Eclipse>Preferences on Mac)
- choose "Pydev" and "Interpreter - Python" from the sidebar
- press the "New ..." button to tell Eclipse about Python
- in the resulting dialog box, for the "Interpreter Name" type "Enthought" and for the "Interpreter Executable" type the path for your operating system, where UU is your user name on your machine and BB is the number of bits (32 or 64) for your Python installation (probably 64, but it depends on which version you installed; you should be able to find the right one by browsing to it):
- for Windows: "C:\Users\UU\AppData\Local\Enthought\Canopy\User\python.exe"
- for Mac: "/Users/UU/Library/Enthought/Canopy_BBbit/User/bin/python"
- choose "Select All" and then click "OK" at the bottom of the resulting dialog box
- click "OK" at the bottom of the Preferences Box and wait for the dialog box to close while the changes take effect (you do not need to restart Eclipse)
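Once the steps above are done, a short script along the following lines may help confirm which interpreter Eclipse is actually running. This is only a sketch: the function name is hypothetical, and the values printed depend entirely on your machine. For this class you would hope to see a 2.7 version, an executable path inside the Canopy folder, and a bit count matching the Java and Eclipse you installed.

```python
import struct
import sys


def interpreter_summary():
    """Collect basic facts about the Python interpreter running this script."""
    bits = struct.calcsize("P") * 8  # pointer size in bits: 32 or 64
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "bits": bits,
    }


summary = interpreter_summary()
for key in sorted(summary):
    print("%s: %s" % (key, summary[key]))
```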
Snarfing and running a sample Python program
Let's try snarfing and running your first Python program in Eclipse.
First, ensure that you have the right perspective in Eclipse. The Python perspective will give you a less cluttered set of windows with the smaller PyDev Package Explorer on the left and the main editor window on the right.
- Select "Window > Open Perspective > Other..."
- Select "PyDev", then hit OK.
- You should now see a "Python" box highlighted in the upper right corner of your window. If in the future, your screen setup looks odd, ensure you are in the Python perspective.
The Ambient plug-in allows you to browse for and download code online using a tool called Snarf. For each problem set, we will provide you with some code as a framework and possibly some data files, and Snarf will allow you to import these files into your local copy of Eclipse. To snarf your first program, follow the directions below.
- Snarf in the Snarfing Sample project
- Open Eclipse
- Select "Ambient > Download (Snarf) a Project..."
- This should open a new tab at the bottom called "Snarfer Site Browser". If it does not:
- Select "Window > Show View > Other..."
- Click "Ambient" then select "Snarfer Site Browser" and hit OK
- Right-click within the "Snarfer Site Browser" window, and select "New Site"
- In the window, type "http://www.cs.duke.edu/courses/fall13/compsci260/snarf/"
- Expand the project site "COMPSCI 260, Fall 2013" and then the "Samples" folder to find "Snarfing Sample (1.0)", and double click on it
- Click the "Install Project" button, and in the window that pops up, check the "use default workspace location" box, and click "Finish"
- The "Import project" window will come up; leave the fields unchanged (in particular, leave "Use the downloaded .project file" selected) and click "Finish"
- Expand the "Snarfing Sample" project in the "PyDev Package Explorer" pane on the left side, and double-click "python.intro.py" to open it up in the editor pane
- Try running the simple Python program that you snarfed
- Click on the "Run" icon on the toolbar (the green circle with the white triangle pointing right) to run the program; this should create a "Console" tab in the bottom right pane and the results of the program should be printed in it. If the console does not appear:
- Select "Window > Show View > Other..."
- Click "General" then select "Console" and hit OK
- Alternatively, select "Run" from the Run menu
- Alternatively, right-click anywhere within the body of the program to see the context menu and then click on "Run As > Python Run"
You can repeat step 2 (running the program) every time you edit and save the program.
For each assignment, we will provide a codebase for you to work from, which you will always be able to import by snarfing it into Eclipse.
You should also notice that simple Python documentation is available from within Eclipse: just hover over a Python keyword and a tooltip will pop up with a short description.
Editing a sample Python program and submitting a project
Now modify the program and then submit the code from within Eclipse.
- Modify the file "python.intro.py" to change the output in some way
- Save and re-run the program to confirm the output changed as expected
- Now test submitting your new program along with the other files in the project:
- Select "Ambient > Submit a Project for Grading..." to bring up the submit window
- First, you must choose the class and assignment folder you wish to submit to, so click on "compsci260" and select the "test" folder as your destination; then click "Next"
- Select "Submit a single project", choose the project you wish to submit (in this case "Snarfing Sample"), and then hit "Next"; alternatively, if you do not want to submit the entire project, you can choose "Submit from the file system" and then you'll have an option to select and deselect the various files in the project (or elsewhere)
- Once you've got the project and/or files that you want selected, choose "Finish"; you will be asked to enter your Duke NetID and password.
- Congratulations, you have submitted your project!
You can submit as many times as you like, and everything will be stored on the server each time. Thus, if you realize that you did something wrong at the last minute, you can simply resubmit and we'll have both. In general, we will only look at your last submission, so when you resubmit a project, please resubmit all the relevant files, not just the ones you modified.
All students are expected to abide by generally accepted standards of academic integrity. This includes all the various aspects of Duke's Community Standard. Violations of academic integrity will be taken very seriously. In particular, be reminded that it is not acceptable to take the ideas/work of another and pass it off as one's own, even if paraphrased. Ideas taken from others, whether peers in the class or not, must always be appropriately cited.
Unless the problem set expressly grants permission, all problems should be completed individually; no collaboration is permitted. However, if you have worked for a while on a particular problem and have encountered a mental wall, and if you have banged your head against said wall for a while, we provide mechanisms where you can consult others to make progress, rather than giving up entirely. Your first course of action is to post a question on Piazza, or to speak to the instructor or TAs. If for any reason you consult your peers outside of Piazza, it should remain understood that such an interaction must be one of consultation and not collaboration: hints to help overcome a small obstacle rather than answers. After consultation, you should still have plenty of thinking to do. In addition, if you consult with another student, both of you must cite this.
Students generally have two weeks to work on problem sets—not because two weeks are generally required to finish, but 1) to allow students who start early sufficient time to reflect/ruminate on problems where an impasse has been reached (the thought process through which students go while solving a problem often includes some gestation period before things become clear) and 2) to provide flexibility as to when students complete their work while they juggle other requirements and commitments during the semester.
Given this latter point, students should not request extensions for turning in their work beyond the two weeks already allotted. However, this rule has two exceptions:
- Everyone invariably has some two-week interval that is especially tough, so students are allowed, once during the semester, to use an extra 48 hours to turn in their work. If you are exercising this option for a specific problem set, please indicate such when you turn in your work. It is entirely up to you when you want to use this one free extension; when you do, you are trusted to not consult the solutions if they happen to be posted before you turn in your work.
- If you are ill for a non-trivial length of time, you may choose to submit a short-term illness notification to the deans; I am notified during this process, at which point we can work out a possible extension if one is necessary.
If you turn in work after the deadline but have already used your free extension, or if you are using your free extension but turn in work after the extension deadline itself, we will take off 10 points for every 12 hours late (rounded up). So if you're 0-12 hours late, that's -10; if you're 12-24 hours late, that's -20, etc.
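The late-penalty arithmetic can be sketched as a few lines of Python. This is an illustration, not the official grading code: the function name and parameters are hypothetical, and it assumes that a submission exactly 12 hours late falls in the first block, which the rule above leaves ambiguous. Hours are measured from whichever deadline applies to you (the original one, or the extension deadline if you used your free extension).

```python
import math


def late_penalty(hours_late, points_per_block=10, block_hours=12):
    """Points deducted for a submission that is hours_late past its deadline."""
    if hours_late <= 0:
        return 0
    # Round up to whole 12-hour blocks (assumption: exactly 12 hours is one block).
    blocks = int(math.ceil(hours_late / float(block_hours)))
    return points_per_block * blocks


for hours in (0, 0.5, 12, 12.5, 24, 30):
    print("%s hours late: penalty %d" % (hours, late_penalty(hours)))
```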
- Problem sets: 84%
- All students will be expected to complete seven problem sets over the semester, each contributing about equally to this component of the grade.
- Participation: 16%
- Students are expected to attend class regularly and participate in discussions. They are also expected to be engaged via Piazza, posting questions or notes, as well as helping each other as questions arise, or raising interesting points for further conversation. Students should feel comfortable asking questions at any point in class—whether the material is unclear, or simply if it leads you to wonder about a new connection. The instructor encourages an interactive classroom so if something is troubling or exciting you, do not hesitate to speak up about it.
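To see how the two components combine, here is a rough sketch of the weighting in Python. It rests on assumptions not stated above: that every score is on a 0-100 scale and that the seven problem sets count exactly equally (the policy says "about equally"), which would make each one worth 12% of the final grade. The function name and the sample scores are made up.

```python
PROBLEM_SET_WEIGHT = 0.84
PARTICIPATION_WEIGHT = 0.16


def course_score(problem_set_scores, participation_score):
    """Weighted course score; all inputs on an assumed 0-100 scale."""
    ps_average = sum(problem_set_scores) / float(len(problem_set_scores))
    return (PROBLEM_SET_WEIGHT * ps_average
            + PARTICIPATION_WEIGHT * participation_score)


# Made-up scores for seven problem sets, plus full participation credit.
example_scores = [90, 85, 100, 95, 80, 90, 90]
print(round(course_score(example_scores, 100), 2))
```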
Grades for all work will be recorded and available to students via the course Sakai site. Posting grades will be our only use of Sakai.
This term we will be using Piazza for course announcements, communication, and discussion. The Piazza system is designed to get you help quickly and efficiently from classmates, the TAs, and the instructor. Rather than emailing questions to the teaching staff, please post your questions on Piazza so everyone can benefit from the responses.
You can find our class posting page at: https://piazza.com/class#fall2013/compsci260.