BioE 594 : Principles of Bioinformatics
Spring 2002
3:30-4:45p.m., Mondays and Wednesdays, 236 SEO

Instructor Name :Yang Dai


Office Location : SEO233
E-mail : yangdai@uic.edu



Office Hours:
Tuesdays 11:00am-12:00pm
Fridays 1:00pm-2:00pm




Reference Texts
Grading Policy
Covered Topics
Schedule
Homework
Project
 


Inportant Notice
Reference Texts
  • Biological Sequence Analysis (Probabilistic models of proteins and nucleic acids) by R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Cambridge University Press, 1998.
  • Pattern Classification,  2ed Edition by Richard O. Duda, PerterE. Hart, David G. Stork, Wiley, 2001.
  • Machine Learning, byTom M. Mitchell, MCB, McGraw-Hill, 1997.
  A new book published in Year 2001, Good for Biologist to read
  • Bioinformatics: Sequence and Genome Analysis, by David W. Mount, Cold Spring Harbor Laboratory Press, 2001.
       An online book, Good for computer scientist and mathematician to read


Grads  homework 70%,  project 30%



Covered Topics

1. Probability, statistics and information background, Probability distribution, Sampling, Estimation of probabilities from counts, Entropy, Inference, Maximum likelihood estimation, Bayesian    parameter estimation, Markov chains, Hidden Markov Model

2.  Sequence alignment methods

a. Concepts (sequence similarity, homology, alignment, scoring  model)
b. Alignment algorithms and applications
c. dynamic programming
d. heuristic alignment algorithms
e. pairwise alignment using HMM
f. multidimensional dynamic programming
g. multiple alignment by profile HMM training
3. Optimization Methods

        a. greedy algorithms
        b. genetic algorithm
        c. simulated annealing
        d. linear programming and quadratic programming
        e. integer and combinatorial optimization
        f. network optimization

 
4 Supervised and Unsupervised Learning and applications

        a. regression (linear and nonlinear)
        b. classification via regression
        c. multi-layer neural networks, back-propagation
        d. feature selection, feature induction
        e. support vector machines, kernels
        f. clustering

i.   sequential,
ii. hierarchical,
iii. graph theory based,
iv. cost optimized (GA, SA)
v. Self-Organizing Maps


Schedule updated  01/14/02

1. January 7, M    Introduction of the course, DP
2. January 9, W   pairwise sequence alignment
3. January 14, M  local alignment, repeat alignment, overlap alignment
4. January 16, W  probability
           January 21, M   Martin Luther King Jr. Holiday, no class
5. January 23, W entropy, inference, estimation of probability
6.January 28, M  scoring model, significant of the score
7.January 30, W  Markov chain and Hidden Markov Models-1
8.February 4, M   Markov chain and Hidden Markov Models-1
9.February 6, W   pairwise alignment using HMMs
10. February 11, M profile HMMs for sequence families
11. February 13, W multiple sequence alignment-1
12. February 18, M  multiple sequence alignment-2
            Optimization Methods
13. February 20, W  optimization problems, Linear and Nonlinear Programming methods
14. February 25 M    greedy and Simulated Annealing?
15. February 27, W   genetic algorithm
           Classification Methods
16. March 4, M      supervised learning, linear, nonlinear regression
17. March 6, W     classification via regression
18. March 11, M    neural networks1
19. March 13, W    neural networks 2
           March 18 (Spring Vacation)
20. March 25, M   feature selection
21. March 27, W   support vector machine1
22. April 1, M       support vector machine2
           Clustering Methods
23. April 3, W     sequential
24. April 8, M      hierarchical
25. April 10, W       graph theory based method
26. April 15, M    const optimized (genetic algorithm and simulated annealing)
27. April 17, W    self-organized maps
           Final Presentations
28. April 22, M
29. April 24, W
30. April 29, M
31. May 1, W
             May 6-10 Final Examination
NO EXAM FOR THIS COURSE!



Homework

Homework 1/07/02
    (1)Read the first chapter (Molecular Biology for Computer Scientists) of the online book : Artificial Intelligence and Molecular Biology; Edited by Lawrence Hunter, http://www.aaai.org/Library/Books/Hunter/hunter.html

Homework  01/09//02, due 1/16/02
    (1) p1-2, Dai's note (4 points)
    (2) Exercise 2.9, p22, DEKM (3 points)

Homework  1/14/02, due 1/16/02
    (1) Read p24-p35, DEKM
    (2) Write a pseudocode for computing the repeated match. (3 points)
    (3) Visit this URL http://www.ncbi.nlm.nih.gov/BLAST/ to see what is available for sequence alignment.

Homework 1/23/02, due 1/30/02

      (1)Read p299-p312, DEKM
      (2)Exercise 11.1, P300, DEKM (2 points)
      (3)Prove that mean and variance of the continuous variable X with Normal Distribution density function are 0 and 1, respectively.
           (3 points)
(4)Given a block of aligned amino acids
    B A B A
    A A A C
    A A C C
    A A B A
    A A C C
(i) calculate the frequencies of A,B,C appeared  in the block.
(ii) calculate the frequencies  of A to A, A to B and A to C appeared in  ALL possible ungapped alignments made from the block.  Note
     the alignment  B A B A
                             A A A C
     and the alignment
                             A A A C
                             B A B A
     have to counted.
(iii) If we generate two sequences of the same length 4 at random with the frequencies obtained from (i), and put them into an alignment. What are frequencies that A is aligned to A , A to B, and A to C?
(iv) calculate the log ratios of the frequencies obtained from (ii) and (iii) for each pair A to A, A to B, and A to C. Use 2 for logarithm.
(5 points)

Homework 1/30/02
    Read p46-79, DEKM.
 

Homework 2/04/02
   Read one of the following papers.
(1) A. Krogh, M. Brown, S. Mian and K. Sjölander and D. Haussler, "Hidden Markov Models in Computational Biology: Applications to Protein Modeling," Journal Mol. Biology, 235:1501--1531, February 1994.
(2) A. Krogh, I. S. Mian, and D. Haussler, "A hidden Markov model that finds genes in E. coli DNA", Nucleic Acids Research, 22:4768-4778, 1994.

If you are interested in the implementation aspect of the HMM the following paper may be helpful.
R. Hughey and A. Krogh, "Hidden Markov models for sequence analysis: extension and analysis of the basic method", CABIOS, 12:95-107, 1996.

Homework 2/13/02, due 2/20/02

Write a summary of the above Hughey and Krogh paper. If you are interested in more about HMMs and the applications visit
http://www.soe.ucsc.edu/research/compbio/sam.html
http://pfam.wustl.edu/

Homework 2/26/02, due 3/8/02
Read the paper "Robust linear programming  discrimination of two linearly inseparable sets", (by K. P. Bennett & O. L. Mangasarian), Optimization Methods   and Software 1, 1992, 23-34.
Testing their model (2.11) on the data set ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin/wdbc.data.
Set aside randomly 10% of points from benign and  malignant groups,  respectively for model building. Set another 10% of points from benign and  malignant groups, respectively for testing.
Output w, gamma, y, z and report the number of correctly classified points for each group.

As for  LP solver, you can use MATLAB optimization toolbox or you can download LINDO free trial version from the following URL.

http://www.lindo.com/cgi/frameset.cgi?leftdwnld.html;downloadf.html

Homework 3/13/02, due 3/25/02
Use the LP formulation and the same data provided in the previous assignment. Pick up three features by using one of the feature selection schemes. Then form the LP problem  with the three features. Solve the LP problem and compare the result with the one from the previous assignment.

Projects

  • Choose one project from the following list.  You may pair with another person to finish the project.
  • Prepare a 15-minute talk about your project. Submit the materials (source code must be in electronical form).
  • Please sign-up the sheet on my door for the date of your presentation.


1  Implement the Feng-Doolittle progressive multiple alignment algorithm on pp145, DEKM
    The clustering algorithm in step (ii) can be replaced by UPGMA on pp166, DEKM. Test on a  given of unaligned sequences,  and output the result of the multiple alignment.

2  Implement the multiple alignment algorithm based on the HMM profile, p152, DEKM. A good reference is http://www.soe.ucsc.edu/research/compbio/papers/sam_doc/sam_doc.html
    Test on a  given of unaligned sequences,  and output the result of the multiple alignment.

3. Implement the simulated annealing algorithm proposed in  "Probe selection algorithms with applications in the analysis of microbial communities" (J. Borneman et al., Bioinformatics 2001 17: S39-S48) for the  probe selection problem (p.42) .  Test on a given set of  sequences and a given set of probe. Report the result.

4. Implement the multilayer ( one hidden layer) perceptron  neural network algorithm.  Use 90% of wpbc.data from each class to learn the network and remaining 10% for the testing.  ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin
(1) Tune the parameters, such as number of hidden unites, to achieve better performance. Test both radial basis function and logistic threshold function. You may also perform some feature selection techniques.
(2) report the number of correctly classified points for each group and  compare with the result from LP model (that means that you have to test on LP model too..)

5. Visit http://discover.nci.nih.gov/nature2000/naturemain.html and download gene expression data (T-matrix), tab delimited (1375 genes x 60 cells). Implement a clustering algorithm (any clustering algorithms mentioned in the class). Compare your result with the one presented in the paper "A Gene Expression Database for the Molecular Pharmacology of Cancer  Nature Genetics, volume 24, no 3  March 1st, 2000 pp:236-244" (available from the same URL).

6. Built support vector machines to train the data in problem 4. Use 90%  of wpbc.data from each class for training and remaining 10% for the testing. You should test on different kernels (linear, polynomial and Gaussian). You may use the quadratic optimization solver in Matlab or download  other free software from http://www.support-vector.net/software.html

7. You may also select other bioinformatics topics. The conditions are (1) you have to use the methods based on or related to the algorithms taughted in the class, and (2) you have to implement your own algorithm.