Office Location : SEO233
E-mail : yangdai@uic.edu
Office Hours:
Tuesdays 11:00am-12:00pm
Fridays 1:00pm-2:00pm
Reference Texts
Grading Policy
Covered Topics
Schedule
Homework
Project
Grads
homework 70%, project 30%
1. Probability, statistics and information background, Probability distribution, Sampling, Estimation of probabilities from counts, Entropy, Inference, Maximum likelihood estimation, Bayesian parameter estimation, Markov chains, Hidden Markov Model
2. Sequence alignment methods
a. Concepts (sequence similarity, homology, alignment, scoring model)b. Alignment algorithms and applications
c. dynamic programming
d. heuristic alignment algorithms
e. pairwise alignment using HMM
f. multidimensional dynamic programming
g. multiple alignment by profile HMM training
a. greedy algorithms
b. genetic algorithm
c. simulated annealing
d. linear programming and quadratic programming
e. integer and combinatorial optimization
f. network optimization
a. regression (linear and nonlinear)
b. classification via regression
c. multi-layer neural networks, back-propagation
d. feature selection, feature induction
e. support vector machines, kernels
f. clustering
i. sequential,
ii. hierarchical,
iii. graph theory based,
iv. cost optimized (GA, SA)
v. Self-Organizing Maps
1. January 7, M
Introduction of the course, DP
2. January 9, W pairwise
sequence alignment
3. January 14, M local alignment,
repeat alignment, overlap alignment
4. January 16, W probability
January 21, M Martin Luther King Jr. Holiday, no class
5. January 23, W entropy, inference,
estimation of probability
6.January 28, M scoring model,
significant of the score
7.January 30, W Markov chain
and Hidden Markov Models-1
8.February 4, M Markov
chain and Hidden Markov Models-1
9.February 6, W pairwise
alignment using HMMs
10. February 11, M profile HMMs
for sequence families
11. February 13, W multiple sequence
alignment-1
12. February 18, M multiple
sequence alignment-2
Optimization Methods
13. February 20, W optimization
problems, Linear and Nonlinear Programming methods
14. February 25 M
greedy and Simulated Annealing?
15. February 27, W genetic
algorithm
Classification Methods
16. March 4, M
supervised learning, linear, nonlinear regression
17. March 6, W
classification via regression
18. March 11, M
neural networks1
19. March 13, W
neural networks 2
March 18 (Spring Vacation)
20. March 25, M feature
selection
21. March 27, W support
vector machine1
22. April 1, M
support vector machine2
Clustering Methods
23. April 3, W
sequential
24. April 8, M
hierarchical
25. April 10, W
graph theory based method
26. April 15, M
const optimized (genetic algorithm and simulated annealing)
27. April 17, W
self-organized maps
Final Presentations
28. April 22, M
29. April 24, W
30. April 29, M
31. May 1, W
May 6-10 Final Examination
NO EXAM FOR THIS COURSE!
Homework 1/07/02
(1)Read the first
chapter (Molecular Biology for Computer Scientists) of the online book
: Artificial Intelligence and Molecular Biology; Edited by Lawrence Hunter,
http://www.aaai.org/Library/Books/Hunter/hunter.html
Homework 01/09//02, due
1/16/02
(1) p1-2, Dai's
note (4 points)
(2) Exercise
2.9, p22, DEKM (3 points)
Homework 1/14/02, due 1/16/02
(1) Read p24-p35,
DEKM
(2) Write a pseudocode
for computing the repeated match. (3 points)
(3) Visit this
URL http://www.ncbi.nlm.nih.gov/BLAST/
to see what is available for sequence alignment.
Homework 1/23/02, due 1/30/02
Homework 1/30/02
Read p46-79,
DEKM.
Homework 2/04/02
Read one of
the following papers.
(1) A. Krogh, M. Brown, S. Mian
and K. Sjölander and D. Haussler, "Hidden
Markov Models in Computational Biology: Applications to Protein Modeling,"
Journal Mol. Biology, 235:1501--1531, February 1994.
(2) A. Krogh, I. S. Mian, and D.
Haussler, "A
hidden Markov model that finds genes in E. coli DNA", Nucleic Acids
Research, 22:4768-4778, 1994.
If you are interested in the implementation aspect of the HMM the following
paper may be helpful.
R. Hughey and A. Krogh, "Hidden
Markov models for sequence analysis: extension and analysis of the basic
method", CABIOS, 12:95-107, 1996.
Homework 2/13/02, due 2/20/02
Write a summary of the above Hughey
and Krogh paper. If you are interested in more about HMMs and the applications
visit
http://www.soe.ucsc.edu/research/compbio/sam.html
http://pfam.wustl.edu/
Homework 2/26/02, due 3/8/02
Read the paper "Robust linear
programming discrimination of two linearly inseparable sets", (by
K. P. Bennett & O. L. Mangasarian), Optimization Methods
and Software 1, 1992, 23-34.
Testing their model (2.11) on the
data set ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin/wdbc.data.
Set aside randomly 10% of points
from benign and malignant groups, respectively for model building.
Set another 10% of points from benign and malignant groups, respectively
for testing.
Output w, gamma, y, z and report
the number of correctly classified points for each group.
As for LP solver, you can use MATLAB optimization toolbox or you can download LINDO free trial version from the following URL.
http://www.lindo.com/cgi/frameset.cgi?leftdwnld.html;downloadf.html
Homework 3/13/02, due 3/25/02
Use the LP formulation and the same
data provided in the previous assignment. Pick up three features by using
one of the feature selection schemes. Then form the LP problem with
the three features. Solve the LP problem and compare the result with the
one from the previous assignment.
1 Implement the Feng-Doolittle progressive multiple alignment
algorithm on pp145, DEKM
The clustering algorithm in step (ii) can be replaced
by UPGMA on pp166, DEKM. Test on a given of unaligned sequences,
and output the result of the multiple alignment.
2 Implement the multiple alignment algorithm based on the HMM
profile, p152, DEKM. A good reference is http://www.soe.ucsc.edu/research/compbio/papers/sam_doc/sam_doc.html
Test on a given of unaligned sequences,
and output the result of the multiple alignment.
3. Implement the simulated annealing algorithm proposed in "Probe selection algorithms with applications in the analysis of microbial communities" (J. Borneman et al., Bioinformatics 2001 17: S39-S48) for the probe selection problem (p.42) . Test on a given set of sequences and a given set of probe. Report the result.
4. Implement the multilayer ( one
hidden layer) perceptron neural network algorithm. Use 90%
of wpbc.data from each class to learn the network and remaining 10% for
the testing. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin
(1) Tune the parameters, such as
number of hidden unites, to achieve better performance. Test both radial
basis function and logistic threshold function. You may also perform some
feature selection techniques.
(2) report the number of correctly
classified points for each group and compare with the result from
LP model (that means that you have to test on LP model too..)
5. Visit http://discover.nci.nih.gov/nature2000/naturemain.html and download gene expression data (T-matrix), tab delimited (1375 genes x 60 cells). Implement a clustering algorithm (any clustering algorithms mentioned in the class). Compare your result with the one presented in the paper "A Gene Expression Database for the Molecular Pharmacology of Cancer Nature Genetics, volume 24, no 3 March 1st, 2000 pp:236-244" (available from the same URL).
6. Built support vector machines to train the data in problem 4. Use 90% of wpbc.data from each class for training and remaining 10% for the testing. You should test on different kernels (linear, polynomial and Gaussian). You may use the quadratic optimization solver in Matlab or download other free software from http://www.support-vector.net/software.html
7. You may also select other bioinformatics topics. The conditions are (1) you have to use the methods based on or related to the algorithms taughted in the class, and (2) you have to implement your own algorithm.