Principal Investigator
|
Project Title
| Kernel Methods for Bioinformatics |
|---|
Brief Description for General Publications
Splice sites are locations in DNA at the boundaries of exons (which code for proteins) and introns (which do not). The more accurately splice sites can be located, the easier it becomes to identify the coding regions in DNA. Accurate splice site sensors thus form critical components of computational gene finders. Since ever larger chunks of DNA are to be analyzed by gene finders the problem of accurate splice site recognition has never been more important. We pose splice site recognition as a classification problem with the classifier learned from a labeled data set consisting of only local information around the potential splice site. Finding the correct position of splice sites without using global information is assumed to be a difficult task. We apply Support Vector Machines to the splice site recognition problem. More specifically, we analyze the genomes of C. elegans and of humans using custom designed support vector kernels. One of the kernels is adapted from our work on detecting translation initiation sites in vertebrates and another uses a Fisher-kernel like idea. Previous experiments showed that both approaches perform very well on this task. This work led to a conference paper (with talk) and was the topic of the master thesis by S\"oren Sonnenburg. Recently, we found that the standard two-class classification framework does not exactly match the splice site recognition problem. There seems to be biological evidence that the sites have varying biological activities and deciding whether a site is actually an \emph{active} splice site has to be decided locally by comparing it with the other candidate surrounding the site at hand. Hence, one is required to classify a true site only against a relative small number of decoy sites. This problem is related to ranking the sites in the right order (although there are only two possible ranks). We approach this problem with a method called ordinal regression. It tries to learn a a real valued function (related to the activity of a site), but different from the usual regression setting one is not provided with the correct outcomes for given site. Only information of the type: ``the output/activity for this example/site has to be larger than the output/activity for that example/site'' is used. This framework is capable of learning the ``biochemical activity'' of a site without having actually determined it in a experiment. Moreover, it allows one to deal with alternatively spliced genes. New efficient algorithms for solving the ordinal regression problem are to be developed. We started work on a perceptron-like algorithm, leading to an online ordinal regression algorithm (presented at NIPS Workshop in Vancouver in December 2002). However, this algorithm only approximates a solution of a very large quadratic optimization problem (QP). We plan to develop a new SMO-type algorithm and to extend a parallel interior point method that are able to solve the QP exactly even for large sample sizes. The latter one uses considerable more computing time but reaches a better accuracy. |