Available Student Projects 2012
Introduction
In this document, I briefly describe potential student projects. Depending on circumstances, these projects may constitute a bachelor's or master's thesis, a summer internship, or an honours project. Note that this list is not exhaustive; I am happy to discuss further project opportunities in person.
What you bring:
- High interest and motivation for computational biology research
- Willingness to spend the equivalent of at least 3 months full-time on the project
What you get:
- A stimulating research environment
- Strong mentorship
- A research project that has the potential to result in a publication
All internships take place at EMBL-EBI near Cambridge (UK). If you are interested, or have any questions, please send me an email at dessimoz 'at' ebi 'dot' ac 'dot' uk.
Projects
OMA domains [bio]
OMA is a well-established database identifying orthologs among complete genomes. Until now, our basic evolutionary unit has been the entire gene. This works well to investigate evolutionary forces that act on entire genes (gene duplication, speciation, lateral transfers, etc.), but cannot describe an important source of functional innovation: gene fusions and fissions. For that, we need to study evolution at a finer granularity than genes (down to protein domains, or at the extreme even single base pairs). We have developed a pipeline that infers the domain architecture of all genes in OMA based on HMM profiles from the Pfam database. The goal of this project is to analyse this data, first within a subset of species, then possibly among all 1,000+ genomes in OMA.
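To make the idea of domain-level analysis concrete, here is a minimal sketch of how candidate gene fusions might be flagged once each gene has a domain architecture. It is an illustration only, not the OMA pipeline: the function name, the input format (a dictionary from gene identifier to an ordered tuple of domain labels), and the matching rule (exact concatenation of two known architectures) are all assumptions made for this example.

```python
def candidate_fusions(architectures):
    """Flag genes whose domain architecture splits into two known architectures.

    `architectures` maps a gene id to a tuple of domain labels in sequence
    order (hypothetical input format, assumed for this sketch).
    Returns a list of (gene, cut_position, left_genes, right_genes).
    """
    # Index genes by their complete architecture.
    known = {}
    for gene, arch in architectures.items():
        known.setdefault(arch, []).append(gene)

    hits = []
    for gene, arch in architectures.items():
        # Try every way of cutting the architecture into two non-empty parts.
        for cut in range(1, len(arch)):
            left, right = arch[:cut], arch[cut:]
            if left in known and right in known:
                hits.append((gene, cut, known[left], known[right]))
    return hits


# Toy data with made-up domain labels.
example = {
    "g1": ("domA",),
    "g2": ("domB", "domC"),
    "g3": ("domA", "domB", "domC"),
}
fusions = candidate_fusions(example)
```

In the toy data, g3 is flagged because its architecture is the architecture of g1 followed by that of g2. A real analysis would also need to account for orthology, species phylogeny, and imperfect domain predictions, none of which this sketch attempts.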
References:
Altenhoff, AM, Schneider, A, Gonnet, GH, et al. OMA 2011: orthology inference
among 1000 complete genomes. Nucl Acids Res 2011; 39:D289-94
Sjölander, K, Datta, RS, Shen, Y, et al. Ortholog identification in the
presence of domain architecture rearrangement. Brief Bioinform 2011;
12:413-422
Orthology among 10,000 genomes [algo]
OMA is a well-established database identifying orthologs among complete genomes. After 6 years of non-stop computations, we now have orthology predictions among 1,000 full genomes. However, additional genomes are being sequenced faster than ever, and it is unlikely that our current pipeline will be able to handle more than a few thousand genomes. We have some ideas of how to significantly speed up orthology prediction. The purpose of this project would be to test and possibly extend these ideas.
References:
Altenhoff, AM, Schneider, A, Gonnet, GH, et al. OMA 2011: orthology inference
among 1000 complete genomes. Nucl Acids Res 2011; 39:D289-94
Kristensen, DM, Wolf, YI, Mushegian, AR, et al. Computational methods for Gene
Orthology inference. Brief Bioinform 2011; 12:379-391
Extending the Robinson-Foulds metric to labelled topologies [math]
The Robinson-Foulds (RF) distance is a well-established distance measure between tree topologies. It has desirable properties: it is efficient to compute, and it is a proper metric. In the context of gene/species tree reconciliation, however, the RF metric ignores a crucial aspect of the trees: the type of each branching event (speciation, duplication, etc.). Thus, the goal of this project is to extend the Robinson-Foulds measure to capture differences in the type of node (i.e. labels) while keeping the desirable properties of RF. The project does not start from scratch (we already have definitions, proof sketches, and code), but there is enough work left to guarantee a fulfilling internship.
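For orientation, the following is a minimal sketch of the standard (unlabelled) RF distance in its rooted, clade-based form: the number of clades found in one tree but not the other. The nested-tuple tree representation and the function names are assumptions made for this example, and the sketch deliberately ignores node labels, which is exactly the limitation the project addresses. It also assumes both trees have the same leaf set.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaves.

    A tree is a nested tuple of children; a leaf is a string
    (hypothetical representation, assumed for this sketch).
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    # The root clade contains every leaf and carries no topological information.
    found.discard(all_leaves)
    return found


def rf_distance(tree_a, tree_b):
    """Size of the symmetric difference between the two clade sets."""
    return len(clades(tree_a) ^ clades(tree_b))


t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
t3 = (("A", "B"), ("C", "D"))
```

Here `rf_distance(t1, t2)` is 2, because the clade {A, B} occurs only in t1 and {A, C} only in t2. Two trees with identical topology but different speciation/duplication labels on their internal nodes would still have distance 0 under this definition.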
References:
Robinson, DF, Foulds, LR. Comparison of Phylogenetic Trees. Mathematical
Biosciences 1981; 53:131-147
DLIGHT -- Detecting lateral gene transfer [code] [bio] (taken)
A few years ago, we developed an algorithm to identify lateral gene transfers. Despite promising preliminary results, our attention was soon diverted to other, more pressing problems. Meanwhile, DLIGHT ('Distance Likelihood-based Inference of Genes Horizontally Transferred') has mostly languished in neglect. But this is about to change, because of the confluence of three circumstances: (1) preliminary work from a recent (unpublished) student project suggests that DLIGHT is highly competitive and thus worth pursuing; (2) we have evidence that lateral gene transfer is highly disruptive to orthology prediction (an area in which we have some interest), so identifying lateral transfer is important; (3) you, dear prospective intern, could help us finish the comparative study of DLIGHT with other state-of-the-art tools, and help us deploy DLIGHT on the 8000+ core EBI cluster to identify laterally transferred genes among the hundreds of bacterial genomes in the OMA database.
References:
Dessimoz, C, Margadant, D, Gonnet, GH. DLIGHT - Lateral Gene Transfer
Detection Using Pairwise Evolutionary Distances in a Statistical Framework.
RECOMB 2008, Lect Notes Comput Sc, Springer 2008; 4955:315-330
Dalquen, DA, Anisimova, M, Gonnet, GH, et al. ALF--A Simulation Framework for
Genome Evolution. Mol Biol Evol 2011
Visualisation of a forest of trees [code] [math]
Recent years have witnessed the development of numerous software packages for visualising phylogenetic trees. But the problem becomes more complicated when more than one tree topology has to be considered at the same time. One solution is to combine the trees into networks (Huson & Bryant 2006). Another is to draw all trees superimposed (Bouckaert 2010). The goal of this project is to consider these and other solutions (we have some ideas!) to visualise forests in an insightful manner, ideally one that fosters interactive exploration of the data.
References:
Bouckaert, RR. DensiTree: making sense of sets of phylogenetic trees.
Bioinformatics 2010; 26:1372-1373
Huson, DH, Bryant, D. Application of phylogenetic networks in evolutionary
studies. Mol Biol Evol 2006; 23:254-267
http://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software
Vectorization of tree inference code [code]
To infer the tree of life, we have developed a new phylogenetic tree building method. Because the system must run on millions of protein sequences, it is essential that the computation is efficient. The core is currently written in C/C++, but is not optimal. The goal of this project is to improve runtime by optimising the code, in particular by vectorising crucial parts using SIMD instructions. While previous experience with SIMD is desirable, a lack thereof can certainly be compensated for by a willingness to learn.
References:
Szalkowski, A, Ledergerber, C, Krähenbühl, P, et al. SWPS3 - fast
multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2. BMC
Res Notes 2008; 1:107

