Genome Informatics Overview
Project Directors
Volker Brendel (ISU) - Genetics, Development and Cell Biology
Brook Milligan (NMSU) - Biology
Summary. Genome informatics is a foundation discipline for the CMB training group and for systems-wide approaches to understanding relevant biological problems. ISU CMB faculty are developing algorithms to tackle a variety of analytical tasks, including sequence assembly, comparative genomics, phylogeny reconstruction, gene expression analysis, metabolic network inference, and protein structure determination. New machine learning algorithms are being developed that provide predictive information for refining analytic approaches, and bioinformatics analysis is being accelerated through high performance computing. Researchers from NMSU are engaged in synergistic projects, including the design of domain-specific languages for evolutionary biology and the development of computational methods to aid in understanding gene regulatory mechanisms. Although anchored in DNA sequence analysis, genome informatics has direct bearing on gene expression and its phenotypic outcomes, and therefore overlaps with the other two areas of research emphasis.
Bioinformatics databases and tools. The research conducted by ISU/NMSU scientists encompasses a wide spectrum of activities, from the generation of the biological data itself to the development of analytical tools. Because the research is this diverse, CMB trainees will develop a broad perspective on the acquisition, handling and analysis of various forms of biological data.
- Generation of biological data. ISU/NMSU researchers are generating plant, animal and microbe genome sequence data as well as RNA and protein expression data. For example, in an NSF-funded collaborative project involving NMSU and ISU researchers (Desh Ranjan, Mary O’Connell, Enrico Pontelli - NMSU and Srinivas Aluru - ISU), work is being carried out to sequence upstream regions of genes in non-model organisms and to develop bioinformatics methods to predict promoter sequences and regulatory cis-elements. The computational tools will be tested in several NMSU laboratories working with non-model organisms – peppers (O’Connell), alfalfa (Champa Sengupta-Gopalan - NMSU), Xenopus (Elba E. Serrano - NMSU) and mycorrhizal fungi (Peter Lammers - NMSU).
- Development of techniques for data storage and analysis. ISU researchers Srinivas Aluru and Patrick Schnable are developing efficient tree-based storage structures and query algorithms to deliver fast query response times on large-scale databases.
- Database hosting. ISU/NMSU researchers host a number of targeted databases as well as comprehensive repositories that serve a global scientific community. Examples of comprehensive repositories include PlantGDB, an NSF-funded project carried out by Volker Brendel to provide plant sequence data, analysis, annotation and visualization, and BarleyBase, a USDA-funded project hosted by Roger Wise (ISU) and others to provide cereal microarray data. Targeted repositories disseminate maize genome assembly data (Schnable and Aluru), promoter data (O’Connell), and nucleotide sequences of plant viral genomes (Leslie Miller - ISU).
- Development of software for data analysis. Many widely used bioinformatics software tools originated at ISU, and tool development continues to be an active area of research. Examples of software include the DNA sequence trimming program Lucy2 (Hui-Hsien Chou - ISU), the sequence assembly programs CAP3 (Xiaoqiu Huang - ISU) and PCAP (Huang and Aluru), the EST clustering program PaCE (Aluru), and the spliced alignment-based gene structure prediction program GeneSeqer (Brendel).
- Machine learning tools for bioinformatics. CMB faculty are developing novel machine learning algorithms for data-driven discovery of a priori unknown, potentially biologically meaningful relationships (Vasant Honavar, Drena Dobbs and Robert Jernigan - ISU). The resulting tools are being successfully applied to a broad range of data-driven knowledge acquisition tasks in computational biology, including construction of classifiers for assigning protein sequences to structural or functional families, prediction of putative binding sites in proteins from sequence information and analysis of gene expression patterns.
Parallel methods for large scale problems. ISU researcher Aluru and his collaborators are developing parallel algorithms and software systems for problems that require unacceptably long run times on serial machines or more memory than standard workstations can provide. Some needs can be met by trivial parallelization strategies, such as running multiple instances of the same program on different processors; these researchers are developing parallelization strategies for problems that do not lend themselves to such easy solutions. These efforts have made it possible to analyze large-scale sequence data such as ESTs and genome survey sequences (GSS): millions of GSS sequences can now be assembled in a matter of hours instead of days. Other efforts use rigorous dynamic programming-based techniques to perform comparative genomic studies of large syntenic regions across species.
As biological data continue to accumulate, this area of research is expected to grow in importance, much as high performance computing has become central to solving problems in other scientific disciplines.
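The "trivial" strategy mentioned above, running the same program independently on different processors, can be sketched as follows. This is a minimal illustration only, not the ISU software: the scoring values, the function names, and the example sequences are all assumptions, and the dynamic programming kernel is a textbook global alignment score (Needleman-Wunsch) standing in for whatever per-task computation is being repeated.

```python
from concurrent.futures import ProcessPoolExecutor

# Illustrative scoring values (assumptions, not taken from the source).
MATCH, MISMATCH, GAP = 1, -1, -2


def alignment_score(pair):
    """Global alignment score for one pair of sequences (textbook DP)."""
    a, b = pair
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * GAP]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            curr.append(max(diag, prev[j] + GAP, curr[j - 1] + GAP))
        prev = curr
    return prev[-1]


def score_pairs_serial(pairs):
    """Score every pair one after another on a single processor."""
    return [alignment_score(p) for p in pairs]


def score_pairs_parallel(pairs, workers=2):
    """Score pairs in separate processes; no task depends on any other."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(alignment_score, pairs))


if __name__ == "__main__":
    # Hypothetical input sequences.
    pairs = [("GATTACA", "GATTACA"), ("GATTACA", "GCATGCU"), ("ACGT", "AGT")]
    print(score_pairs_parallel(pairs))
```

Because each pair is scored without reference to any other, the parallel and serial versions return the same list. Problems such as assembly, where tasks share state, do not split this cleanly, which is the harder case the paragraph above describes.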
Information integration. Data-driven exploration of biological questions requires the ability to combine and analyze diverse types of information (e.g., protein sequences, protein structures, structural features of proteins, taxonomies that group proteins into functional families, protein-protein interactions, gene expression data). There are over 500 databases of interest to molecular biologists. Many of these databases are large, dynamic, geographically distributed, and autonomously managed. Therefore, it is neither desirable nor feasible to gather all of the data in a centralized location for analysis. More importantly, because data sources that are created for use in one context often find use in other contexts or applications, semantic differences among autonomously designed, owned, and operated data repositories are unavoidable. Effective use of multiple sources of data in a given context requires reconciling such semantic differences from the user’s point of view.
To facilitate rapid and flexible assembly of data sets derived from multiple sources, ISU researchers (Honavar, Dobbs, Jernigan) are developing INDUS (Intelligent Data Understanding System) – an information integration software toolkit. INDUS includes modules for ontology editing, inter-ontology mapping, and distributed query optimization.
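The idea of reconciling semantic differences from the user's point of view, without centralizing the data, can be sketched as follows. This is a toy illustration and not the INDUS API: the two sources, their field names, the vocabulary terms, and the function names are all hypothetical.

```python
# Two autonomous sources describing the same kind of fact with
# different field names and vocabularies (hypothetical data).
SOURCE_A = [{"acc": "P1", "family": "kinase"},
            {"acc": "P2", "family": "protease"}]
SOURCE_B = [{"id": "P3", "class": "Ser/Thr kinase"},
            {"id": "P4", "class": "peptidase"}]

# User-supplied mappings from each source's local terms to the
# user's own ontology (hypothetical terms).
MAPPINGS = {
    "A": {"id_field": "acc", "term_field": "family",
          "terms": {"kinase": "Kinase", "protease": "Peptidase"}},
    "B": {"id_field": "id", "term_field": "class",
          "terms": {"Ser/Thr kinase": "Kinase", "peptidase": "Peptidase"}},
}


def query_source(records, mapping, user_term):
    """Answer a query phrased in the user's ontology against one source."""
    # Translate the user's term into the source's local vocabulary.
    local_terms = {local for local, mapped in mapping["terms"].items()
                   if mapped == user_term}
    return [r[mapping["id_field"]] for r in records
            if r[mapping["term_field"]] in local_terms]


def federated_query(user_term):
    """Query each source in place and merge only the answers."""
    sources = {"A": SOURCE_A, "B": SOURCE_B}
    hits = []
    for name, records in sources.items():
        hits.extend(query_source(records, MAPPINGS[name], user_term))
    return sorted(hits)
```

In this sketch, `federated_query("Kinase")` collects matching identifiers from both sources even though neither source uses the term "Kinase" itself. A different user could supply different mappings over the same unchanged sources.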
Comparative genomics and phylogenetic analysis. CMB faculty are analyzing genomic sequence data to understand relationships between organisms and how sequences evolve over time. ISU's David Fernandez-Baca and Oliver Eulenstein are carrying out work under the NSF Tree of Life program to develop supertree-based phylogenetic methods, which assemble larger phylogenetic trees from smaller trees built on overlapping subsets of the taxa. In addition to helping synthesize hypotheses of relationships among larger sets of taxa, supertrees suggest optimal strategies for taxon sampling, reveal emerging patterns in the knowledge base of known phylogenies, and provide useful tools for comparative biologists. NMSU counterparts (Pontelli, Ranjan, Milligan) are developing a computational workbench for evolutionary biology, with the goal of facilitating efficient development and execution of phylogenetic inference applications. The framework is based on their creation of a novel domain-specific language, Phi/Log, which will provide scientists with data types and operations to directly describe general models of molecular and morphological evolution and to evaluate them in the context of phylogenetic trees and genealogies.
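To make the supertree idea concrete, the sketch below encodes a set of source trees as a matrix using matrix representation with parsimony (MRP, Baum-Ragan coding), one widely used supertree approach. It is named here only as an example and is not necessarily the method developed in the project above. Trees are written as nested tuples of taxon names; that representation, the function names, and the example trees are assumptions made for illustration.

```python
def leaf_sets(tree, clades):
    """Return the leaves under `tree`, recording each internal node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, clades) for child in tree))
    clades.append(leaves)
    return leaves


def mrp_matrix(source_trees):
    """Build one character column per non-root clade of every source tree."""
    all_taxa = set()
    columns = []  # (clade, taxa present in that source tree)
    for tree in source_trees:
        clades = []
        taxa = leaf_sets(tree, clades)
        all_taxa |= taxa
        # The root clade contains every taxon in the tree, so it is skipped.
        columns.extend((clade, taxa) for clade in clades if clade != taxa)
    matrix = {}
    for taxon in sorted(all_taxa):
        row = ""
        for clade, taxa in columns:
            # 1: in the clade; 0: in the tree but outside the clade;
            # ?: taxon missing from that source tree.
            row += "1" if taxon in clade else "0" if taxon in taxa else "?"
        matrix[taxon] = row
    return matrix


# Two hypothetical source trees on overlapping taxon subsets.
trees = [(("A", "B"), "C"), (("B", "C"), "D")]
print(mrp_matrix(trees))
```

The sketch stops at the matrix. In MRP the matrix is then analyzed with a parsimony search to obtain a tree on the full taxon set, a step that is not shown here.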
ISU CMB researchers also direct a number of projects that integrate comparative genomics, population genetics and evolutionary biology. A theme underlying several of these projects is the mechanisms and consequences of DNA sequence duplication. The Voytas lab maps mobile genetic element insertions in model organism genomes to discern non-random distribution patterns. These mobile elements are then studied in the wet lab to dissect mechanisms of integration specificity, which underlie the organization of repetitive DNAs in most eukaryotes. The Wendel lab uses a variety of genomics approaches to unravel the relationship between morphological and developmental change and molecular evolutionary processes, including whole genome duplications. As a model, they are trying to understand the genetic basis for the developmental transformations that occurred during cotton fiber evolution. The Gu lab develops statistical approaches to better understand functional innovation, specification and divergence during gene family evolution. Recent work has focused on integrating gene family phylogenies with gene expression data to understand patterns of gene expression evolution. This approach has been applied to explore the hypothesis that humans and chimpanzees differ in mental and linguistic capability because of changes in gene regulation.