Center for Nonlinear Studies

Thursday, September 29, 2005
11:00 AM - 12:00 PM
CNLS Conference Room (TA-3, Bldg 1690)
Seminar
Extracting Biological Meaning from Gene Expression and/or Mutation Data Using Machine Learning and Ontologies
Ben Goertzel
Virginia Polytechnic Institute and Biomind LLC
A novel algorithmic approach for recognizing biologically meaningful patterns in microarray gene expression and/or mutation (SNP or heteroplasmic mutation) data is presented. Results are described for a number of datasets, including those related to Chronic Fatigue Syndrome, Parkinson's Disease, aging, and lung and prostate cancer. A supervised learning algorithm (genetic programming) is used to learn an ensemble of classification models that distinguish two groups of gene expression or mutation profiles. Input features are either direct data values (gene expression or mutation) or, optionally for gene expression data, "enhanced feature values" derived from them, each corresponding to the average gene expression across a particular Gene Ontology (GO) or Protein Information Resource (PIR) category. Each feature is assigned a "usefulness value" indicating the percentage of successful classification models that use that feature. Each feature is also associated with a "utilization vector," which has one entry for each high-quality classification model found, indicating whether or not the feature was used in that model. These utilization vectors are then clustered using a variant of hierarchical clustering called Omniclust. The result is a set of utilization-based clusters, in which features are grouped together if classification models habitually found it useful to consider them together because of coexpression or multi-feature interactions. Applying the method to a variety of human datasets, we find that, compared to traditional statistical methods, the new method yields "important features" of greater biological relevance and finds clusters that have dramatically higher mathematical quality (in the sense of homogeneity and separation) and that also yield novel insights into the underlying biological processes.
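
The feature bookkeeping described in the abstract can be illustrated with a minimal Python sketch. It is not the speaker's implementation. It assumes that each high-quality model can be summarized as the set of feature names it uses, and that "successful" and "high-quality" refer to the same set of models. The gene names, category names, and function names are hypothetical, and the Omniclust clustering step is omitted.

```python
from statistics import mean


def enhanced_features(expression, categories):
    """Average expression per category for one sample.

    expression: dict mapping gene name -> expression value.
    categories: dict mapping category name -> list of member genes
                (hypothetical stand-ins for GO or PIR categories).
    Genes with no measurement are skipped; categories with no measured
    genes are left out.
    """
    result = {}
    for category, genes in categories.items():
        values = [expression[g] for g in genes if g in expression]
        if values:
            result[category] = mean(values)
    return result


def utilization_vectors(features, good_models):
    """One 0/1 entry per high-quality model: did the model use the feature?

    good_models: list of sets of feature names (assumed representation).
    """
    return {f: [1 if f in model else 0 for model in good_models]
            for f in features}


def usefulness_values(vectors):
    """Percentage of the high-quality models that use each feature."""
    return {f: (100.0 * sum(v) / len(v)) if v else 0.0
            for f, v in vectors.items()}


# Hypothetical toy data, for illustration only.
sample = {"geneA": 2.0, "geneB": 4.0, "geneC": 1.0}
categories = {"category_1": ["geneA", "geneB"],
              "category_2": ["geneC", "geneD"]}
models = [{"geneA", "category_1"},
          {"geneA", "geneC"},
          {"category_1"},
          {"geneA", "category_1"}]

enhanced = enhanced_features(sample, categories)
vectors = utilization_vectors(["geneA", "geneC", "category_1"], models)
usefulness = usefulness_values(vectors)
```

In this toy data, geneA and category_1 are each used by three of the four models, while geneC is used by only one. A clustering method applied to the utilization vectors would group features that tend to appear in the same models.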