Center for Nonlinear Studies
Thursday, September 29, 2005
11:00 AM - 12:00 PM
CNLS Conference Room (TA-3, Bldg 1690)

Seminar

Extracting Biological Meaning from Gene Expression and/or Mutation Data Using Machine Learning and Ontologies

Ben Goertzel
Virginia Polytechnic Institute and Biomind LLC

A novel algorithmic approach for recognizing biologically meaningful patterns in microarray gene expression and/or mutation (SNP or heteroplasmic mutation) data is presented. Results on a number of datasets, including those related to Chronic Fatigue Syndrome, Parkinson's disease, aging, and lung and prostate cancer, are described. A supervised learning algorithm (genetic programming) is used to learn an ensemble of classification models distinguishing two groups of gene expression or mutation profiles. Inputs consist of features that are either direct (gene expression or mutation) data values or, optionally in the case of gene expression data, "enhanced feature values" derived from these, each one corresponding to the average gene expression across a certain Gene Ontology (GO) or Protein Information Resource (PIR) category. Each feature is assigned a "usefulness value" indicating the percentage of successful classification models that use that feature. Each feature is also associated with a "utilization vector," which has an entry for each high-quality classification model found, indicating whether or not the feature was used in that model. These utilization vectors are then clustered using a variant of hierarchical clustering called Omniclust. The result is a set of utilization-based clusters, in which features are grouped together if classification models habitually found it useful to consider them together because of coexpression or multi-feature interactions. Applying the method to a variety of human datasets, we find that, compared to traditional statistical methods, the new method yields "important features" of greater biological relevance and finds clusters that have dramatically higher mathematical quality (in the sense of homogeneity and separation) and also yield novel insights into the underlying biological processes.
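
The feature bookkeeping described in the abstract can be illustrated with a minimal Python sketch. It is not the speaker's implementation. The genetic programming step is not shown: the ensemble is assumed to be given already, as a list of sets naming the features each high-quality model uses. Omniclust is not specified in the abstract, so plain single-linkage agglomerative clustering over Jaccard distance stands in for it here. All function names, gene names, and category labels are hypothetical.

```python
from statistics import mean


def category_features(expression, categories):
    """Enhanced features: average expression over each category's genes.

    expression: dict mapping gene name -> expression value for one sample.
    categories: dict mapping category label -> list of gene names.
    """
    out = {}
    for label, genes in categories.items():
        values = [expression[g] for g in genes if g in expression]
        if values:  # skip categories with no measured genes
            out[label] = mean(values)
    return out


def utilization_vectors(features, models):
    """One 0/1 entry per model: 1 if the model uses the feature."""
    return {f: [1 if f in m else 0 for m in models] for f in features}


def usefulness(vectors):
    """Percentage of models in the ensemble that use each feature."""
    return {f: 100.0 * sum(v) / len(v) for f, v in vectors.items()}


def jaccard_distance(u, v):
    """1 - |models using both| / |models using either| (assumed metric)."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - both / either if either else 1.0


def cluster(vectors, threshold):
    """Single-linkage agglomerative clustering, a stand-in for Omniclust.

    Repeatedly merges the two closest clusters until the closest pair
    is farther apart than `threshold`.
    """
    clusters = [[f] for f in sorted(vectors)]

    def dist(c1, c2):
        return min(jaccard_distance(vectors[a], vectors[b])
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical ensemble of four models over four features.
models = [{"GENE_A", "GENE_B"}, {"GENE_A", "GENE_B", "GENE_C"},
          {"GENE_C"}, {"GENE_A", "GENE_B"}]
features = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

vectors = utilization_vectors(features, models)
scores = usefulness(vectors)
groups = cluster(vectors, threshold=0.5)
```

In this toy ensemble, GENE_A and GENE_B appear in exactly the same models, so their utilization vectors are identical and they fall into one cluster, while GENE_C and the unused GENE_D each remain alone.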