Machine Learning in Bioinformatics (Wiley Series in Bioinformatics) - Hardcover

 
9780470116623: Machine Learning in Bioinformatics (Wiley Series in Bioinformatics)

Synopsis

An introduction to machine learning methods and their applications to problems in bioinformatics

Machine learning techniques are increasingly being used to address problems in computational biology and bioinformatics. Novel computational techniques to analyze high throughput data in the form of sequences, gene and protein expressions, pathways, and images are becoming vital for understanding diseases and future drug discovery. Machine learning techniques such as Markov models, support vector machines, neural networks, and graphical models have been successful in analyzing life science data because of their capabilities in handling randomness and uncertainty of data noise and in generalization.

From an internationally recognized panel of prominent researchers in the field, Machine Learning in Bioinformatics compiles recent approaches in machine learning methods and their applications in addressing contemporary problems in bioinformatics. Coverage includes: feature selection for genomic and proteomic data mining; comparing variable selection methods in gene selection and classification of microarray data; fuzzy gene mining; sequence-based prediction of residue-level properties in proteins; probabilistic methods for long-range features in biosequences; and much more.

Machine Learning in Bioinformatics is an indispensable resource for computer scientists, engineers, biologists, mathematicians, researchers, clinicians, physicians, and medical informaticists. It is also a valuable reference text for computer science, engineering, and biology courses at the upper undergraduate and graduate levels.

The synopsis may refer to a different edition of this title.

About the Authors

Yan-Qing Zhang, PhD, is an Associate Professor of Computer Science at Georgia State University, Atlanta. His research interests include hybrid intelligent systems, neural networks, fuzzy logic, evolutionary computation, Yin-Yang computation, granular computing, kernel machines, bioinformatics, medical informatics, computational Web intelligence, data mining, and knowledge discovery. He has coauthored two books and edited one book and two IEEE proceedings. He is program co-chair of the IEEE 7th International Conference on Bioinformatics & Bioengineering (IEEE BIBE 2007) and the 2006 IEEE International Conference on Granular Computing (IEEE-GrC2006).

Jagath C. Rajapakse, PhD, is Professor of Computer Engineering and Director of the BioInformatics Research Centre, Nanyang Technological University. He is also Visiting Professor in the Department of Biological Engineering, Massachusetts Institute of Technology. He completed his MS and PhD degrees in electrical and computer engineering at University at Buffalo, State University of New York. Professor Rajapakse has published over 210 peer-reviewed research articles in the areas of neuroinformatics and bioinformatics. He serves as Associate Editor for IEEE Transactions on Medical Imaging and IEEE/ACM Transactions on Computational Biology and Bioinformatics.


Excerpt. © Reprinted with permission. All rights reserved.

Machine Learning in Bioinformatics

John Wiley & Sons

Copyright © 2009 John Wiley & Sons, Inc.
All rights reserved.

ISBN: 978-0-470-11662-3

Chapter One

FEATURE SELECTION FOR GENOMIC AND PROTEOMIC DATA MINING

Sun-Yuan Kung and Man-Wai Mak

1.1 INTRODUCTION

The extreme dimensionality (also known as the curse of dimensionality) of genomic data has traditionally been a serious concern in many applications. This has motivated extensive research in feature representation and selection, both aiming at reducing the dimensionality of features to facilitate the training and prediction of genomic data.

In this chapter, N denotes the number of training data samples, M the original feature dimension, and the full feature set is expressed as an M-dimensional vector process

x(t) = [x_1(t), x_2(t), ..., x_M(t)]^T

The subset of features is denoted as an m-dimensional vector process

y(t) = [y_1(t), y_2(t), ..., y_m(t)]^T                  (1.1)
     = [x_{s_1}(t), x_{s_2}(t), ..., x_{s_m}(t)]^T      (1.2)

where m ≤ M and s_i stands for the index of a selected feature.
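For concreteness, the subset-selection notation above maps directly onto array indexing; this minimal sketch (toy values and index choices are illustrative, not from the chapter) extracts y(t) from x(t):

```python
import numpy as np

# Full feature vector x(t) with M = 6 features at a single time point (toy values).
x = np.array([0.9, 0.1, 2.3, 0.4, 1.7, 0.05])

# Indices s_1, ..., s_m of the selected features (m <= M), as in Eq. 1.2.
selected = [0, 2, 4]   # m = 3
y = x[selected]        # y(t) = [x_{s_1}(t), x_{s_2}(t), x_{s_3}(t)]^T

print(y)               # [0.9 2.3 1.7]
```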

From the machine learning perspective, one metric of special interest is the sample-feature ratio N/M. In many multimedia applications, the sample-feature ratio lies in a desirable range. For example, for speech data, the ratio can be as high as 100:1 or 1000:1 in favor of training data size. Such a favorable ratio plays a vital role in ensuring the statistical significance of training and validation.

Unfortunately, for genomic data this is often not the case. It is common for the number of samples to be barely comparable with, and sometimes severely outnumbered by, the dimension of the features. In such situations, it becomes imperative to remove the less relevant features, that is, features with a low signal-to-noise ratio (SNR).
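One common way to quantify a feature's SNR for two-class expression data is the ratio of the class-mean gap to the pooled spread; the following sketch (function name and toy data are ours, not from the chapter) scores each feature independently and keeps the top m:

```python
import numpy as np

def snr_scores(X, labels):
    """Per-feature signal-to-noise ratio for a two-class problem:
    |mean(+) - mean(-)| / (std(+) + std(-)). One pass per feature, i.e., O(M)."""
    pos, neg = X[labels == 1], X[labels == 0]
    num = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    den = pos.std(axis=0) + neg.std(axis=0) + 1e-12  # guard against zero variance
    return num / den

rng = np.random.default_rng(0)
N, M = 20, 50                            # real microarray data is far more extreme
labels = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(N, M))              # background noise features
X[labels == 1, 0] += 3.0                 # make feature 0 strongly class-indicative

scores = snr_scores(X, labels)
m = 5
top = np.argsort(scores)[::-1][:m]       # keep the m highest-SNR features
print(top)                               # feature 0 ranks first
```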

It is commonly acknowledged that more features mean more information at our disposal, that is,

I(A) ≤ I(A ∪ B) ≤ ...,    (1.3)

where A and B represent two features, say x_i and x_j, respectively, and I(X) denotes the information of X. However, the redundant and noisy nature of genomic data means that working with all features is not always advantageous; it is sometimes imperative to work with a properly selected subset.
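The monotonicity in Equation 1.3 can be checked empirically for discrete features if we take I to be the joint Shannon entropy (a deliberate simplification of the chapter's information measure; the helper function and toy data are ours):

```python
import numpy as np
from collections import Counter

def joint_entropy(*features):
    """Shannon entropy (bits) of the joint distribution of discrete feature columns."""
    joint = list(zip(*features))
    counts = Counter(joint)
    p = np.array(list(counts.values())) / len(joint)
    return float(-(p * np.log2(p)).sum())

# Two toy binary features observed over the same eight samples.
A = [0, 0, 1, 1, 0, 1, 0, 1]
B = [0, 1, 0, 1, 1, 1, 0, 0]

h_a  = joint_entropy(A)      # information carried by A alone -> 1.0 bit
h_ab = joint_entropy(A, B)   # information carried by A and B jointly -> 2.0 bits
assert h_a <= h_ab           # the monotonicity of Eq. 1.3
```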

1.1.1 Reduction of Dimensionality (Biological Perspectives)

In genomic applications, each gene (or protein sequence) corresponds to a feature in gene profiling (or protein sequencing) applications. Feature selection/representation has its own special appeal from the genomic data mining perspective. For example, it is a vital preprocessing stage for processing microarray data. For gene expression profiles, the following factors necessitate an efficient gene selection strategy.

1. Unproportionate Feature Dimension w.r.t. Number of Training Samples. For most genomic applications, the feature dimension far exceeds the size of the training data set. Some examples of the sample-feature ratio N/M are

protein sequences → 1:1
microarray data → 1:10 or 1:100

Such an extremely high dimensionality has a serious and adverse effect on performance. First, high dimensionality in feature spaces increases the computational cost of both (1) the learning phase and (2) the prediction phase. In the prediction phase, the more features used, the more computation is required and the lower the retrieval speed. Fortunately, the prediction time is often linearly proportional to the number of features selected. Unfortunately, in the learning phase, the computational demand may grow exponentially with the number of features. To effectively hold down the cost of computing, features are usually quantified on either an individual or a pairwise basis, at a quantification cost on the order of O(M) and O(M²), respectively (see Section 1.2).

3. Plenty of Irrelevant Genes. From the biological viewpoint, only a small portion of genes are strongly indicative of a targeted disease. The remaining "housekeeping" genes would not contribute relevant information. Moreover, their participation in the training and prediction phases could adversely affect the classification performance.

4. Presence of Coexpressed Genes. The presence of coexpressed genes implies that there exists abundant redundancy among the genes. Such redundancy plays a vital role and has a great influence on how to select features as well as how many to select.

5. Insight into Biological Networks. Good feature selection is also essential for studying the underlying biological processes that lead to the observed genomic phenomena. Feature selection can be instrumental for interpretation/tracking as well as visualization of a select few of the most critical genes in in vitro and in vivo gene profiling experiments. The selected genes closely relevant to a targeted disease are called biomarkers. Concentrating on such a compact subset of biomarkers facilitates a better interpretation and understanding of the role of the relevant genes. For example, for in vivo microarray data, the size of the subset must be carefully controlled in order to facilitate effective tracking/interpretation of the underlying regulation behavior and intergene networking.

1.1.2 Reduction of Dimensionality (Computational Perspectives)

High dimensionality in feature spaces also increases uncertainty in classification. Excessive dimensionality can severely jeopardize generalization capability due to overfitting and unpredictable numerical behavior. Thus, feature selection must jointly optimize, and sometimes delicately trade off, computational cost and prediction performance. Its success lies in a systematic approach to effective dimension reduction with minimum sacrifice of accuracy.

Recall from Equation 1.3 that the more features, the higher the achievable performance. This results in a monotonically increasing property: the more features selected, the more information is made available, as shown by the lower curve in Fig. 1.1a.

However, there are a lot of not-so-informative genomic features that are noisy and unreliable. Their inclusion is actually much more detrimental (than beneficial), especially in terms of numeric computation. Two major and serious adverse effects are elaborated below:

Data Overfitting. Note that overoptimizing the training accuracy as the exclusive performance measure often results in overfitting the data set, which in turn degrades generalization and prediction ability.
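A quick numerical illustration of this effect (the classifier choice and toy dimensions are ours, not the chapter's): fit a simple nearest-centroid classifier to pure-noise data with far more features than samples, and training accuracy greatly exceeds held-out accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 30, 500                        # far fewer samples than features
labels = np.tile([0, 1], N)           # 2N labels, alternating classes
X = rng.normal(size=(2 * N, M))       # pure noise: no feature is truly informative

def nearest_centroid_acc(Xtr, ytr, Xte, yte):
    """Fit class centroids on (Xtr, ytr) and report accuracy on (Xte, yte)."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float((pred == yte).mean())

Xtr, ytr, Xte, yte = X[:N], labels[:N], X[N:], labels[N:]
train_acc = nearest_centroid_acc(Xtr, ytr, Xtr, ytr)   # evaluated on its own training set
test_acc  = nearest_centroid_acc(Xtr, ytr, Xte, yte)   # evaluated on held-out noise
print(train_acc, test_acc)   # training accuracy far above chance; test accuracy near 0.5
```

The classifier memorizes noise: with M = 500 dimensions and only N = 30 training samples, each training point is close to its own class centroid by construction, while held-out points are classified at chance level.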

It is well known that data overfitting may happen in two situations: one is when the feature dimension is reasonable but too few training data are available; the other is when the feature...

"About this title" may refer to a different edition of this title.