
Machine Learning in Bioinformatics (Wiley Series in Bioinformatics) - Hardcover

 
9780470116623: Machine Learning in Bioinformatics (Wiley Series in Bioinformatics)

Synopsis

An introduction to machine learning methods and their applications to problems in bioinformatics

Machine learning techniques are increasingly being used to address problems in computational biology and bioinformatics. Novel computational techniques to analyze high-throughput data in the form of sequences, gene and protein expressions, pathways, and images are becoming vital for understanding diseases and future drug discovery. Machine learning techniques such as Markov models, support vector machines, neural networks, and graphical models have been successful in analyzing life science data because of their ability to handle the randomness and uncertainty of noisy data and to generalize.

From an internationally recognized panel of prominent researchers in the field, Machine Learning in Bioinformatics compiles recent approaches in machine learning methods and their applications in addressing contemporary problems in bioinformatics. Coverage includes: feature selection for genomic and proteomic data mining; comparing variable selection methods in gene selection and classification of microarray data; fuzzy gene mining; sequence-based prediction of residue-level properties in proteins; probabilistic methods for long-range features in biosequences; and much more.

Machine Learning in Bioinformatics is an indispensable resource for computer scientists, engineers, biologists, mathematicians, researchers, clinicians, physicians, and medical informaticists. It is also a valuable reference text for computer science, engineering, and biology courses at the upper undergraduate and graduate levels.

The synopsis may refer to a different edition of this title.

About the Authors

Yan-Qing Zhang, PhD, is an Associate Professor of Computer Science at Georgia State University, Atlanta. His research interests include hybrid intelligent systems, neural networks, fuzzy logic, evolutionary computation, Yin-Yang computation, granular computing, kernel machines, bioinformatics, medical informatics, computational Web intelligence, data mining, and knowledge discovery. He has coauthored two books and edited one book and two IEEE proceedings. He is program co-chair of the IEEE 7th International Conference on Bioinformatics & Bioengineering (IEEE BIBE 2007) and the 2006 IEEE International Conference on Granular Computing (IEEE-GrC2006).

Jagath C. Rajapakse, PhD, is Professor of Computer Engineering and Director of the BioInformatics Research Centre, Nanyang Technological University. He is also Visiting Professor in the Department of Biological Engineering, Massachusetts Institute of Technology. He completed his MS and PhD degrees in electrical and computer engineering at University at Buffalo, State University of New York. Professor Rajapakse has published over 210 peer-reviewed research articles in the areas of neuroinformatics and bioinformatics. He serves as Associate Editor for IEEE Transactions on Medical Imaging and IEEE/ACM Transactions on Computational Biology and Bioinformatics.


Excerpt. © Reprinted by permission. All rights reserved.

Machine Learning in Bioinformatics

John Wiley & Sons

Copyright © 2009 John Wiley & Sons, Inc.
All rights reserved.

ISBN: 978-0-470-11662-3

Chapter One

FEATURE SELECTION FOR GENOMIC AND PROTEOMIC DATA MINING

Sun-Yuan Kung and Man-Wai Mak

1.1 INTRODUCTION

The extreme dimensionality of genomic data (a manifestation of the curse of dimensionality) has traditionally been a serious concern in many applications. This has motivated a great deal of research in feature representation and selection, both aiming to reduce the dimensionality of the features so as to facilitate training and prediction on genomic data.

In this chapter, $N$ denotes the number of training data samples, $M$ the original feature dimension, and the full feature set is expressed as an $M$-dimensional vector process

$$\mathbf{x}(t) = [x_1(t), x_2(t), \ldots, x_M(t)]^T.$$

The subset of selected features is denoted as an $m$-dimensional vector process

$$\mathbf{y}(t) = [y_1(t), y_2(t), \ldots, y_m(t)]^T, \qquad (1.1)$$

$$y_i(t) = x_{s_i}(t), \quad i = 1, \ldots, m, \qquad (1.2)$$

where $m \le M$ and $s_i$ stands for the index of a selected feature.

From the machine learning perspective, one metric of special interest is the sample-feature ratio N/M. For many multimedia applications, the sample-feature ratio lies in a desirable range. For speech data, for example, the ratio can be as high as 100:1 or 1000:1 in favor of the training data size. For machine learning, such a favorable ratio plays a vital role in ensuring the statistical significance of training and validation.

Unfortunately, for genomic data this is often not the case. It is common for the number of samples to be barely comparable to, and sometimes severely outnumbered by, the dimension of the features. In such situations, it becomes imperative to remove the less relevant features, that is, features with a low signal-to-noise ratio (SNR).

It is commonly acknowledged that more features means more information at our disposal, that is,

$$I(A) \le I(A \cup B) \le \cdots, \qquad (1.3)$$

where $A$ and $B$ represent two features, say $x_i$ and $x_j$, respectively, and $I(X)$ denotes the information in $X$. However, the redundant and noisy nature of genomic data means that working with properly selected features is not merely advantageous but sometimes imperative.
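
The monotonic property in Equation 1.3 can be checked numerically. The following sketch is ours, not the book's: it uses synthetic discrete variables and a plug-in estimator, and by the chain rule the estimated information of a feature set never decreases when a feature is added.

```python
import numpy as np

def mutual_info(c, *features):
    """Plug-in mutual information I(c; features) for discrete data, in bits."""
    # Encode each distinct combination of feature values as one symbol.
    x = np.unique(np.stack(features, axis=1), axis=0, return_inverse=True)[1].ravel()
    mi = 0.0
    for xv in np.unique(x):
        for cv in np.unique(c):
            p_xc = np.mean((x == xv) & (c == cv))
            if p_xc > 0:
                mi += p_xc * np.log2(p_xc / (np.mean(x == xv) * np.mean(c == cv)))
    return mi

rng = np.random.default_rng(0)
C = rng.integers(0, 2, 5000)                     # class label
A = np.where(rng.random(5000) < 0.1, 1 - C, C)   # noisy copy of C
B = rng.integers(0, 3, 5000)                     # an unrelated feature
print(mutual_info(C, A))      # I(C; A)
print(mutual_info(C, A, B))   # I(C; A u B) -- never smaller than I(C; A)
```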

1.1.1 Reduction of Dimensionality (Biological Perspectives)

In gene profiling (or protein sequencing) applications, each gene (or protein sequence) corresponds to a feature. Feature selection/representation has its own special appeal from the genomic data mining perspective. For example, it is a vital preprocessing stage for microarray data. For gene expression profiles, the following factors necessitate an efficient gene selection strategy.

1. Disproportionate Feature Dimension w.r.t. Number of Training Samples. For most genomic applications, the feature dimension is excessively high relative to the size of the training data set. Some examples of the sample-feature ratio N/M are

protein sequences -> 1:1
microarray data -> 1:10 or 1:100

2. High Computational Cost. Such an extremely high dimensionality has a serious and adverse effect on performance. First, high dimensionality in feature spaces increases the computational cost in both (1) the learning phase and (2) the prediction phase. In the prediction phase, the more features used, the more computation required and the lower the retrieval speed. Fortunately, the prediction time is often linearly proportional to the number of features selected. Unfortunately, in the learning phase, the computational demand may grow exponentially with the number of features. To effectively hold down the cost of computing, the features are usually quantified on either an individual or a pairwise basis. The quantification cost is on the order of $O(M)$ and $O(M^2)$ for individual and pairwise quantification, respectively (see Section 1.2 and the sketch following this list).

3. Plenty of Irrelevant Genes. From the biological viewpoint, only a small portion of genes are strongly indicative of a targeted disease. The remaining "housekeeping" genes would not contribute relevant information. Moreover, their participation in the training and prediction phases could adversely affect the classification performance.

4. Presence of Coexpressed Genes. The presence of coexpressed genes implies that there exists abundant redundancy among the genes. Such redundancy plays a vital role and has a great influence on how to select features as well as how many to select.

5. Insight into Biological Networks. A good feature selection is also essential for studying the underlying biological processes that lead to the observed genomic phenomena. Feature selection can be instrumental for the interpretation, tracking, and visualization of a select few of the most critical genes in in vitro and in vivo gene profiling experiments. The selected genes closely relevant to a targeted disease are called biomarkers. Concentrating on such a compact subset of biomarkers facilitates a better interpretation and understanding of the role of the relevant genes. For example, for in vivo microarray data, the size of the subset must be carefully controlled in order to facilitate effective tracking and interpretation of the underlying regulation behavior and intergene networking.
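
To make the cost comparison in item 2 concrete, here is a minimal sketch; the correlation-based scores are our illustrative placeholders for the actual quantification criteria of Section 1.2.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 60, 500
X = rng.normal(size=(N, M))                # N samples, M features
y = rng.integers(0, 2, N).astype(float)    # binary class labels

# Individual quantification: one relevance score per feature -> O(M) work.
Xc, yc = X - X.mean(axis=0), y - y.mean()
indiv = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Pairwise quantification: one redundancy score per feature pair -> O(M^2) work.
pair = np.abs(np.corrcoef(X, rowvar=False))

print(indiv.shape, pair.shape)   # (500,) versus (500, 500)
```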

1.1.2 Reduction of Dimensionality (Computational Perspectives)

High dimensionality in feature spaces also increases uncertainty in classification. An excessive dimensionality can severely jeopardize the generalization capability owing to overfitting and unpredictable numerical behavior. Thus, feature selection must consider a joint optimization, and sometimes a delicate trade-off, between computational cost and prediction performance. Its success lies in a systematic approach to effective dimension reduction with minimal sacrifice of accuracy.

Recall from Equation 1.3 that more features imply higher achievable performance. This results in a monotonically increasing property: the more features selected, the more information is made available, as shown in the lower curve of Fig. 1.1a.

However, many not-so-informative genomic features are noisy and unreliable. Their inclusion is actually much more detrimental than beneficial, especially in terms of numerical computation. Two major and serious adverse effects are elaborated below:

Data Overfitting. Note that overoptimizing the training accuracy as the exclusive performance measure often results in overfitting the data set, which in turn degrades generalization and prediction ability.

It is well known that data overfitting may happen in two situations: when the feature dimension is reasonable but too few training data are available, and when the feature dimension is too high even though there is a reasonable amount of training data. What matters most is the ratio between the feature dimension and the size of the training data set. In short, classification/generalization performance depends on the sample-feature ratio.

Unfortunately, for many genomic applications, the feature dimension can be as high as or much higher than the size of the training data set. For these applications, overtraining can significantly harm generalization, and feature reduction is an effective way to alleviate the overtraining problem (see the sketch below).

Suboptimal Search. In practice, the computational resources available to most researchers are inadequate given the astronomical amounts of genomic data to be processed. High dimensionality in feature spaces increases uncertainty in numerical behavior. As a result, a computational process often converges to a solution far inferior to the true optimum, which may compromise the prediction accuracy.
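
A quick numerical illustration of the overfitting effect (a synthetic sketch of ours, not the chapter's data): with far more features than samples, even pure noise can be fit perfectly, while held-out accuracy stays near chance.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
N, M = 40, 2000                     # far more features than samples
X = rng.normal(size=(N, M))         # pure noise: no real class signal
y = rng.integers(0, 2, N)

clf = SVC(kernel="linear").fit(X, y)
print("training accuracy:", clf.score(X, y))   # typically 1.0 (memorized)
print("5-fold CV accuracy:",
      cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean())  # near 0.5
```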

In conclusion, when the feature size is too large, the degree of suboptimality reflects the performance degradation caused by data overfitting and limited computational resources (see Fig. 1.1b). This implies a nonmonotonic property of achievable performance w.r.t. feature size, as shown in Fig. 1.1c. Accordingly, and not surprisingly, the best performance is often achieved by selecting an optimal subset of features; any oversized feature subset will harm performance. Such a nonmonotonic performance curve, together with concerns about processing speed and cost, prompts the search for optimal feature selection and dimension reduction.

Before we proceed, let us use a subcellular localization example to highlight the importance of feature selection.

Example 1 (Subcellular localization). Profile alignment support vector machines (SVMs) are applied to predict the subcellular location of proteins in a eukaryotic protein data set provided by Reinhardt and Hubbard. The data set comprises 2427 annotated sequences extracted from SWISSPROT 33.0: 684 cytoplasmic, 325 extracellular, 321 mitochondrial, and 1097 nuclear proteins. Fivefold cross-validation was used to obtain the prediction accuracy. The accuracy and testing time for different numbers of features selected by a Fisher-based method are shown in Fig. 1.2. This example offers evidence of the nonmonotonic performance property on real genomic data.
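
The experiment behind Fig. 1.2 can be approximated in outline as follows. This is our sketch on synthetic data, not the authors' pipeline: `f_classif` (an ANOVA F statistic, a Fisher-type score) stands in for their Fisher-based ranking, and a generic SVM stands in for the profile alignment SVM.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the protein data: a few informative features among noise.
X, y = make_classification(n_samples=150, n_features=400, n_informative=15,
                           n_redundant=0, random_state=0)

scores, _ = f_classif(X, y)          # ANOVA F statistic, a Fisher-type score
order = np.argsort(scores)[::-1]     # rank features, best first
for m in (5, 15, 50, 200, 400):      # ranking uses all samples here (optimistic)
    acc = cross_val_score(SVC(), X[:, order[:m]], y, cv=5).mean()
    print(f"m = {m:3d}   5-fold CV accuracy = {acc:.3f}")
```

Plotting accuracy against m typically traces the nonmonotonic curve: it rises as informative features are added, then falls as noisy ones creep in.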

1.1.3 How Many Features to Select or Eliminate?

The question now is how many features should be retained, or equivalently how many should be eliminated? There are two ways to determine this number.

1. Predetermined Feature Size. A common practice is to have a user-defined threshold, but it is hard to determine the most appropriate threshold. For some applications, we occasionally may have a good empirical knowledge of the desirable size of the subset. For example, how many genes should be selected from, say, the 7129 genes in the leukemia data set? Some plausible feature dimensions are as follows:

(a) From the classification/generalization performance perspective, a sufficient sample-feature ratio is highly desirable. For this case, empirically, on the order of 100 genes seems to be a good compromise.

(b) If the study concerns a regulation network, then a few highly selective genes would allow the tracking and interpretation of cause-effect relationships among them. For such an application, 10 genes would be the right order of magnitude.

(c) For visualization, two to three genes are often selected for simultaneous display.

2. Prespecified Performance Threshold. For most applications, one does not know a priori the right size of the subset. Thus, it is useful to have a preliminary indication (formulated as a simple, closed-form mathematical criterion) of the final performance corresponding to a given size. Thereafter, it is straightforward to select or eliminate the features whose corresponding criterion values fall above or below a predefined threshold.

1.1.4 Unsupervised and Supervised Selection Criteria

The features selected serve very different objectives in unsupervised versus supervised learning scenarios (see Fig. 1.3). Therefore, each scenario induces its own type of criterion function.

1.1.4.1 Feature Selection Criteria for Unsupervised Cases

For unsupervised cases, there are two very different ways of designing the selection criteria. They depend closely on the performance metric, which can be either fidelity-driven or classification-driven.

1. Fidelity-Driven Criterion. The fidelity-driven criterion is motivated by how much of the original information is retained (or lost) when the feature dimension is reduced. The extent of the pattern spread associated with each feature $x_i$, $i = 1, \ldots, M$, is reflected in its second-order moment. The larger the second-order moment, the wider the spread, and thus the more likely the feature $x_i$ contains useful information.

There are two major types of fidelity-driven metrics:

A performance metric could be based on the mutual information $I(\mathbf{x}; \mathbf{y})$ between the full feature vector and the selected subset.

An alternative measure could be one that minimizes the reconstruction error

$$E\left[\|\mathbf{x} - \hat{\mathbf{x}}(\mathbf{y})\|^2\right],$$

where $\hat{\mathbf{x}}(\mathbf{y})$ denotes the estimate of $\mathbf{x}$ based on $\mathbf{y}$. (A numerical sketch of both fidelity-driven metrics follows the next item.)

2. Classification-Driven Criterion. From the classification perspective, the separability of data subclusters plays an important role. Thus, the corresponding criterion depends on how well the selected features reveal the subcluster structure. Higher-order statistics, as exploited in independent component analysis (ICA), have been adopted as a popular metric. For more discussion on this subject, see Ref.
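
As referenced above, here is a minimal numerical sketch of the two fidelity-driven metrics, assuming a linear estimator $\hat{\mathbf{x}}(\mathbf{y})$ fit by least squares; the variance ranking and the synthetic data are our illustrative choices.

```python
import numpy as np

def variance_ranking(X):
    # Fidelity-driven ranking: a larger second-order moment (variance)
    # suggests a wider spread and hence more retained information.
    return np.argsort(X.var(axis=0))[::-1]

def reconstruction_error(X, keep):
    # Estimate x from the kept features y with a linear least-squares map
    # (one choice of x_hat(y)), then report the mean squared error per entry.
    Y = X[:, keep]
    W = np.linalg.lstsq(Y, X, rcond=None)[0]   # x ~ y W
    return np.mean((X - Y @ W) ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50)) * rng.uniform(0.1, 3.0, 50)  # unequal spreads
keep = variance_ranking(X)[:10]                # retain the 10 widest features
print(reconstruction_error(X, keep))
```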

1.1.4.2 Feature Selection Criteria for Supervised Cases

The ultimate objective for supervised cases lies in a high classification/prediction accuracy. Ideally speaking, if the classification information is known, denoted by $C$, the simplest criterion would be $I(C; \mathbf{y})$. However, the comparison between $I(C; \mathbf{x})$ and $I(C; \mathbf{y})$ often provides a more useful metric. For example, it is desirable to have

$$I(C; \mathbf{y}) \to I(C; \mathbf{x}),$$

while keeping the feature dimension $m$ as small as possible. However, the above formulation is numerically difficult to achieve. The only practical solution known to exist is one that makes full use of the feedback from the actual classification result, which is computationally very demanding. (This feedback-based method is related to the wrapper approach to be discussed in Section 1.4.5.)

To overcome this problem, an SNR-type criterion based on Fisher discriminant analysis is very appealing. (Note that the Fisher discriminant offers a convenient metric for the interclass separability embedded in each feature.) Such a feature selection approach entails computing Fisher's discriminant, denoted $\mathrm{FD}_i$, $i = 1, \ldots, M$, which represents the ratio of the intercluster distance to the intracluster variance for each individual feature. (This is related to the filter approach to be discussed in Section 1.4.1.)
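
A sketch of the per-feature Fisher discriminant for a two-class problem (our minimal version; the chapter's exact multiclass definition may differ), including the threshold-based selection of Section 1.1.3:

```python
import numpy as np

def fisher_discriminant(X, y):
    """FD_i per feature: squared distance between the two class means
    divided by the sum of the within-class variances (binary labels 0/1)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0)
    return num / (den + 1e-12)

def select_above_threshold(fd, tau):
    # Filter-style selection: keep features whose FD_i exceeds the threshold.
    return np.flatnonzero(fd > tau)
```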

1.1.5 Chapter Organization

The organization of the chapter is as follows. Section 1.2 provides a systematic way to quantify the information/redundancy of/among features, followed by discussions of approaches to ranking the relevant features and eliminating the irrelevant ones in Section 1.3. Then, in Section 1.4, two supervised feature selection methods, namely filter and wrapper, are introduced. For the former, the features are selected without explicit information on classifiers or classification results, whereas for the latter, the selection requires such information explicitly. Section 1.5 introduces a new scenario called self-supervised learning, in which prior known group labels are assigned to the features instead of the vectors. A novel SVM-based feature selection method called Vector-Index-Adaptive SVM, or simply VIA-SVM, is proposed for this new scenario. The chapter finishes with experimental procedures showing how self-supervised learning and VIA-SVM can be applied to (protein-sequence-based) subcellular localization analysis.

1.2 QUANTIFYING INFORMATION/REDUNDANCY OF/AMONG FEATURES

Quantification of information and redundancy depends on how the information is represented. A representative feature is one that can represent a group of similar features. Denote $S$ as a feature subset, that is, $S \equiv \{y_i\}$, $i = 1, \ldots, m$. Besides the general case, of most interest is either a single individual feature ($m = 1$) or a pair of features ($m = 2$). A generic term $I(S)$ will be used temporarily to denote the information pertaining to $S$, as its exact form depends on the application scenario.
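
For instance, with mutual information as the instantiation of $I(S)$, the individual ($m = 1$) and pairwise ($m = 2$) quantifications can be estimated as below; this is a sketch using scikit-learn's nonparametric estimators, not the chapter's prescribed procedure.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def individual_information(X, y):
    # I(x_i; class) for each feature i: M quantities in total.
    return mutual_info_classif(X, y, random_state=0)

def pairwise_redundancy(X, i, j):
    # I(x_i; x_j), a redundancy measure for one feature pair
    # (computing all pairs costs O(M^2) such estimates).
    return mutual_info_regression(X[:, [j]], X[:, i], random_state=0)[0]
```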

(Continues...)


Excerpted from Machine Learning in Bioinformatics. Copyright © 2009 by John Wiley & Sons, Inc. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

"About this title" may refer to a different edition of this title.
