Volume 16
Reviews In Computational Chemistry
Kenny B. Lipkowitz and Donald B. Boyd
The focus of this book is on methods useful in molecular design. Tutorials and reviews span (1) methods for designing compound libraries for combinatorial chemistry and high throughput screening, (2) the workings of artificial neural networks and their use in chemistry, (3) force field methods for modeling materials and designing new substances, and (4) free energy perturbation methods of practical usefulness in ligand design.
From Reviews of the Series
"This series spans all the subdisciplines in the field, from techniques to practical applications, and includes reviews from many of the acknowledged leaders in the field. The reviews cross many subdisciplines yet are both general enough to be of wide interest while including detailed information of use to workers in particular subdisciplines." - Journal of the American Chemical Society
Kenny B. Lipkowitz, PhD, is a retired Professor of Chemistry from North Dakota State University.
Donald B. Boyd was appointed Research Professor of Chemistry at Indiana University - Purdue University Indianapolis in 1994. He has published over 100 refereed journal papers and book chapters.
Geoff M. Downs and John M. Barnard
Barnard Chemical Information Ltd., 46 Uppergate Road, Stannington, Sheffield S6 6BX, United Kingdom
INTRODUCTION
Clustering is a data analysis technique that, when applied to a set of heterogeneous items, identifies homogeneous subgroups as defined by a given model or measure of similarity. Of the many uses of clustering, a prime motivation for the increasing interest in clustering methods is their use in the selection and design of combinatorial libraries of chemical structures pertinent to pharmaceutical discovery.
One feature of clustering is that the process is unsupervised, that is, there is no predefined grouping that the clustering seeks to reproduce. In contrast to supervised learning, where the task is to establish relationships between given inputs and outputs to enable prediction of the output from new inputs, in unsupervised learning only the inputs are available and the task is to reveal aspects of the underlying distribution of the input data. Clustering is thus complemented by the related supervised process of classification, in which items are assigned to predefined, labeled groups: examples include recursive partitioning, naive Bayesian analysis, and K nearest-neighbor selection. Clustering is a technique for exploratory data analysis and is used increasingly in preliminary analyses of large data sets of medium and high dimensionality as a method of selection, diversity analysis, and data reduction. This chapter reviews the main clustering methods that are used for analyzing chemical data sets and gives examples of their application in pharmaceutical companies. Compared to the other costs of drug discovery, clustering can add significant value at minimal cost. First, we provide an outline of clustering as a discipline and define some of the terminology. Then, we give a brief tutorial on clustering algorithms, review progress in developing the methods, and offer some example applications.
Clustering methodology has been developed and used in a variety of areas including archaeology, astronomy, biology, computer science, electronics, engineering, information science, and medicine. Good, general introductory texts on the topic of clustering include those by Sneath and Sokal, Kaufman and Rousseeuw, Everitt, and Gordon. The main text that is devoted to clustering of chemical data sets is by Willett, with review articles by Bratchell, Barnard and Downs, and Downs and Willett. The present chapter is a complement and update to the latter article. In a previous volume of this series, Lewis, Pickett, and Clark reviewed the use of diversity analysis techniques in combinatorial library design.
As will be shown in the section on Chemical Applications, the current main uses of clustering for chemical data sets are to find representative subsets from high throughput screening (HTS) and combinatorial chemistry, and to increase the diversity of in-house data sets through selection of additional compounds from other data sets. Methods suitable for compound selection are the main focus of this chapter. The methods must be able to handle large data sets of high-dimensional data. For small, low-dimensional data sets, most clustering methods are applicable, and descriptions in the standard texts and implementations available in standard statistical software packages suffice. Implementations designed for use on chemical data sets are available from most of the specialist software vendors, the majority of which were reviewed by Warr.
The overall process of clustering involves the following steps:
1. Generate appropriate descriptors for each compound in the data set.
2. Select an appropriate similarity measure.
3. Use an appropriate clustering method to cluster the data set.
4. Analyze the results.
This chapter focuses on step 3. For step 1, descriptors may include property values, biological properties, topological indexes, and structural fragments. The performance of these descriptors and forms of representation has been analyzed by Brown and by Brown and Martin. Similarity searching for step 2 has been discussed by Downs and Willett; characteristics of various similarity measures have been discussed by Barnard, Downs, and Willett. For step 4, little has been published specifically about visualization and analysis of results for chemical data sets. However, most publications that focus on implementing systems that utilize clustering do provide details of how the results were displayed or analyzed.
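As a concrete illustration of step 2, one similarity measure widely used with binary fragment fingerprints is the Tanimoto coefficient. The sketch below is illustrative only; the bit sets are invented, and a real system would derive them from the descriptors chosen in step 1:

```python
# Tanimoto coefficient on binary fingerprints, represented here as
# sets of "on" bit positions: |A & B| / |A | B|.

def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    if not a and not b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(a & b) / len(a | b)

fp1 = {1, 4, 7, 9}   # hypothetical fragment bits for compound 1
fp2 = {1, 4, 8}      # hypothetical fragment bits for compound 2

print(tanimoto(fp1, fp2))  # 2 bits in common / 5 bits in the union = 0.4
```

The coefficient ranges from 0 (no bits in common) to 1 (identical fingerprints), and 1 minus the coefficient is often used as the corresponding dissimilarity.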
The terminology associated with clustering is extensive, with many terms used to describe the same thing (reflecting the separate development of clustering methods within a multitude of disciplines). Clusters can be overlapping or nonoverlapping; if a compound occurs in more than one cluster, the clusters are overlapping. At one extreme, each compound is a member of all clusters to a certain degree. An example of this is fuzzy clustering in which the degree of membership of an individual compound is in the range 0 to 1, and the total membership summed across all clusters is normally required to be 1. This scheme contrasts with crisp clustering in which each compound's degree of membership in any cluster is either 0 or 1. At the other extreme is the situation wherein each compound is a member of exactly one cluster, in which case the clusters are said to be nonoverlapping. Intermediate situations sometimes occur, where compounds can be members of several, though not of all, clusters. The majority of clustering methods used on chemical data sets generate crisp, nonoverlapping clusters, because analysis of such clusters is relatively simple.
If a data set is analyzed in an iterative way, such that at each step a pair of clusters is merged or a single cluster is divided, the result is hierarchical, with a parent-child relationship being established between clusters at each successive level of the iteration. The successive levels can be visualized using a dendrogram, as shown in Figure 1. Each level of the hierarchy represents a partitioning of the data set into a set of clusters. In contrast, if the data set is analyzed to produce a single partition of the compounds resulting in a set of clusters, the result is then nonhierarchical. Note that the term partitioning in this context is different from the technique of partitioning (otherwise known as cell-based partitioning). The latter technique is a method of classification rather than of clustering, and a useful review of it, as applied to chemical data sets, is given by Mason and Pickett. A broad classification of the most common clustering methods is shown in Figure 2. Note that, with the wide range of clustering methods devised, some can be placed in more than one of the given categories.
If a hierarchical method starts with all compounds as singletons (in clusters by themselves) and the latter are merged iteratively until all compounds are in a single cluster, the method is said to be agglomerative. With respect to the dendrogram in Figure 1, it is a bottom-up approach. If the hierarchical method starts with all compounds in a single cluster and iteratively splits one cluster into two until all compounds are singletons, the method is divisive, that is, it is a top-down approach. If, at each split, only one descriptor is used to determine how the cluster is split, the method is monothetic; otherwise, more descriptors (typically all available) are used, and the method is polythetic.
Nonhierarchical methods encompass a wide range of different techniques to build clusters. A single-pass method is one in which the partition is created by a single pass through the data set or, if randomly accessed, in which each compound is examined only once to decide which cluster it should be assigned to. A relocation method is one in which compounds are moved from one cluster to another to try to improve on the initial estimation of the clusters. The relocating is typically accomplished based on improving a cost function describing the "goodness" of each resultant cluster. The nearest-neighbor approach is more compound centered than are the other nonhierarchical methods. In it, the environment around each compound is examined in terms of its most similar neighboring compounds, with commonality between nearest neighbors being used as a criterion for cluster formation. In mixture model clustering, the data are assumed to arise from a mixture of component densities; because the true densities are not known in advance, they are usually taken to be Gaussian (normal) distributions. Solutions to the mixture model are derived iteratively in a manner similar to the relocation methods. Topographic methods, such as use of Kohonen maps, typically apply a variable cost function with the added restriction that topographic relationships are preserved so that neighboring clusters are close in descriptor space. Other nonhierarchical methods include density-based and probabilistic methods. Density-based, or mode-seeking, methods regard the distribution of descriptors across the data set as generating patterns of high and low density that, when identified, can be used to separate the compounds into clusters. Probabilistic clustering generates nonoverlapping clusters in which a compound is assigned a probability, in the range 0 to 1, that it belongs to the chosen cluster (in contrast to fuzzy clustering, in which the clusters are overlapping and the degree of membership is not a probability).
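A minimal sketch of a single-pass method of the kind just described is the "leader"-style procedure below: each compound is assigned, in one scan, to the first cluster whose representative lies within a threshold, or else starts a new cluster. The points and threshold are invented for illustration; a real implementation would use chemical descriptors and an appropriate proximity measure:

```python
# Single-pass ("leader") clustering sketch: one scan over the data,
# each point compared against one representative per existing cluster.

def leader_cluster(points, threshold):
    """Return a cluster label for each point, assigned in a single pass."""
    leaders = []  # one representative (the first member) per cluster
    labels = []
    for p in points:
        for idx, q in enumerate(leaders):
            # Euclidean distance to this cluster's leader
            if sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 <= threshold:
                labels.append(idx)
                break
        else:
            # No existing cluster is close enough: start a new one
            leaders.append(p)
            labels.append(len(leaders) - 1)
    return labels

points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 4.9), (0.1, 0.4)]
print(leader_cluster(points, threshold=1.0))  # [0, 0, 1, 1, 0]
```

Note that the result depends on the order in which the compounds are scanned, a well-known drawback of single-pass methods.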
Having now provided a broad overview of clustering methodology, we next focus on the "classical" methods, which include hierarchical and single-pass, relocation, and nearest-neighbor nonhierarchical techniques. The classification we have described in Figure 2 is one that is commonly used by many scientists; however, it is just one of many possible classifications. Another way to differentiate between clustering techniques is to consider parametric and nonparametric methods. Parametric methods require distance-based comparisons be made. Here access to the descriptors is required (typically given as Euclidean vectors), rather than just a proximity matrix derived from the descriptors. Parametric methods can be further organized into generative and reconstructive methods. Generative methods, including mixture model, density-based, and probabilistic techniques, try to match parameters (e.g., cluster centers, variances within and between clusters, and mixing coefficients for the descriptor distributions) to the distribution of descriptors within the data set. Reconstructive methods, such as relocation and topographic, are based upon improving a given cost function. Nonparametric methods make fewer assumptions about the underlying data; they do not adapt given parameters iteratively and, in general, need only a matrix of pairwise proximities (i.e., a distance matrix).
The term proximity is used here to include similarity and dissimilarity coefficients in addition to distance measures. Individual proximity measures are not defined in this review; full definitions can be found in standard texts and in the articles by Barnard, Downs, and Willett. We now define the terms centroid and square-error, because they will be used throughout this chapter. For a cluster of s compounds each represented by a vector, let x(r) be the rth vector. The vector of the cluster centroid, x(c), is then defined as
[1]   x(c) = (1/s) Σ_{r=1}^{s} x(r)
Note that the centroid is the simple arithmetic mean of the vectors of the cluster members, and this mean is frequently used to represent the cluster as a whole. In situations where a mean is not applicable or appropriate, the median can be used to define the cluster medoid (see Kaufman and Rousseeuw for details). The square-error (also called the within-cluster variance), e^2, for a cluster is the sum of squared Euclidean distances to the centroid or medoid for all s items in that cluster:
[2]   e^2 = Σ_{r=1}^{s} ||x(r) − x(c)||^2
The square-error across all K clusters in a partition is the sum of the square-errors for each of the K clusters. (Note also that the standard deviation would be the square root of the square-error.)
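The centroid of Eq. [1] and the square-error of Eq. [2] translate directly into code. The sketch below computes both for a single cluster; the vectors are invented for illustration:

```python
# Centroid (Eq. [1]) and square-error (Eq. [2]) for one cluster of
# s compounds, each represented by a descriptor vector.

def centroid(cluster):
    """Arithmetic mean of the member vectors, dimension by dimension."""
    s = len(cluster)
    dims = len(cluster[0])
    return [sum(x[d] for x in cluster) / s for d in range(dims)]

def square_error(cluster):
    """Sum of squared Euclidean distances from each member to the centroid."""
    c = centroid(cluster)
    return sum(sum((xr - cr) ** 2 for xr, cr in zip(x, c)) for x in cluster)

cluster = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
print(centroid(cluster))      # [1.0, 1.0]
print(square_error(cluster))  # 2 + 2 + 4 = 8.0
```

The square-error of a whole partition is then just the sum of `square_error` over its K clusters.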
CLUSTERING ALGORITHMS
This chapter concentrates on the "classical" clustering methods, because they are the methods that have been applied most often in the chemical community. Standard reference works devoted to clustering algorithms include those by Hartigan, Murtagh, and Jain and Dubes.
Hierarchical Methods
Hierarchical Agglomerative
The most commonly implemented hierarchical clustering methods are those belonging to the family of sequential agglomerative hierarchical nonoverlapping (SAHN) methods. These are traditionally implemented using what is known as the stored-matrix algorithm, so named because the starting point is a matrix of all pairwise proximities between items in the data set to be clustered. Each cluster initially corresponds to an individual item (singleton). As clustering proceeds, each cluster may contain one or more items. Eventually, there evolves one cluster that contains all items. At each iteration, a pair of clusters is merged (agglomerated) and the number of clusters decreases by 1. The stored-matrix algorithm proceeds as follows:
1. Calculate the initial proximity matrix containing the pairwise proximities between all pairs of clusters (singletons) in the data set.
2. Scan the matrix to find the most similar pair of clusters, and merge them into a new cluster (thus replacing the original pair).
3. Update the proximity matrix by inactivating one set of entries of the original pair and updating the other set (now representing the new cluster) with the proximities between the new cluster and all other clusters.
4. Repeat steps 2 and 3 until just one cluster remains.
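The four steps above can be sketched as follows, using single linkage (the proximity between two clusters is the minimum pairwise distance between their members) as one concrete SAHN variant; the distance matrix is invented for illustration:

```python
# Stored-matrix agglomerative clustering (steps 1-4), single-linkage variant.

def sahn_single_linkage(dist):
    """Merge until one cluster remains; return (i, j, distance) per merge."""
    n = len(dist)
    d = [row[:] for row in dist]   # step 1: the stored proximity matrix
    active = set(range(n))         # each cluster starts as a singleton
    merges = []
    while len(active) > 1:
        # Step 2: scan the matrix for the closest pair of active clusters.
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda p: d[p[0]][p[1]])
        merges.append((i, j, d[i][j]))
        # Step 3: cluster i absorbs j; single linkage keeps the minimum
        # of the two old proximities to every other cluster.
        for k in active:
            if k not in (i, j):
                d[i][k] = d[k][i] = min(d[i][k], d[j][k])
        active.remove(j)  # inactivate j's entries
    return merges         # step 4: done when one cluster remains

dist = [[0.0, 1.0, 4.0, 5.0],
        [1.0, 0.0, 3.0, 6.0],
        [4.0, 3.0, 0.0, 2.0],
        [5.0, 6.0, 2.0, 0.0]]
print(sahn_single_linkage(dist))  # [(0, 1, 1.0), (2, 3, 2.0), (0, 2, 3.0)]
```

The sequence of merge distances returned here is exactly what a dendrogram such as Figure 1 displays. Note that the naive scan in step 2 makes this O(n^3) overall; practical implementations use more efficient update and search strategies.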
The various SAHN methods differ in the way in which the proximity between clusters is defined in step 1 and how the merged pair is represented as a single cluster in step 3. The proximity calculation in step 3 typically uses the Lance-Williams matrix-update formula:
[3]   d[k,(i,j)] = α_i d[k,i] + α_j d[k,j] + β d[i,j] + λ |d[k,i] − d[k,j]|

where d[k,(i,j)] is the proximity between cluster k and the cluster (i, j) formed from merging clusters i and j. Different values for α_i, α_j, β, and λ define various SAHN methods, some of which are shown in Table 1 and described below.
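Equation [3] can be sketched directly, with the standard parameter values for three common SAHN methods (single, complete, and group-average linkage); the distances below are invented for illustration:

```python
# Lance-Williams matrix update (Eq. [3]) with standard coefficients
# for single, complete, and group-average linkage.

def lance_williams(d_ki, d_kj, d_ij, a_i, a_j, beta, lam):
    """d[k,(i,j)] = a_i*d[k,i] + a_j*d[k,j] + beta*d[i,j] + lam*|d[k,i]-d[k,j]|"""
    return a_i * d_ki + a_j * d_kj + beta * d_ij + lam * abs(d_ki - d_kj)

def params(method, n_i, n_j):
    """Return (alpha_i, alpha_j, beta, lambda) for clusters of sizes n_i, n_j."""
    if method == "single":    # minimum of the two old distances
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":  # maximum of the two old distances
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":   # size-weighted (group-average) distance
        n = n_i + n_j
        return n_i / n, n_j / n, 0.0, 0.0
    raise ValueError(method)

d_ki, d_kj, d_ij = 3.0, 5.0, 2.0  # distances before merging i and j
for m in ("single", "complete", "average"):
    print(m, lance_williams(d_ki, d_kj, d_ij, *params(m, 1, 1)))
# single 3.0 (the minimum), complete 5.0 (the maximum), average 4.0
```

With the "single" coefficients this update reproduces exactly the `min(d[i][k], d[j][k])` rule used in the stored-matrix algorithm above, which is the point of Eq. [3]: one formula, many SAHN methods.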
Continues...
Excerpted from Reviews in Computational Chemistry by Donald B. Boyd. Copyright © 2000 by Donald B. Boyd. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.