This synopsis may refer to a different edition of this title.
Acknowledgments
PART 1. CONCEPTS
1. What Is Data Mining?
2. Contrasts with the Conventional Statistical Approach
3. Some General Strategies Used in Data Mining
4. Important Stages in a Data Mining Project
PART 2. WORKED EXAMPLES
5. Preparing Training and Test Datasets
6. Variable Selection Tools
7. Creating New Variables Using Binning and Trees
8. Extracting Variables
9. Classifiers
10. Classification Trees
11. Neural Networks
12. Clustering
13. Latent Class Analysis and Mixture Models
14. Association Rules
Conclusion
Bibliography
Notes
Index
WHAT IS DATA MINING?
Data mining (DM) is the name given to a variety of computer-intensive techniques for discovering structure and for analyzing patterns in data. Using those patterns, DM can create predictive models, or classify things, or identify different groups or clusters of cases within data. Data mining and its close cousins machine learning and predictive analytics are already widely used in business and are starting to spread into social science and other areas of research.
A partial list of current data mining methods includes:
• association rules
• recursive partitioning or decision trees, including CART (classification and regression trees) and CHAID (chi-squared automatic interaction detection), boosted trees, forests, and bootstrap forests
• multi-layer neural network models and "deep learning" methods
• naive Bayes classifiers and Bayesian networks
• clustering methods, including hierarchical, k-means, nearest neighbor, linear and nonlinear manifold clustering
• support vector machines
• "soft modeling" or partial least squares latent variable modeling
DM is a young area of scholarship, but it is growing very rapidly. As we speak, new methods are appearing, old ones are being modified, and strategies and skills in using these methods are accumulating. The potential and importance of DM are becoming widely recognized. In just the last two years the National Science Foundation has poured millions of dollars into new research initiatives in this area.
DM methods can be applied to quite different domains, for example to visual data, in reading handwriting or recognizing faces within digital pictures. DM is also being used to analyze texts—for example to classify the content of scientific papers or other documents—hence the term text mining. In addition, DM analytics can be applied to digitized sound, to recognize words in phone conversations, for example. In this book, however, we focus on the most common domain: the use of DM methods to analyze quantitative or numerical data.
Miners look for veins of ore and extract these valuable parts from the surrounding rock. By analogy, data mining looks for patterns or structure in data. But what does it mean to say that we look for structure in data? Think of a computer screen that displays thousands of pixels, points of light or dark. Those points are raw data. But if you scan those pixels by eye and recognize in them the shapes of letters and words, then you are finding structures in the data—or, to use another metaphor, you are turning data into information.
The equivalent to the computer screen for numerical data is a spreadsheet or matrix, where each column represents a single variable and each row contains data for a different case or person. Each cell within the spreadsheet contains a specific value for one person on one particular variable.
How do you recognize patterns or regularities or structures in this kind of raw numerical data? Statistics provides various ways of expressing the relations between the columns and rows of data in a spreadsheet. The most familiar one is a correlation matrix. Instead of repeating the raw data, with its thousands of observations and dozens of variables, a correlation matrix represents just the relations between each variable and each other variable. It is a summary, a simplification of the raw data.
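To make the idea of a correlation matrix concrete, here is a minimal sketch in plain Python. The small data matrix and the helper names `pearson` and `correlation_matrix` are hypothetical, chosen only for illustration; real projects would normally rely on a statistics package instead.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(rows):
    """rows: one list per case, holding one value per variable."""
    columns = list(zip(*rows))  # transpose: one tuple per variable
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

# Hypothetical raw data: 5 cases (rows) by 3 variables (columns).
data = [
    [1.0, 2.0, 9.0],
    [2.0, 4.1, 7.5],
    [3.0, 6.2, 6.0],
    [4.0, 7.9, 4.0],
    [5.0, 10.1, 2.5],
]

corr = correlation_matrix(data)
for row in corr:
    print(["%.3f" % r for r in row])
```

The 5-by-3 table of raw values is reduced to a 3-by-3 table of pairwise relations. The size of that summary depends only on the number of variables, so it would still be 3 by 3 if the dataset held thousands of cases.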
Few of us can read a correlation matrix easily, or recognize a meaningful pattern in it, so we typically go through a second step in looking for structures in numerical data. We create a model that summarizes the relations in the correlation matrix. An ordinary least squares (OLS) regression model is one common example. It translates a correlation matrix into a much smaller regression equation that we can more easily understand and interpret.
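The step from summary statistics to a regression equation can be sketched for the simplest case, one predictor and one outcome. With a single predictor, the OLS slope equals the correlation multiplied by the ratio of the two standard deviations, and the intercept follows from the means. The data and the function name `ols_from_summary` below are hypothetical; with several predictors the coefficients depend on the full correlation matrix, which this sketch does not cover.

```python
from math import sqrt

def ols_from_summary(x, y):
    """Fit y = intercept + slope * x using only means, SDs, and the correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    r = cov / (sd_x * sd_y)             # correlation between x and y
    slope = r * sd_y / sd_x             # OLS slope for a single predictor
    intercept = mean_y - slope * mean_x
    return slope, intercept, r

# Hypothetical data: one predictor and one outcome for 5 cases.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 6.2, 7.9, 10.1]

slope, intercept, r = ols_from_summary(x, y)
print("y = %.2f + %.2f * x  (r = %.3f)" % (intercept, slope, r))

def predict(new_x):
    """Use the fitted equation to predict the outcome for a new case."""
    return intercept + slope * new_x
```

Two numbers, a slope and an intercept, now stand in for the whole table, and `predict` applies them to a case that was not in the original data.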
A statistical model is more than just a summary derived from raw data, though. It is also a tool for prediction, and it is this second property that makes DM especially useful. Banks accumulate huge databases about customers, including records of who defaulted on loans. If bank analysts can turn those data into a model to accurately predict who will default on a loan, then they can reject the riskiest new loan applications and avoid losses. If Amazon.com can accurately assess your tastes in books, based on your previous purchases and your similarity to other customers, and then tempt you with a well-chosen book recommendation, then the company will make more profit. If a physician can obtain an NMR scan of cell tissue and predict from that data whether a tumor is likely to be malignant or benign, then the doctor has a powerful tool at her disposal.
Our world is awash with digital data. By finding patterns in data, especially patterns that can accurately predict important outcomes, DM is providing a very valuable service. Accurate prediction can inform a decision and lead to an action. If that cell tissue is most likely malignant, then one should schedule surgery. If that person's predicted risk of default is high, then don't approve the loan.
But why do we need DM for this? Wouldn't traditional statistical methods fulfill the same function just as well?
Conventional statistical methods do provide predictive models, but they have significant weaknesses. DM methods offer an alternative to conventional methods, in some cases a superior alternative that is less subject to those problems. We will later enumerate several advantages of DM, but for now we point out just the most obvious one. DM is especially well suited to analyzing very large datasets with many variables and/or many cases—what's known as Big Data.
Conventional statistical methods sometimes break down when applied to very large datasets, either because they cannot handle the computational aspects, or because they face more fundamental barriers to estimation. An example of the latter is when a dataset contains more variables than observations, a combination that conventional regression models cannot handle, but that several DM methods can.
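The barrier in the more-variables-than-observations case can be illustrated with a small sketch. OLS solves the normal equations, which requires the matrix X'X to be invertible; with fewer cases than variables, X'X is singular, so no unique solution exists. The passage does not say which DM methods cope with this, so the ridge penalty below is only an assumed example of one well-known remedy. The data, the penalty value `lam`, and the helper names are hypothetical.

```python
def gram(X):
    """Return X'X for a matrix X given as a list of rows (cases)."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Hypothetical dataset: 2 cases but 3 variables (more variables than cases).
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

G = gram(X)
det_plain = det3(G)   # zero: X'X cannot be inverted, so OLS has no unique solution

# Assumed remedy for illustration: ridge regression adds a penalty lam
# to the diagonal of X'X, which makes the matrix invertible.
lam = 1.0
G_ridge = [[G[i][j] + (lam if i == j else 0.0) for j in range(3)]
           for i in range(3)]
det_ridge = det3(G_ridge)

print("det(X'X)           =", det_plain)
print("det(X'X + lam * I) =", det_ridge)
```

A zero determinant for X'X means infinitely many coefficient vectors fit the two cases equally well. The penalized matrix has a nonzero determinant, so a single solution can be computed.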
DM not only overcomes certain limitations of conventional statistical methods, it also helps transcend some human limitations. A researcher faced with a dataset containing hundreds of variables and many thousands of cases is likely to overlook important features of the data because of limited time and attention. It is relatively easy, for example, to inspect a...