Data Mining for the Social Sciences: An Introduction - Softcover

Attewell, Paul; Monaghan, David B.

 
9780520280984: Data Mining for the Social Sciences: An Introduction

Synopsis



We live in a world of big data: the amount of information collected on human behavior each day is staggering, and exponentially greater than at any time in the past. Additionally, powerful algorithms are capable of churning through seas of data to uncover patterns. Providing a simple and accessible introduction to data mining, Paul Attewell and David B. Monaghan discuss how data mining substantially differs from conventional statistical modeling familiar to most social scientists. The authors also empower social scientists to tap into these new resources and incorporate data mining methodologies in their analytical toolkits. Data Mining for the Social Sciences demystifies the process by describing the diverse set of techniques available, discussing the strengths and weaknesses of various approaches, and giving practical demonstrations of how to carry out analyses using tools in various statistical software packages.

The synopsis may refer to a different edition of this title.

About the Authors

Paul Attewell is Distinguished Professor of Sociology at the Graduate Center of the City University of New York, where he teaches doctoral-level courses on quantitative methods, including data mining, as well as courses on the sociology of education and on social stratification. Professor Attewell is the principal investigator of a grant from the National Science Foundation that supports an interdisciplinary initiative on data mining in the social and behavioral sciences and education. In projects funded by the Spencer, Gates, and Ford Foundations, he has also studied issues of access and inequality in K-12 schools and in higher education. One of his previous books, Passing the Torch: Does Higher Education for the Disadvantaged Pay Off Across the Generations?, won the Grawemeyer Prize in Education and the American Educational Research Association’s prize for outstanding book in 2009.

David B. Monaghan is a doctoral candidate in Sociology at the Graduate Center of the City University of New York, and has taught courses on quantitative research methods, demography, and education. His research is focused on the relationship between higher education and social stratification.

From the Back Cover

"Attewell and Monaghan show us how to find our way in the mountains of big data generated by administrative records, commercial transactions, and online traffic. They explain clearly why our usual methods won't work and how social scientists can apply data science innovations to answer our questions. Their introduction is useful now and also prepares us for developments yet to come."—Michael Hout, New York University

“The analysis of big data is becoming a core enterprise in social research. Paul Attewell and David Monaghan provide an excellent and accessible introduction to data mining tools that forward-looking social scientists can use for analyzing such data. I highly recommend it for experienced researchers and graduate students alike.”—Glenn Firebaugh, Seven Rules for Social Research

"Most social scientists are unaware of the latest data mining tools that are available to help them discover patterns in the data they analyze. With lucid prose, clear examples, and sound practical advice, Data Mining for the Social Sciences will open the eyes of many."—Stephen L. Morgan, Johns Hopkins University

Excerpt. © Reprinted by permission. All rights reserved.

Data Mining for the Social Sciences

An Introduction

By Paul Attewell, David B. Monaghan, Darren Kwong

UNIVERSITY OF CALIFORNIA PRESS

Copyright © 2015 Paul Attewell and David B. Monaghan
All rights reserved.
ISBN: 978-0-520-28098-4

Contents

Acknowledgments
PART 1. CONCEPTS
1. What Is Data Mining?
2. Contrasts with the Conventional Statistical Approach
3. Some General Strategies Used in Data Mining
4. Important Stages in a Data Mining Project
PART 2. WORKED EXAMPLES
5. Preparing Training and Test Datasets
6. Variable Selection Tools
7. Creating New Variables Using Binning and Trees
8. Extracting Variables
9. Classifiers
10. Classification Trees
11. Neural Networks
12. Clustering
13. Latent Class Analysis and Mixture Models
14. Association Rules
Conclusion
Bibliography
Notes
Index


CHAPTER 1

WHAT IS DATA MINING?


Data mining (DM) is the name given to a variety of computer-intensive techniques for discovering structure and for analyzing patterns in data. Using those patterns, DM can create predictive models, classify things, or identify different groups or clusters of cases within data. Data mining and its close cousins, machine learning and predictive analytics, are already widely used in business and are starting to spread into social science and other areas of research.

A partial list of current data mining methods includes:

• association rules

• recursive partitioning or decision trees, including CART (classification and regression trees) and CHAID (chi-squared automatic interaction detection), boosted trees, forests, and bootstrap forests

• multi-layer neural network models and "deep learning" methods

• naive Bayes classifiers and Bayesian networks

• clustering methods, including hierarchical, k-means, nearest neighbor, linear and nonlinear manifold clustering

• support vector machines

• "soft modeling" or partial least squares latent variable modeling


DM is a young area of scholarship, but it is growing very rapidly. As we speak, new methods are appearing, old ones are being modified, and strategies and skills in using these methods are accumulating. The potential and importance of DM are becoming widely recognized. In just the last two years the National Science Foundation has poured millions of dollars into new research initiatives in this area.

DM methods can be applied to quite different domains, for example to visual data, in reading handwriting or recognizing faces within digital pictures. DM is also being used to analyze texts—for example to classify the content of scientific papers or other documents—hence the term text mining. In addition, DM analytics can be applied to digitized sound, to recognize words in phone conversations, for example. In this book, however, we focus on the most common domain: the use of DM methods to analyze quantitative or numerical data.

Miners look for veins of ore and extract these valuable parts from the surrounding rock. By analogy, data mining looks for patterns or structure in data. But what does it mean to say that we look for structure in data? Think of a computer screen that displays thousands of pixels, points of light or dark. Those points are raw data. But if you scan those pixels by eye and recognize in them the shapes of letters and words, then you are finding structures in the data—or, to use another metaphor, you are turning data into information.

The equivalent to the computer screen for numerical data is a spreadsheet or matrix, where each column represents a single variable and each row contains data for a different case or person. Each cell within the spreadsheet contains a specific value for one person on one particular variable.

How do you recognize patterns or regularities or structures in this kind of raw numerical data? Statistics provides various ways of expressing the relations between the columns and rows of data in a spreadsheet. The most familiar one is a correlation matrix. Instead of repeating the raw data, with its thousands of observations and dozens of variables, a correlation matrix represents just the relations between each variable and each other variable. It is a summary, a simplification of the raw data.
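As a rough illustration of how a correlation matrix condenses a spreadsheet of raw data, here is a minimal Python sketch. The variable names and the five cases are invented for this example and do not come from the book; the pearson and correlation_matrix helpers are likewise hypothetical names.

```python
from math import sqrt

# Hypothetical toy spreadsheet: each row is one case, each column one variable.
columns = ["age", "income", "years_school"]
rows = [
    [25, 30.0, 12],
    [32, 42.0, 14],
    [47, 58.0, 16],
    [51, 49.0, 12],
    [38, 61.0, 18],
]

def pearson(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(data_rows):
    """Correlate every column of the data with every other column."""
    cols = list(zip(*data_rows))  # turn rows into columns
    return [[pearson(a, b) for b in cols] for a in cols]

corr = correlation_matrix(rows)

# Print one labeled row of the matrix per variable.
for name, values in zip(columns, corr):
    print(name.ljust(14), " ".join(f"{v:6.2f}" for v in values))
```

However many cases the spreadsheet holds, the summary has only one row and one column per variable, which is what makes it a simplification of the raw data.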

Few of us can read a correlation matrix easily, or recognize a meaningful pattern in it, so we typically go through a second step in looking for structures in numerical data. We create a model that summarizes the relations in the correlation matrix. An ordinary least squares (OLS) regression model is one common example. It translates a correlation matrix into a much smaller regression equation that we can more easily understand and interpret.
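For the simplest case of one predictor, the step from correlation to regression equation can be sketched in a few lines of Python. The data and the names simple_ols and predict are assumptions made for this illustration, not material from the book. The sketch relies on the standard identity that the OLS slope equals the correlation multiplied by the ratio of the standard deviations of y and x.

```python
from math import sqrt

def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, r), where r is the Pearson correlation.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / sqrt(sxx * syy)
    slope = r * sqrt(syy / sxx)          # r times (sd of y / sd of x)
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return intercept, slope, r

# Hypothetical data: years of schooling and income (in thousands).
schooling = [10, 12, 12, 14, 16, 16, 18]
income = [24.0, 30.0, 33.0, 38.0, 47.0, 45.0, 55.0]

intercept, slope, r_xy = simple_ols(schooling, income)

def predict(years):
    """Use the fitted equation to predict income for a new case."""
    return intercept + slope * years

print(f"income = {intercept:.2f} + {slope:.2f} * schooling (r = {r_xy:.2f})")
print(f"predicted income at 15 years of schooling: {predict(15):.2f}")
```

Two numbers, an intercept and a slope, stand in for the whole table of cases, and the same equation can then be applied to cases that were not in the original data.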

A statistical model is more than just a summary derived from raw data, though. It is also a tool for prediction, and it is this second property that makes DM especially useful. Banks accumulate huge databases about customers, including records of who defaulted on loans. If bank analysts can turn those data into a model to accurately predict who will default on a loan, then they can reject the riskiest new loan applications and avoid losses. If Amazon.com can accurately assess your tastes in books, based on your previous purchases and your similarity to other customers, and then tempt you with a well-chosen book recommendation, then the company will make more profit. If a physician can obtain an NMR scan of cell tissue and predict from that data whether a tumor is likely to be malignant or benign, then the doctor has a powerful tool at her disposal.

Our world is awash with digital data. By finding patterns in data, especially patterns that can accurately predict important outcomes, DM is providing a very valuable service. Accurate prediction can inform a decision and lead to an action. If that cell tissue is most likely malignant, then one should schedule surgery. If that person's predicted risk of default is high, then don't approve the loan.

But why do we need DM for this? Wouldn't traditional statistical methods fulfill the same function just as well?

Conventional statistical methods do provide predictive models, but they have significant weaknesses. DM methods offer an alternative to conventional methods, in some cases a superior alternative that is less subject to those problems. We will later enumerate several advantages of DM, but for now we point out just the most obvious one. DM is especially well suited to analyzing very large datasets with many variables and/or many cases—what's known as Big Data.

Conventional statistical methods sometimes break down when applied to very large datasets, either because they cannot handle the computational aspects, or because they face more fundamental barriers to estimation. An example of the latter is when a dataset contains more variables than observations, a combination that conventional regression models cannot handle, but that several DM methods can.
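The barrier that arises with more variables than observations can be made concrete with a small Python sketch. Ordinary least squares must solve a system built from the matrix X'X, and when there are more columns than rows that matrix is rank-deficient, so no unique solution exists. The excerpt does not say which methods cope with this situation; ridge regression is used below purely as one well-known example of a penalized method that does, and the data, the penalty value lam, and all function names are assumptions for illustration.

```python
# Hypothetical toy data: n = 2 cases (rows) but p = 3 variables (columns).
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
y = [1.0, 2.0]

def transpose(m):
    return [list(col) for col in zip(*m)]

def gram(m):
    """Return X'X, the p-by-p matrix that OLS needs to invert."""
    cols = transpose(m)
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def rank(m, tol=1e-9):
    """Matrix rank by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n_rows, n_cols = len(m), len(m[0])
    rk = 0
    for c in range(n_cols):
        pivot = max(range(rk, n_rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue  # no usable pivot in this column
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, n_rows):
            f = m[i][c] / m[rk][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
        if rk == n_rows:
            break
    return rk

def solve(a, b):
    """Solve a square, nonsingular system a * x = b by Gaussian elimination."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            aug[i] = [u - f * v for u, v in zip(aug[i], aug[c])]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

p = len(X[0])
xtx = gram(X)
xty = [sum(col[i] * y[i] for i in range(len(y))) for col in transpose(X)]

# With p > n, X'X cannot have full rank, so OLS has no unique solution.
ols_rank = rank(xtx)

# Ridge regression adds a penalty lam to the diagonal, restoring full rank.
lam = 1.0  # assumed penalty strength, chosen arbitrarily for the example
ridge_matrix = [[xtx[i][j] + (lam if i == j else 0.0) for j in range(p)]
                for i in range(p)]
beta = solve(ridge_matrix, xty)

print(f"rank of X'X: {ols_rank} (need {p} for a unique OLS solution)")
print("ridge coefficients:", [round(b, 4) for b in beta])
```

The penalty trades a little bias for a system that can actually be solved, which is the general flavor of how regularized methods sidestep this estimation barrier.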

DM not only overcomes certain limitations of conventional statistical methods, it also helps transcend some human limitations. A researcher faced with a dataset containing hundreds of variables and many thousands of cases is likely to overlook important features of the data because of limited time and attention. It is relatively easy, for example, to inspect a...

"About this title" may refer to a different edition of this title.

Other Popular Editions of the Same Title

9780520280977: Data Mining for the Social Sciences: An Introduction

Featured Edition

ISBN 10: 0520280970 ISBN 13: 9780520280977
Publisher: University of California Press, 2015
Hardcover