
Modeling with Data: Tools and Techniques for Scientific Computing - Hardcover

 

Synopsis

Modeling with Data fully explains how to execute computationally intensive analyses on very large data sets, showing readers how to determine the best methods for solving a variety of different problems, how to create and debug statistical models, and how to run an analysis and evaluate the results.


Ben Klemens introduces a set of open and unlimited tools, and uses them to demonstrate data management, analysis, and simulation techniques essential for dealing with large data sets and computationally intensive procedures. He then demonstrates how to easily apply these tools to the many threads of statistical technique, including classical, Bayesian, maximum likelihood, and Monte Carlo methods. Klemens's accessible survey describes these models in a unified and nontraditional manner, providing alternative ways of looking at statistical concepts that often befuddle students. The book includes nearly one hundred sample programs of all kinds. Links to these programs will be available on this page at a later date.



Modeling with Data will interest anyone looking for a comprehensive guide to these powerful statistical tools, including researchers and graduate students in the social sciences, biology, engineering, economics, and applied mathematics.

The synopsis may refer to a different edition of this title.

About the Author

Ben Klemens is a senior statistician at the National Institute of Mental Health. He is also a guest scholar at the Center on Social and Economic Dynamics at the Brookings Institution.

From the Back Cover

"I am a psychiatric geneticist but my degree is in neuroscience, which means that I now do far more statistics than I have been trained for. I cannot overstate to you the magnitude of the change in my productivity since finding this book. Even after reading the first few chapters, which explain why data analysis is painful and how one can implement a long-term solution, my research moved forward greatly."--Amber Baum, National Institute of Mental Health

"I enjoyed reading this book and learned a great deal from it. Modeling with Data filled in a lot of holes in my knowledge, and I think that will be true in general for other readers as well. There is a lot of high-quality and interesting material here."--Brendan Halpin, University of Limerick

Excerpt. © Reprinted by permission. All rights reserved.

Modeling with Data

Tools and Techniques for Scientific Computing
By Ben Klemens

PRINCETON UNIVERSITY PRESS

Copyright © 2009 Princeton University Press
All rights reserved.

ISBN: 978-0-691-13314-0

Contents

Preface
Chapter 1. Statistics in the modern day

PART I COMPUTING

Chapter 2. C
2.1 Lines
2.2 Variables and their declarations
2.3 Functions
2.4 The debugger
2.5 Compiling and running
2.6 Pointers
2.7 Arrays and other pointer tricks
2.8 Strings
2.9 Errors

Chapter 3. Databases
3.1 Basic queries
3.2 Doing more with queries
3.3 Joins and subqueries
3.4 On database design
3.5 Folding queries into C code
3.6 Maddening details
3.7 Some examples

Chapter 4. Matrices and models
4.1 The GSL's matrices and vectors
4.2 apop_data
4.3 Shunting data
4.4 Linear algebra
4.5 Numbers
4.6 gsl_matrix and gsl_vector internals
4.7 Models

Chapter 5. Graphics
5.1 plot
5.2 Some common settings
5.3 From arrays to plots
5.4 A sampling of special plots
5.5 Animation
5.6 On producing good plots
5.7 Graphs: nodes and flowcharts
5.8 Printing and LaTeX

Chapter 6. More coding tools
6.1 Function pointers
6.2 Data structures
6.3 Parameters
6.4 Syntactic sugar
6.5 More tools

PART II STATISTICS

Chapter 7. Distributions for description
7.1 Moments
7.2 Sample distributions
7.3 Using the sample distributions
7.4 Non-parametric description

Chapter 8. Linear projections
8.1 Principal component analysis
8.2 OLS and friends
8.3 Discrete variables
8.4 Multilevel modeling

Chapter 9. Hypothesis testing with the CLT
9.1 The Central Limit Theorem
9.2 Meet the Gaussian family
9.3 Testing a hypothesis
9.4 ANOVA
9.5 Regression
9.6 Goodness of fit

Chapter 10. Maximum likelihood estimation
10.1 Log likelihood and friends
10.2 Description: Maximum likelihood estimators
10.3 Missing data
10.4 Testing with likelihoods

Chapter 11. Monte Carlo
11.1 Random number generation
11.2 Description: Finding statistics for a distribution
11.3 Inference: Finding statistics for a parameter
11.4 Drawing a distribution
11.5 Non-parametric testing

Appendix A: Environments and makefiles
A.1 Environment variables
A.2 Paths
A.3 Make

Appendix B: Text processing
B.1 Shell scripts
B.2 Some tools for scripting
B.3 Regular expressions
B.4 Adding and deleting
B.5 More examples

Appendix C: Glossary
Bibliography
Index

Chapter One

Statistics in the modern day

Retake the falling snow: each drifting flake
Shapeless and slow, unsteady and opaque,
A dull dark white against the day's pale white
And abstract larches in the neutral light.
-Nabokov (1962, lines 13-16)

Statistical analysis has two goals, which directly conflict. The first is to find patterns in static: given the infinite number of variables that one could observe, how can one discover the relations and patterns that make human sense? The second goal is a fight against apophenia, the human tendency to invent patterns in random static. Given that someone has found a pattern regarding a handful of variables, how can one verify that it is not just the product of a lucky draw or an overactive imagination?

Or, consider the complementary dichotomy of objective versus subjective. The objective side is often called probability; e.g., given the assumptions of the Central Limit Theorem, its conclusion is true with mathematical certainty. The subjective side is often called statistics; e.g., our claim that observed quantity A is a linear function of observed quantity B may be very useful, but Nature has no interest in it.

This book is about writing down subjective models based on our human understanding of how the world works, but which are heavily advised by objective information, including both mathematical theorems and observed data.

The typical scheme begins by proposing a model of the world, then estimating the parameters of the model using the observed data, and then evaluating the fit of the model. This scheme includes both a descriptive step (describing a pattern) and an inferential step (testing whether there are indications that the pattern is valid). It begins with a subjective model, but is heavily advised by objective data.

Figure 1.1 shows a model in flowchart form. First, the descriptive step: data and parameters are fed into a function-which may be as simple as a is correlated to b, or may be a complex set of interrelations-and the function spits out some output. Then comes the testing step: evaluating the output based on some criterion, typically regarding how well it matches some portion of the data. Our goal is to find those parameters that produce output that best meets our evaluation criterion.

The Ordinary Least Squares (OLS) model is a popular and familiar example, pictured in Figure 1.2. [If it is not familiar to you, we will cover it in Chapter 8.] Let X indicate the independent data, β the parameters, and y the dependent data. Then the function box consists of the simple equation y_out = Xβ, and the evaluation step seeks to minimize squared error, (y − y_out)².

For a simulation, the function box may be a complex flowchart in which variables are combined non-linearly with parameters, then feed back upon each other in unpredictable ways. The final step would evaluate how well the simulation output corresponds to the real-world phenomenon to be explained.

The key computational problem of statistical modeling is to find the parameters at the beginning of the flowchart that will output the best evaluation at the end. That is, for a given function and evaluation in Figure 1.1, we seek a routine to take in data and produce the optimal parameters, as in Figure 1.3. In the OLS model above, there is a simple, one-equation solution to the problem: β_best = (X′X)⁻¹X′y. But for more complex models, such as simulations or many multilevel models, we must strategically try different sets of parameters to hunt for the best ones.
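As a preview of the computing tools to come (the GSL is covered in Chapter 4), here is a minimal sketch of that one-equation OLS solution in C. It is an illustration, not the book's own code: the four observations are invented, and gsl_multifit_linear computes β and the minimized squared error for us.

```c
#include <stdio.h>
#include <gsl/gsl_multifit.h>

/* Minimal OLS sketch: find the beta that minimizes (y - X*beta)^2.
   The 4x2 data matrix (a column of ones plus one regressor) and the
   y values are invented purely for illustration. */
int main(void){
    double xdata[] = {1, 1,
                      1, 2,
                      1, 3,
                      1, 4};
    double ydata[] = {2.1, 3.9, 6.2, 7.8};

    gsl_matrix_view X = gsl_matrix_view_array(xdata, 4, 2);
    gsl_vector_view y = gsl_vector_view_array(ydata, 4);

    gsl_vector *beta = gsl_vector_alloc(2);
    gsl_matrix *cov  = gsl_matrix_alloc(2, 2);
    double sq_error;

    gsl_multifit_linear_workspace *w = gsl_multifit_linear_alloc(4, 2);
    gsl_multifit_linear(&X.matrix, &y.vector, beta, cov, &sq_error, w);

    printf("beta = (%g, %g), squared error = %g\n",
            gsl_vector_get(beta, 0), gsl_vector_get(beta, 1), sq_error);

    gsl_multifit_linear_free(w);
    gsl_vector_free(beta);
    gsl_matrix_free(cov);
    return 0;
}
```

Compile with gcc ols.c -lgsl -lgslcblas -lm.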

And that's the whole book: develop models whose parameters and tests may discover and verify interesting patterns in the data. But the setup is incredibly versatile, and with different function specifications, the setup takes many forms. Along with a few minor asides, this book will cover the following topics, all of which are variants of Figure 1.1:

Probability: how well-known distributions can be used to model data

Projections: summarizing many-dimensional data in two or three dimensions

Estimating linear models such as OLS

Classical hypothesis testing: using the Central Limit Theorem (CLT) to ferret out apophenia

Designing multilevel models, where one model's output is the input to a parent model

Maximum likelihood estimation

Hypothesis testing using likelihood ratio tests

Monte Carlo methods for describing parameters

"Nonparametric" modeling (which comfortably fits into the parametric form here), such as smoothing data distributions

Bootstrapping to describe parameters and test hypotheses

The Snowflake problem, or a brief history of statistical computing

The simplest models in the above list have only one or two parameters, like a Binomial (n, p) distribution which is built from n identical draws, each of which is a success with probability p [see Chapter 7]. But draws in the real world are rarely identical-no two snowflakes are exactly alike. It would be nice if an outcome variable, like annual income, were determined entirely by one variable (like education), but we know that a few dozen more enter into the picture (like age, race, marital status, geographical location, et cetera).
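To make the Binomial example concrete, the GSL can evaluate such a distribution directly. A tiny sketch (mine, not the book's, with made-up numbers):

```c
#include <stdio.h>
#include <gsl/gsl_randist.h>

/* Probability of exactly k successes in n identical draws, each a
   success with probability p. The values here are made up. */
int main(void){
    unsigned int n = 10, k = 4;
    double p = 0.25;
    printf("P(k = %u | n = %u, p = %g) = %g\n",
            k, n, p, gsl_ran_binomial_pdf(k, p, n));
    return 0;
}
```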

The problem is to design a model that accommodates that sort of complexity, in a manner that allows us to actually compute results. Before computers were common, the best we could do was analysis of variance methods (ANOVA), which ascribed variation to a few potential causes [see Sections 7.1.3 and 9.4].

The first computational milestone, circa the early 1970s, arrived when civilian computers had the power to easily invert matrices, a process that is necessary for most linear models. The linear models such as ordinary least squares then became dominant [see Chapter 8].

The second milestone, circa the mid 1990s, arrived when desktop computing power was sufficient to easily gather enough local information to pin down the global optimum of a complex function-perhaps thousands or millions of evaluations of the function. The functions that these methods can handle are much more general than the linear models: you can now write and optimize models with millions of interacting agents or functions consisting of the sum of a thousand sub-distributions [see Chapter 10].

The ironic result of such computational power is that it allows us to return to the simple models like the Binomial distribution. But instead of specifying a fixed n and p for the entire population, every observation could take on a value of n that is a function of the individual's age, race, et cetera, and a value of p that is a different function of age, race, et cetera [see Section 8.4].

The models in Part II are listed more-or-less in order of complexity. The infinitely quotable Albert Einstein advised, "make everything as simple as possible, but not simpler." The Central Limit Theorem tells us that errors often are Normally distributed, and it is often the case that the dependent variable is basically a linear or log-linear function of several variables. If such descriptions do no violence to the reality from which the data were culled, then OLS is the method to use, and using more general techniques will not be any more persuasive. But if these assumptions do not apply, we no longer need to assume linearity to overcome the snowflake problem.

The pipeline

A statistical analysis is a guided series of transformations of the data from its raw form as originally written down to a simple summary regarding a question of interest.

The flow above, in the statistics textbook tradition, picked up halfway through the analysis: it assumes a data set that is in the correct form. But the full pipeline goes from the original messy data set to a final estimation of a statistical model. It is built from functions that each incrementally transform the data in some manner, like removing missing data, selecting a subset of the data, or summarizing it into a single statistic like a mean or variance.

Thus, you can think of this book as a catalog of pipe sections and filters, plus a discussion of how to fit elements together to form a stream from raw data to final publishable output. As well as the pipe sections listed above, such as the ordinary least squares or maximum likelihood procedures, the book also covers several techniques for directly transforming data, computing statistics, and welding all these sections into a full program:

Structuring programs using modular functions and the stack of frames

Programming tools like the debugger and profiler

Methods for reliability testing functions and making them more robust

Databases, and how to get them to produce data in the format you need

Talking to external programs, like graphics packages that will generate visualizations of your data

Finding and using pre-existing functions to quickly estimate the parameters of a model from data

Optimization routines: how they work and how to use them

Monte Carlo methods: getting a picture of a model via millions of random draws

To make things still more concrete, almost all of the sample code in this book is available from the book's Web site, linked from http://press.princeton.edu/titles/8706.html. This means that you can learn by running and modifying the examples, or you can cut, paste, and modify the sample code to get your own analyses running more quickly. The programs are listed and given a complete discussion on the pages of this book, so you can read it on the bus or at the beach, but you are very much encouraged to read through this book while sitting at your computer, where you can run the sample code, see what happens given different settings, and otherwise explore.

Figure 1.4 gives a typical pipeline from raw data to final paper. It works at a number of different layers of abstraction: some segments involve manipulating individual numbers, some segments take low-level numerical manipulation as given and operate on database tables or matrices, and some segments take matrix operations as given and run higher-level hypothesis tests.

The lowest level

Chapter 2 presents a tutorial on the C programming language itself. The work here is at the lowest level of abstraction, covering nothing more difficult than adding columns of numbers. The chapter also discusses how C facilitates the development and use of libraries: sets of functions written by past programmers that provide the tools to do work at higher and higher levels of abstraction (and thus ignore details at lower levels).

For a number of reasons to be discussed below, the book relies on the C programming language for most of the pipe-fitting, but if there is a certain section that you find useful (the appendices and the chapter on databases come to mind) then there is nothing keeping you from welding that pipe section to others using another programming language or system.

Dealing with large data sets

Computers today are able to crunch numbers a hundred times faster than they did a decade ago-but the data sets they have to crunch are a thousand times larger. Geneticists routinely pull 550,000 genetic markers each from a hundred or a thousand patients. The US Census Bureau's 1% sample covers almost 3 million people. Thus, the next layer of abstraction provides specialized tools for dealing with data sets: databases and a query language for organizing data. Chapter 3 presents a new syntax for talking to a database, Structured Query Language (SQL). You will find that many types of data manipulation and filtering that are difficult in traditional languages or stats packages are trivial-even pleasant-via SQL.

As Huber (2000, p 619) explains: "Large real-life problems always require a combination of database management and data analysis.... Neither database management systems nor traditional statistical packages are up to the task." The solution is to build a pipeline, as per Figure 1.4, that includes both database management and statistical analysis sections. Much of graceful data handling is in knowing where along the pipeline to place a filtering operation. The database is the appropriate place to filter out bad data, join together data from multiple sources, and aggregate data into group means and sums. C matrices are appropriate for filtering operations like those from earlier that took in data, applied a function like (X′X)⁻¹X′y, and then measured (y_out − y)².
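The book does this pipe-fitting with SQLite; as a hedged sketch of the division of labor, here is the same idea using the raw SQLite C API rather than the book's own helper functions. The survey.db file, the people table, and its age and income columns are all hypothetical.

```c
#include <stdio.h>
#include <sqlite3.h>

/* Print each row the query returns. */
static int print_row(void *unused, int ncols, char **vals, char **names){
    for (int i = 0; i < ncols; i++)
        printf("%s = %s\t", names[i], vals[i] ? vals[i] : "NULL");
    printf("\n");
    return 0;
}

int main(void){
    sqlite3 *db;
    char *errmsg = NULL;
    if (sqlite3_open("survey.db", &db) != SQLITE_OK)  /* hypothetical file */
        return 1;
    /* Filter out bad data and aggregate into group means inside the
       database, before any C-side matrix work. */
    sqlite3_exec(db,
        "select age, avg(income) as mean_income "
        "from people "
        "where income is not null and income > 0 "
        "group by age;",
        print_row, NULL, &errmsg);
    if (errmsg){
        fprintf(stderr, "query failed: %s\n", errmsg);
        sqlite3_free(errmsg);
    }
    sqlite3_close(db);
    return 0;
}
```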

Because your data probably did not come pre-loaded into a database, Appendix B discusses text manipulation techniques, so when the database expects your data set to use commas but your data is separated by erratic tabs, you will be able to quickly surmount the problem and move on to analysis.

Computation

The GNU Scientific Library works at the numerical computation layer of abstraction. It includes tools for all of the procedures commonly used in statistics, such as linear algebra operations, looking up values of the F, t, and χ² distributions, and finding maxima of likelihood functions. Chapter 4 presents some basics for data-oriented use of the GSL.
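The distribution lookups, for instance, are one-liners. A sketch with illustrative numbers:

```c
#include <stdio.h>
#include <gsl/gsl_cdf.h>

/* Upper-tail probability of a chi-squared statistic: the numbers
   (a statistic of 11.07 on 5 degrees of freedom) are illustrative. */
int main(void){
    double stat = 11.07, df = 5;
    printf("P(Chi^2_%g > %g) = %g\n", df, stat, gsl_cdf_chisq_Q(stat, df));
    return 0;
}
```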

The Apophenia library, primarily covered in Chapter 4, builds upon these other layers of abstraction to provide functions at the level of data analysis, model fitting, and hypothesis testing.

Pretty pictures

Good pictures can be essential to good research. They often reveal patterns in data that look like mere static when that data is presented as a table of numbers, and are an effective means of communicating with peers and persuading grantmakers. Consistent with the rest of this book, Chapter 5 will cover the use of Gnuplot and Graphviz, two packages that are freely available for the computer you are using right now. Both are entirely automatable, so once you have a graph or plot you like, you can have your C programs autogenerate it or manipulate it in amusing ways, or can send your program to your colleague in Madras and he will have no problem reproducing and modifying your plots. Once you have the basics down, animation and real-time graphics for simulations are easy.
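The automation can be as simple as opening a pipe to the plotting program and printing commands down it. A minimal sketch of that pattern, assuming a gnuplot executable is on your path (Chapter 5 develops the details):

```c
#include <stdio.h>

/* Sketch: autogenerate a Gnuplot graph from C by printing commands
   down a pipe. Assumes gnuplot is installed and on the PATH. */
int main(void){
    FILE *gp = popen("gnuplot --persist", "w");
    if (!gp) return 1;
    fprintf(gp, "set title 'A plot, generated by a C program'\n");
    fprintf(gp, "plot '-' with linespoints\n"); /* '-' means inline data */
    for (int x = 0; x <= 10; x++)
        fprintf(gp, "%d %d\n", x, x * x);
    fprintf(gp, "e\n");  /* 'e' terminates the inline data block */
    pclose(gp);
    return 0;
}
```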

Why C?

You may be surprised to see a book about modern statistical computing based on a language composed in 1972. Why use C instead of a specialized language or package like SAS, Stata, SPSS, S-Plus, SAGE, SIENA, SUDAAN, SYSTAT, SST, SHAZAM, J, K, GAUSS, GAMS, GLIM, GENSTAT, GRETL, EViews, Egret, EQS, PcGive, MatLab, Minitab, Mupad, Maple, Mplus, Maxima, MLn, Mathematica, WinBUGS, TSP, HLM, R, RATS, LISREL, LispStat, LIMDEP, BMDP, Octave, Orange, OxMetrics, Weka, or Yorick? This may be the only book to advocate statistical computing with a general computing language, so I will take some time to give you a better idea of why modern numerical analysis is best done in an old language.

One of the side effects of a programming language being stable for so long is that a mythology builds around it. Sometimes the mythology is outdated or false: I have seen professional computer programmers and writers claim that simple structures like linked lists always need to be written from scratch in C (see Section 6.2 for proof otherwise), that it takes ten to a hundred times as long to write a program in C as in a more recently written language like R, or that because people have used C to write device drivers or other low-level work, it cannot be used for high-level work. This section is partly intended to dispel such myths.
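On the linked-list myth, for instance: a few lines using GLib's GSList type suffice. A sketch of mine, not the book's Section 6.2 example, assuming GLib is installed:

```c
#include <stdio.h>
#include <glib.h>

/* A linked list without writing one from scratch: GLib's GSList. */
int main(void){
    GSList *list = NULL;
    list = g_slist_prepend(list, "world");
    list = g_slist_prepend(list, "hello");
    for (GSList *p = list; p != NULL; p = p->next)
        printf("%s ", (char*)p->data);
    printf("\n");
    g_slist_free(list);
    return 0;
}
```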

Is C a hard language?

C was a hard language. With nothing but a basic 80s-era compiler, you could easily make many hard-to-catch mistakes. But programmers have had a few decades to identify those pitfalls and build tools to catch them. Modern compilers warn you of these issues, and debuggers let you interact with your program as it runs to catch more quirks. C's reputation as a hard language means the tools around it have evolved to make it an easy language.

Computational speed-really

Using a stats package sure beats inverting matrices by hand, but as computation goes, many stats packages are still relatively slow, and that slowness can make otherwise useful statistical methods infeasible.

(Continues...)


Excerpted from Modeling with Data by Ben Klemens. Copyright © 2009 by Princeton University Press. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

"About this title" may refer to a different edition of this title.

  • Publisher: Princeton University Press
  • Publication date: 2008
  • ISBN 10: 069113314X
  • ISBN 13: 9780691133140
  • Binding: Hardcover
  • Language: English
  • Number of pages: 472
  • Manufacturer contact: Not available


Search results for Modeling with Data: Tools and Techniques for Scientific...

Klemens, Ben
ISBN 10: 069113314X ISBN 13: 9780691133140
Used Hardcover

Seller: BooksRun, Philadelphia, PA, USA
Seller rating: 5 out of 5 stars

Hardcover. Condition: Very Good. Ship within 24hrs. Satisfaction 100% guaranteed. APO/FPO addresses supported. Item no. 069113314X-8-1

Buy used: EUR 31,60
Shipping: EUR 7,05 from USA to Germany
Quantity: 1 available

Klemens, Ben
ISBN 10: 069113314X ISBN 13: 9780691133140
Used Hardcover

Seller: ThriftBooks-Dallas, Dallas, TX, USA
Seller rating: 5 out of 5 stars

Hardcover. Condition: Very Good. No jacket. May have limited writing in cover pages. Pages are unmarked. ~ ThriftBooks: Read More, Spend Less 1.01. Item no. G069113314XI4N00

Buy used: EUR 33,90
Shipping: EUR 6,69 from USA to Germany
Quantity: 1 available

Klemens, Ben
ISBN 10: 069113314X ISBN 13: 9780691133140
Used Hardcover

Seller: Anybook.com, Lincoln, United Kingdom
Seller rating: 5 out of 5 stars

Condition: Good. This is an ex-library book and may have the usual library/used-book markings inside. This book has hardback covers. In good all-round condition. No dust jacket. Please note the image in this listing is a stock photo and may not match the covers of the actual item. 1200 grams. ISBN: 9780691133140. Item no. 5832782

Buy used: EUR 69,35
Shipping: EUR 8,01 from United Kingdom to Germany
Quantity: 1 available

Klemens, Ben
ISBN 10: 069113314X ISBN 13: 9780691133140
New Hardcover

Seller: moluna, Greven, Germany
Seller rating: 5 out of 5 stars

Condition: New. Explains how to execute computationally intensive analysis on very large data sets. This book shows readers how to determine some of the best methods for solving a variety of different problems, how to create and debug statistical models, and how to run an ... Item no. 447030844

Buy new: EUR 89,45
Shipping: free within Germany
Quantity: 3 available

Ben Klemens
ISBN 10: 069113314X ISBN 13: 9780691133140
New Hardcover

Seller: PBShop.store UK, Fairford, GLOS, United Kingdom
Seller rating: 4 out of 5 stars

HRD. Condition: New. New Book. Shipped from UK. Established seller since 2000. Item no. WP-9780691133140

Buy new: EUR 105,48
Shipping: EUR 4,91 from United Kingdom to Germany
Quantity: 3 available

Klemens, Ben
ISBN 10: 069113314X ISBN 13: 9780691133140
New Hardcover

Seller: Ria Christie Collections, Uxbridge, United Kingdom
Seller rating: 5 out of 5 stars

Condition: New. In. Item no. ria9780691133140_new

Buy new: EUR 107,73
Shipping: EUR 5,94 from United Kingdom to Germany
Quantity: 3 available

Ben Klemens
ISBN 10: 069113314X ISBN 13: 9780691133140
New Hardcover

Seller: Revaluation Books, Exeter, United Kingdom
Seller rating: 5 out of 5 stars

Hardcover. Condition: Brand New. Illustrated edition. 470 pages. 10.00x7.00x1.25 inches. In Stock. Item no. x-069113314X

Buy new: EUR 173,15
Shipping: EUR 11,92 from United Kingdom to Germany
Quantity: 2 available