If you want to become a data scientist (or to understand what they’re talking about), statistics is one of the key domains that you’re going to have to learn. In "Naked Statistics," Charles Wheelan does a surprisingly good job of providing a guide that anyone can grasp. This slim, accessible book provides a wonderfully entertaining introduction to the key intuitions required to understand statistics.
Why should you care about stats?
As Wheelan points out, we are surrounded by statistics, whether it’s weather forecasts, insurance underwriting, or baseball stats. At the same time, introductory mathematical texts usually do a woeful job of telling you why you should care about stats.
I know that combinatorics is an important thing to learn, but it’s hard to get excited about my ability to predict the likelihood of pulling a given color of ball out of a bag with 10 white and 8 black balls. It just doesn’t seem like something that’s going to come up much in day-to-day life!
In "Naked Statistics," instead of focusing on contrived examples and practice questions, Wheelan focuses on introducing the principles underlying stats and why they’re important. And that is both the strength and the weakness of the book.
What does it cover?
"Naked Statistics" provides a well rounded introduction to a range of key topics within the field of statistics including:
Wheelan starts out with an introduction to descriptive statistics, showing how various stats can be used to provide a useful (if imperfect) summary of information we might care about. From the Gini index (often used for economic inequality) to GPAs (academic achievement) he shows both the strengths and weaknesses of simplifying a comparison down to a single number.
Taking the example of Netflix movie recommendations, Wheelan then introduces the idea of a correlation coefficient (relegating the actual equation to an appendix at the back of the book) and how it can be used to determine how similar two viewers might be. He then runs through how such a coefficient could be used to find viewers with similar tastes and to recommend highly reviewed films from one viewer to other similar viewers.
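Wheelan keeps the formula in the appendix, but the idea is simple enough to sketch in a few lines of Python. The viewer ratings below are invented for illustration (the book doesn’t include Netflix’s actual data):

```python
# Hypothetical 1-5 star ratings from two viewers for the same five films.
viewer_a = [5, 3, 4, 1, 2]
viewer_b = [4, 3, 5, 2, 1]

def pearson(x, y):
    """Pearson correlation coefficient: the covariance of x and y
    divided by the product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

print(pearson(viewer_a, viewer_b))  # 0.8: strongly similar tastes
```

A coefficient near +1 means two viewers rate films very similarly, near -1 means their tastes are opposite, and near 0 means there is no linear relationship at all.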
According to Wheelan, by challenging 100 Michelob drinkers to a “blind taste test” live during the 1981 Super Bowl, Schlitz proved that it knew just as much about statistics as it did about brewing! Introducing the binomial distribution (a series of Bernoulli trials), he points out how small a risk Schlitz was taking. Assuming that there is no material difference in taste between the two brands, the odds of at least 40 of the 100 drinkers preferring Schlitz were over 98%.
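You can verify that 98% figure yourself. If the beers really are indistinguishable, each drinker is effectively a fair coin flip, and the count of Schlitz-preferrers follows a binomial distribution:

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# If taste is indistinguishable, each of the 100 drinkers picks
# Schlitz with probability 0.5. Chance that at least 40 do:
print(prob_at_least(40, 100, 0.5))  # ≈ 0.982
```

So the “live gamble” was really a bet Schlitz expected to win more than 98 times out of 100.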
He also introduces the law of large numbers (why not to buy a lottery ticket) and shows why extended warranties on inexpensive products are a bad buy. He also runs the reader through a worked example of the risks of screening large populations for rare but serious diseases.
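The screening problem is worth working through once. The numbers below are illustrative, not Wheelan’s exact figures: even a test that is 99% accurate in both directions produces mostly false alarms when the disease is rare.

```python
# Illustrative numbers (not Wheelan's exact figures): a disease that
# affects 1 in 1,000 people, and a test that is 99% accurate both ways.
prevalence = 0.001
sensitivity = 0.99   # P(test positive | has disease)
specificity = 0.99   # P(test negative | no disease)

true_pos = prevalence * sensitivity
false_pos = (1 - prevalence) * (1 - specificity)

# Bayes' rule: probability a person who tests positive is actually sick.
p_disease_given_pos = true_pos / (true_pos + false_pos)
print(round(p_disease_given_pos, 3))  # ≈ 0.09
```

Roughly 9 in 10 positives come from healthy people, simply because there are a thousand times more of them to generate false positives.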
The Monty Hall problem
Next up, he tackles the famous (at least among statisticians) Monty Hall problem, based on the “Let’s Make a Deal” TV show that premiered in 1963. In the show, contestants pick one of three doors to try to win a prize. The host, who knows where the prize is, then opens another door, showing that it doesn’t hold the prize, and asks whether they want to stay with their first choice or switch to the remaining door.
Intuitively, it might seem that it doesn’t really matter whether you change your mind or not, but Wheelan proves in three independent ways that switching raises your chance of winning from 1/3 to 2/3.
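A fourth way, in the spirit of Wheelan’s empirical proofs, is to simply simulate the game many times (this sketch is mine, not from the book):

```python
import random

def monty_hall(trials=100_000, switch=True, seed=42):
    """Simulate the Monty Hall game and return the contestant's win rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)
        pick = rng.randrange(3)
        # The host opens a door that is neither the pick nor the prize.
        opened = next(d for d in range(3) if d != pick and d != prize)
        if switch:
            # Switch to the one remaining unopened door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == prize)
    return wins / trials

print(monty_hall(switch=True))   # ≈ 2/3
print(monty_hall(switch=False))  # ≈ 1/3
```

The intuition: your first pick is wrong 2/3 of the time, and whenever it is wrong, switching wins.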
Central Limit Theorem
Wheelan introduces the Central Limit Theorem as “the LeBron James of statistics – if LeBron were also a supermodel, a Harvard professor, and the winner of the Nobel Peace Prize”! It’s an incredibly important statistical concept that students sometimes find hard to grasp. But with an example based on determining whether a missing bus was going to a marathon or the International Festival of Sausage based on the average weight of the riders, Wheelan makes the concept very easy to understand.
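The bus example can be simulated in a few lines. The weights below are invented to echo the book’s setup: a population that is mostly light marathon runners plus a minority of heavier festival-goers.

```python
import random
import statistics

# Made-up population: 9,000 marathon runners (lighter) and 1,000
# sausage-festival attendees (heavier), weights in pounds.
rng = random.Random(0)
population = ([rng.gauss(140, 15) for _ in range(9000)]
              + [rng.gauss(220, 25) for _ in range(1000)])

# Average weight of many random busloads of 60 riders each.
sample_means = [statistics.mean(rng.sample(population, 60))
                for _ in range(2000)]

# The sample means pile up tightly around the population mean, even
# though the population itself is bimodal; a bus averaging ~220 lb is
# therefore almost certainly not headed to the marathon.
print(round(statistics.mean(population), 1))
print(round(statistics.mean(sample_means), 1))
print(round(statistics.stdev(sample_means), 1))
```

That is the Central Limit Theorem in action: averages of reasonably sized samples behave in a predictable, bell-shaped way even when the underlying population doesn’t.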
Using the example of his statistics professor’s concerns when he did implausibly better on his final exam than on his midterm one year, Wheelan highlights both the strengths and weaknesses of statistical inference. He introduces p-values, correlation vs. causation, and statistical significance vs. effect size. He also runs us through confidence intervals, Type I vs. Type II errors, and why false negatives for a spam filter might be more acceptable than for cancer screening.
In the polling section, Wheelan introduces the principles underlying polling. Building on the Central Limit Theorem, he shows how a small sample can be predictive of the opinions of a large population and how to calculate the standard error for a given poll. He also highlights some of the key risks around polling, including sample selection, the phrasing of questions, and the likelihood of the respondents actually telling the truth.
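The standard-error calculation itself is a one-liner. The poll numbers here are hypothetical:

```python
from math import sqrt

def poll_standard_error(p, n):
    """Standard error of a sample proportion p from a sample of size n."""
    return sqrt(p * (1 - p) / n)

# Hypothetical poll: 52% of 1,000 respondents favor a candidate.
p, n = 0.52, 1000
se = poll_standard_error(p, n)
low, high = p - 1.96 * se, p + 1.96 * se
print(f"SE = {se:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With a sample of 1,000, the 95% confidence interval spans roughly plus or minus three percentage points, which is why so many published polls quote a margin of error near 3%.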
Finally, in the regression section, he runs through how a regression can be used to quantify the relationship between a particular variable and an outcome we care about – while controlling for other variables.
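Wheelan keeps the math out of the main text, but a minimal sketch shows what “controlling for other variables” means in practice. Everything here is invented: made-up data where earnings are generated to rise by about 3 (in $1,000s) per year of education and 1 per year of experience, and coefficients recovered by solving the ordinary least squares normal equations directly.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Ordinary least squares via the normal equations X'X b = X'y.
    Returns coefficients, intercept first."""
    X = [[1.0] + row for row in X]  # prepend an intercept column
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Columns: years of education, years of experience (made-up data).
X = [[12, 10], [16, 2], [16, 8], [12, 4], [18, 5], [14, 12], [20, 1], [14, 6]]
# Earnings in $1,000s, generated as -20 + 3*education + 1*experience
# plus a little noise.
y = [26.5, 29.8, 36.3, 19.9, 39.2, 34.4, 40.7, 28.1]

b0, b_edu, b_exp = ols(X, y)
print(f"education ≈ {b_edu:.1f}, experience ≈ {b_exp:.1f}")
```

The education coefficient answers exactly the kind of question Wheelan poses: holding experience fixed, how much does one more year of education appear to be worth?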
This book’s key strength is also its main limitation: the focus is on the intuitions and not the details of the math. For anyone looking to get up to speed with the “why” of statistics, it’s an ideal starting point. It introduces many of the key concepts with practical, relevant examples that show just how important statistics are.
If you’re planning on a career in data science, the lack of practical, hands-on exercises makes it hard to truly internalize all of the concepts or to be confident in applying the principles to other problems. And the lack of equations means that anyone who is a little out of practice reading mathematical notation is not going to start to build that critical skill by reading this book.
That said, this is still the perfect introduction to statistics for the aspiring data scientist; you’re just going to have to do some additional reading after you finish the book.