The Data on Barbie, Greta Gerwig, and Best Director Snubs at the Oscars

When the 2024 Academy Award nominees were announced in late January, one of the most hotly discussed topics was that Greta Gerwig, director of Barbie, was not nominated for Best Director despite the film being nominated for Best Picture. I had assumed a Best Director nomination went hand in hand with a Best Picture nomination, so how common is it for a film to be nominated for Best Picture but not Best Director? It turns out it happens fairly often, at least since 2009.

50 years of Best Picture and Best Director Oscar nominations
The chart above comes from Flatiron’s analysis of over 50 years of Best Picture and Best Director Oscar nominations. Films that win either of these awards are often nominated in both categories.

From 1970 to 2008, the Best Picture and Best Director categories had five nominees each. It was common to see four of the five Best Picture nominees also receiving a nomination for Best Director. And in 32 of these 39 years, the film that won Best Picture also won Best Director.

In 2009, the Best Picture nomination limit increased to 10 films. Best Director remained capped at five, so naturally, this resulted in more Best Director snubs than before. In terms of winners, the larger pool of Best Picture nominees seems to be helping to separate the two awards: the Best Picture and Best Director Oscars have gone to two different films in six of the last 14 years (this happened only seven times in the 39 years before 2009).

Barbenheimer

Although it’s no longer uncommon for a film to receive a Best Picture nomination without one for Best Director, Barbie wasn’t just any film. Barbie was one half of the cultural phenomenon known as Barbenheimer, a mashup of two highly anticipated and starkly different films: Barbie and director Christopher Nolan’s historical biopic Oppenheimer, both of which hit theaters on July 21, 2023. The goal of seeing both films back-to-back became one of the defining characteristics of the Barbenheimer phenomenon. While both films were hugely successful at the domestic and international box office, Barbie out-grossed Oppenheimer by an estimated half-billion dollars worldwide.

The two films dominated the zeitgeist for much of 2023, and both received enormous critical acclaim. Oppenheimer has dominated this awards season, however, garnering 13 Oscar nominations and multiple important wins at other film award ceremonies leading up to the Academy Awards on March 10.

We’ll return to how we think about “importance” in the context of nominations, but for now, let’s compare the two films along the lines of major award ceremonies, ratings, and box office revenue.

Barbie vs Oppenheimer

analysis comparing Barbie and Oppenheimer performance by major awards
The graphic above comes from our analysis comparing Barbie and Oppenheimer. Both films have numerous award nominations and have brought in over two billion dollars combined.

Setting aside the People’s Choice Awards, Oppenheimer has taken home more awards overall, despite the two films having a similar number of nominations at most award shows. Barbie appeared to be on a roll this award season, with nominations for picture, director, screenplay, actress, and supporting actor at the Golden Globes and Critics Choice Awards in early January. However, Greta Gerwig was left out of the director category when the Oscar nominees were announced on January 23. This leads to the question: which films are most similar to Barbie, not just by nomination count, but across major categories? And were those films nominated for Best Director?

Movies Like Barbie

We began our Best Director snubs analysis at Flatiron by collecting all past nominees across the entire history of the award ceremonies noted in the image above (swapping out the People’s Choice Awards for the Writers Guild Awards) for a comprehensive dataset of non-fan nominations. We also merged categories like Best Adapted Screenplay and Best Original Screenplay into one screenplay category for ease of comparison. Similarly, we lumped all acting categories (male, female, lead, and supporting) into one, and we combined Best Picture categories into one wherever an award show splits them into drama and comedy/musical (as the Golden Globes does).

With a dataset of over 3,000 nominees going back to the 1920s, we found films most similar to Barbie across our grouped screenplay, grouped actor(s), director, and picture categories using Euclidean distance, a method for finding the distance between two data points. The five films below are the most similar to Barbie according to the awards and groupings we’ve selected. Interestingly, these five films, including Gerwig’s 2017 debut film, Lady Bird, all received a Best Director nomination at the Oscars (while Gerwig’s directing work on Barbie did not).
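To make the idea concrete, here is a minimal sketch of Euclidean distance applied to nomination counts; the films and feature values below are invented placeholders, not the actual dataset we used.

```python
import numpy as np

# Hypothetical nomination-count vectors: [picture, director, screenplay, acting]
films = {
    "Barbie": np.array([5, 3, 5, 4]),
    "Film A": np.array([5, 4, 5, 3]),
    "Film B": np.array([2, 0, 1, 1]),
}

target = films["Barbie"]

# Euclidean distance between Barbie and every other film
distances = {
    name: float(np.linalg.norm(vec - target))
    for name, vec in films.items()
    if name != "Barbie"
}

# Smaller distance = more similar nomination profile
for name, dist in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f"{name}: {dist:.2f}")
```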

comparing barbie's nominations to other high-performing movies from previous award seasons

Predicting Best Director Snubs at the Oscars

A sample size of five is certainly not enough evidence to make a definitive claim of a snub, so we developed a predictive model that classifies a film as a Best Director nominee based on the other nominations it received, either at the Oscars or previous award shows. Our final model achieved 91% accuracy. For the astute reader, it also reached 93% precision and 96% recall. 
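We won’t reproduce the full model here, but as a hedged sketch of the general approach, a scikit-learn classifier scored with accuracy, precision, and recall might look like the following. The data is synthetic, and logistic regression is an assumption for illustration, not necessarily the algorithm behind our final model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one row per film, columns = nomination indicators,
# label = 1 if the film received a Best Director nomination
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 6))
y = (X.sum(axis=1) + rng.integers(0, 2, size=500) > 3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
```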

Based on films from 1927 to 2022, the best predictor of a Best Director nomination at the Oscars is a Best Picture nomination at the Oscars. This isn’t surprising, considering the overlap in nominees that we observed in the first image at the top of the article.

Other notable predictors are Best Screenplay at the Oscars or Critics Choice Awards, and Best Director at the Golden Globes or Directors Guild of America (DGA) Awards. These predictors align with intuition, given the importance of a good script and how common it is to have a filmmaker with the title of writer/director. In the case of the DGA, it’s hard to think of a more qualified group to identify the best directors of the year than the 19,000-plus directors who make up the guild’s membership.

Trained Model Predictions

Finally, we applied our trained model to our list of 2023 films that received at least one nomination in a screenplay, acting, directing, or picture category. Given the long list of accolades Barbie received at the Golden Globes, Critics Choice Awards, British Academy Film Awards (BAFTA), and all the filmmaking guild awards, our model gave Greta Gerwig a 76% chance of snagging a Best Director nomination. Considering she ranked third, just behind Christopher Nolan for Oppenheimer and Yorgos Lanthimos for Poor Things, I’d call this a snub. (Gerwig tied for third with Justine Triet for Anatomy of a Fall.)

which best director nominations were predicted by a trained model

Best Director Snubs and Flatiron’s Analysis

When we rank-order films by the predicted probability of receiving the directorial nomination, the 2017 film Three Billboards Outside Ebbing, Missouri, from writer/director Martin McDonagh, stands out as our model’s biggest snub. A film that initially received wide acclaim, it later faced criticism over its portrayal of misogyny and racism. Coincidentally, Greta Gerwig was one of the five Best Director nominees that year alongside Guillermo del Toro, Christopher Nolan, Jordan Peele, and Paul Thomas Anderson—a star-studded list of filmmakers if ever there was one.

the biggest "best director" snubs over the last 25 years
The table above shows where our model was highly confident—but ultimately, incorrect—that a film would receive the Best Director nod.

It’s worth noting that many of the films listed in our table above also appear in a recent Variety article that ranked the biggest Best Director snubs over the last 25 years. While the writer of the Variety article does not discuss his methodology, it’s always a good idea in data science to validate findings with subject matter experts. In the case of our analysis and the Variety article analysis, there seems to be some agreement. 

Final Thoughts

As with all predictive models, our model is only as good as the data it learns from. A common criticism of the Academy is its failure to nominate women and people of color across categories, particularly for Best Director. Mitigating bias and ensuring fairness in predictive models are important concepts in Big Data Ethics, but we’ll save the ways one could address these issues for another post.

Learn Data Science at Flatiron School

Data analyst is just one of the career paths you can embark on after graduating from Flatiron’s Data Science Bootcamp. Our bootcamp offers students the opportunity to graduate and begin working in the field in as little as 15 weeks. Download the course syllabus for free to see what you can learn!

Header photo courtesy of Warner Bros. Pictures

Using Scikit-Learn for Machine Learning in Python

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Given that Python is the most widely used language in data science and taught in Flatiron’s Data Science Bootcamp, we’ll begin by describing what the aforementioned terms mean before turning to the topic of using scikit-learn.

An interpreted language is one that executes instructions directly, without first compiling them into machine code, which makes it more flexible than a compiled language like C.

An object-oriented language is one that is designed around data or objects.

A high-level programming language is one that can be easily understood by humans since its syntax reflects human usage. 

Finally, dynamic semantics means that the meaning of a term, such as a variable’s type, can change at runtime depending on context. All of these attributes make Python work well for data science, since it is a flexible, easy-to-read language that handles data well.

Python’s Libraries

Python’s power is expanded by the use of libraries. A Python library is a collection of related modules that allow one to perform common tasks without having to create the functions for the tasks anew. Two libraries that are inevitably used when working with Python in data science are NumPy and pandas. The former allows one to efficiently deal with large matrices and perform mathematical operations on those objects. The latter offers data structures and operations for data manipulation and analysis.
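As a quick illustration (the column names and values below are made up):

```python
import numpy as np
import pandas as pd

# NumPy: efficient arrays and vectorized math
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
print(matrix.mean(axis=0))   # column means

# pandas: labeled, tabular data structures built on top of NumPy
df = pd.DataFrame({"height_in": [69, 72, 66], "weight_lb": [160, 190, 150]})
print(df.describe())         # summary statistics for each column
```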

In Flatiron’s Data Science Bootcamp, among the first tools that one learns when being introduced to Python are NumPy and pandas. While there are a number of other widely used tools in Python for data science, I’ll mention a few that are also taught in the bootcamp: Matplotlib for visualization, TensorFlow and Keras for deep learning, and scikit-learn for machine learning.

From here on out, we’ll focus on scikit-learn since it is the primary library for machine learning in Python.

Scikit-Learn

Two Types of Machine Learning

Machine learning is a branch of artificial intelligence and computer science that uses data and algorithms to imitate the way humans learn. Machine learning is often distinguished between supervised and unsupervised learning.

In supervised learning, the algorithm learns from labeled data. Here, each example in the data set is associated with a corresponding label or output. An example of supervised learning would be an algorithm that is learning to correctly identify spam or not-spam email. 

In unsupervised learning, the algorithm learns patterns and structures from unlabeled data without any guidance in the form of labeled outputs. Market clustering, where an algorithm creates clusters of individuals based on demographic data, is an example of unsupervised learning.
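Here is a minimal, hedged sketch of both flavors in scikit-learn, using synthetic toy data rather than real email or market data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 points with 2 features and 3 underlying groups
X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Supervised: the labels y guide the learning
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: only X is used; the algorithm finds its own grouping
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])
```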

Using Scikit-Learn: Key Features

Scikit-learn is a popular open-source machine learning library for the Python programming language. It provides simple and efficient tools for data mining and data analysis and is built on top of other scientific computing packages such as NumPy, SciPy, and matplotlib. The following are all key features that help data scientists using scikit-learn work smoothly and efficiently. 

Consistent API

Scikit-learn provides a uniform and consistent API for various machine learning algorithms, making it easy to use and switch between different algorithms. Other libraries such as the aforementioned Keras mimic the scikit-learn syntax, which makes learning how to use other libraries easier.
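The pattern is the same across estimators: instantiate, fit, then predict or score. A minimal illustration using scikit-learn’s built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Swapping algorithms only changes the class you instantiate
for model in (DecisionTreeClassifier(), KNeighborsClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```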

Wide Range of Algorithms

It offers a comprehensive suite of machine learning algorithms for various tasks, including:

  • Classification (identifying which category an object belongs to)
  • Regression (predicting a continuous-valued attribute associated with an object)
  • Clustering (automatic grouping of similar objects into sets)
  • Dimensionality reduction (reducing the number of random variables to consider)
  • Model selection (comparing, validating, and choosing parameters and models)
  • Preprocessing (feature extraction and normalization)
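Several of these pieces can be chained together. Below is a hedged sketch that combines preprocessing, classification, and model selection in one short pipeline, using a built-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Preprocessing (scaling) + classification packaged as a single estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model selection: 5-fold cross-validated accuracy
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```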

Ease of Use

Scikit-learn is designed with simplicity and ease of use in mind, making it accessible to both beginners and experts in machine learning. This is not only an artifact of scikit-learn being designed well, but also of its being a Python library.

Integration with Other Libraries

It integrates seamlessly with other Python libraries such as NumPy, SciPy, and matplotlib (all of which it is built on). This allows for efficient data manipulation, computation, and visualization.

Community and Documentation

Scikit-learn has a large and active community of users and developers, providing extensive documentation, tutorials, and examples to help users start solving real-world problems. In our experience, no programming language or library has better documentation than scikit-learn.

Performance and Scalability

While scikit-learn may not be optimized for very large datasets or high-performance computing, it offers good performance and scalability for most typical machine learning tasks. For very large datasets and for working in the cloud, there are similar libraries available, such as PySpark’s machine learning library, MLlib.

Using Scikit-Learn: Conclusion

Overall, scikit-learn is a powerful and versatile library. It’s a standard tool for machine learning practitioners and researchers due to its simplicity, flexibility, and wide range of capabilities. Today, it’s hard to be a data scientist working in Python without being comfortable and proficient in scikit-learn. That’s why Flatiron School emphasizes it throughout its curriculum.

Want to Learn About Careers in Data Science?

Learn about data science career paths through our Data Science Bootcamp page. Curious to know what students learn during their time at Flatiron? Join us for our Final Project Showcase and see work from recent grads.

Rejecting the Null Hypothesis Using Confidence Intervals

In an introductory statistics class, there are three main topics that are taught: descriptive statistics and data visualizations, probability and sampling distributions, and statistical inference. Within statistical inference, two key methods are taught, viz. confidence intervals and hypothesis testing. While these two methods are always covered when learning data science and related fields, the relationship between them is rarely properly elucidated.

In this article, we’ll begin by defining and describing each method of statistical inference in turn and along the way, state what statistical inference is, and perhaps more importantly, what it isn’t. Then we’ll describe the relationship between the two. While it is typically the case that confidence intervals are taught before hypothesis testing when learning statistics, we’ll begin with the latter since it will allow us to define statistical significance.

Hypothesis Tests

The purpose of a hypothesis test is to answer whether random chance might be responsible for an observed effect. Hypothesis tests use sample statistics to test a hypothesis about population parameters. The null hypothesis, H0, is a statement that represents the assumed status quo regarding a variable or variables and it is always about a population characteristic. Some of the ways the null hypothesis is typically glossed are: the population variable is equal to a particular value or there is no difference between the population variables. For example:

  • H0: μ = 69 in (The mean height of the population of American men is 69 inches.)
  • H0: p1-p2 = 0 (The difference between the population proportion of women who prefer football over baseball and the population proportion of men who prefer football over baseball is 0.)

Note that the null hypothesis always has the equal sign.

The alternative hypothesis, denoted either H1 or Ha, is the statement that is opposed to the null hypothesis (e.g., the population variable is not equal to a particular value  or there is a difference between the population variables):

  • H1: μ > 69 in (The mean height of the population of American men is greater than 69 inches.)
  • H1: p1-p2 ≠ 0 (The difference between the population proportion of women who prefer football over baseball and the population proportion of men who prefer football over baseball is not 0.)

The alternative hypothesis is typically the claim that the researcher hopes to show and it always contains the strict inequality symbols (‘<’ left-sided or left-tailed, ‘≠’ two-sided or two-tailed, and ‘>’ right-sided or right-tailed).

When carrying out a test of H0 vs. H1, the null hypothesis H0 will be rejected in favor of the alternative hypothesis only if the sample provides convincing evidence that H0 is false. As such, a statistical hypothesis test is only capable of demonstrating strong support for the alternative hypothesis by rejecting the null hypothesis.

When the null hypothesis is not rejected, it does not mean that there is strong support for the null hypothesis (since it was assumed to be true); rather, only that there is not convincing evidence against the null hypothesis. As such, we never use the phrase “accept the null hypothesis.”

In the classical method of performing hypothesis testing, one would have to compute what is called the test statistic and use a table to find the corresponding probability. Happily, due to the advancement of technology, one can use Python (as is done in Flatiron’s Data Science Bootcamp) and get the required value directly using a Python library like statsmodels. This value is the p-value, which is short for probability value.

The p-value is a measure of inconsistency between the hypothesized value for a population characteristic and the observed sample. More precisely, the p-value is the probability, under the assumption that the null hypothesis is true, of obtaining a test statistic value at least as extreme as the one observed. If the p-value is less than or equal to the probability of a Type I error (the level of significance), then we reject the null hypothesis and have sufficient evidence to support the alternative hypothesis.

Typically the probability of a Type I error, ɑ, more commonly known as the level of significance, is set to 0.05, but it is often prudent to set it to smaller values such as 0.01 or 0.001. Thus, if the p-value ≤ ɑ, then we reject the null hypothesis, and we interpret this as saying there is a statistically significant difference between the sample and the hypothesized population value. So if the p-value = 0.03 ≤ 0.05 = ɑ, then we would reject the null hypothesis and have statistical significance, whereas if the p-value = 0.08 > 0.05 = ɑ, then we would fail to reject the null hypothesis and there would not be statistical significance.
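As a hedged illustration of getting a p-value directly in Python, here is a one-sample t-test using SciPy on simulated height data (the sample itself is made up, not a real survey):

```python
import numpy as np
from scipy import stats

# Simulated sample of adult male heights (inches)
rng = np.random.default_rng(1)
sample = rng.normal(loc=69.8, scale=3.0, size=40)

# H0: mu = 69 vs. H1: mu > 69 (right-tailed test)
t_stat, p_value = stats.ttest_1samp(sample, popmean=69, alternative="greater")

alpha = 0.05
print(f"p-value = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```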

Confidence Intervals

The other primary form of statistical inference is the confidence interval. While hypothesis tests are concerned with testing a claim, the purpose of a confidence interval is to estimate an unknown population characteristic. A confidence interval is an interval of plausible values for a population characteristic. It is constructed so that we have a chosen level of confidence that the actual value of the population characteristic will be between the upper and lower endpoints of the open interval.

The structure of an individual confidence interval is the sample estimate of the variable of interest ± the margin of error. The margin of error is the product of a multiplier value and the standard error (s.e.), which is based on the standard deviation and the sample size. The multiplier is where the probability, or level of confidence, is introduced into the formula.

The confidence level is the success rate of the method used to construct a confidence interval. A confidence interval estimating the proportion of American men who state they are an avid fan of the NFL could be (0.40, 0.60) with a 95% level of confidence. The level of confidence is not the probability that the population characteristic is in the confidence interval; rather, it refers to the method that is used to construct the confidence interval.

For example, a 95% confidence level is interpreted as follows: if one constructed 100 confidence intervals using this method, then about 95 of them would contain the true population characteristic.
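As a small sketch, a 95% confidence interval for a sample proportion can be computed with statsmodels; the counts below are invented:

```python
from statsmodels.stats.proportion import proportion_confint

# Invented survey: 200 of 400 American men sampled say they are avid NFL fans
count, nobs = 200, 400

lower, upper = proportion_confint(count, nobs, alpha=0.05, method="normal")
print(f"95% CI for the population proportion: ({lower:.2f}, {upper:.2f})")
```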

Errors and Power

A Type I error, or a false positive, is the error of finding a difference that is not there; the probability of incorrectly rejecting a true null hypothesis is ɑ, the level of significance. It follows that the probability of correctly failing to reject a true null hypothesis is the complement, viz. 1 – ɑ. For a particular hypothesis test, if ɑ = 0.05, then its complement would be 0.95, or 95%.

While we are not going to expand on these ideas, we note two related probabilities. A Type II error, or false negative, is failing to reject a false null hypothesis; the probability of a Type II error is β. The power of a test is the probability of correctly rejecting a false null hypothesis, so power = 1 – β. In common statistical practice, one typically speaks only of the level of significance and the power.

The following table summarizes these ideas, where the column headers refer to what is actually the case but is unknown. (If the truth or falsity of the null hypothesis were truly known, we wouldn’t have to do statistics.)

A table demonstrating the four possible outcomes of a hypothesis test along with their probabilities.

Hypothesis Tests and Confidence Intervals

Since hypothesis tests and confidence intervals are both methods of statistical inference, it is reasonable to wonder if they are equivalent in some way. The answer is yes, which means that we can perform hypothesis testing using confidence intervals.

Returning to the example where we estimated the proportion of American men who are avid fans of the NFL, we had (0.40, 0.60) at a 95% confidence level. As a hypothesis test, we could test H0: p = 0.51 against H1: p ≠ 0.51. Since the null value of 0.51 lies within the confidence interval, we would fail to reject the null hypothesis at ɑ = 0.05.

On the other hand, if the hypotheses were H0: p = 0.61 vs. H1: p ≠ 0.61, then since 0.61 is not in the confidence interval, we can reject the null hypothesis at ɑ = 0.05. Note that the 95% confidence level and the level of significance ɑ = 0.05 = 5% are complements, which corresponds to the “H0 is True” column in the above table.

In general, for a two-sided test, one can reject the null hypothesis if the null value is not in the confidence interval, where the confidence level and the level of significance are complements. For one-sided tests, one can still perform a hypothesis test using a confidence level and a null value, but not only is there an added layer of complexity to this equivalence, it is also best practice to perform two-sided hypothesis tests, since they do not prejudge the direction of the alternative.

In this discussion of hypothesis testing and confidence intervals, we not only understand when these two methods of statistical inference can be equivalent, but now have a deeper understanding of statistical significance itself and therefore, statistical inference.

Learn More About Data Science at Flatiron

The curriculum in our Data Science Bootcamp incorporates the latest technologies, including artificial intelligence (AI) tools. Download the syllabus to see what you can learn, or book a 10-minute call with Admissions to learn about full-time and part-time attendance opportunities.

Kicking It Up a Notch: Exploring Data Analytics in Soccer

Soccer isn’t just a sport; it’s a global phenomenon. From the electrifying energy of packed stadiums to the shared passion of fans across continents, soccer unites the world like no other. Record-breaking viewership for the 2022 World Cup stands as a testament to this undeniable truth. Soccer’s global reach has fueled a data revolution within the sport. Data analytics is rapidly transforming soccer, impacting teams, players, and organizations through increasingly data-driven decisions. With the 2024 Major League Soccer (MLS) season kicking off, let’s look at how data analytics in soccer is changing the game, one insightful analysis at a time.

The use of data analytics in soccer can be loosely broken down into several key areas of focus.

  • Game planning: The meta analysis of games and matches to determine best play strategies
  • Performance: The hard stats, from individual players to teams to full leagues
  • Recruitment: Finding potential player and coaching talent 

Game Planning

Game planning itself can cover a wide range of topics, including player match-ups and game strategy. Analysis in this field involves utilizing previous games and matches to determine strategies against a given team.

Data analytics in soccer

Image source: Soccer Coach Weekly 

Opponent analysis is a tried and true method across all sports. This form of analysis can be very granular, looking at individual players in specific situations, such as penalty kick line-ups. It can also be high level, identifying overall trends and patterns in opponents’ play that can be exploited. The results of such analysis can drastically change how a team or player approaches a game. 

With the advent of new technology being utilized for data analytics in soccer, opponent analysis and game planning are being brought to new levels of complexity. The combination of wearable tracking devices and modern camera technology has opened the floodgates of data collection, resulting in a smorgasbord of play-by-play positional data that analysts can use to inform game planning decisions. The real-time capabilities of image detection and computer vision also allow coaches and staff to make in-the-moment decisions.

As technology continues advancing, avenues for game planning analysis include the utilization of virtual and augmented reality to coordinate, plan, and practice set-pieces and specific plays.

Performance

A major focus of data analytics in soccer is the collection and use of performance data to inform decisions. Front and center is the analysis of player performance to help develop and improve the individual player. Team performance is often analyzed as well, in conjunction with game planning and in the context of specific match-ups. Developing appropriate metrics to quantify performance is a vital part of this equation and is an ever-evolving field. 

a soccer player kicking a ball down a pitch

Image source: Science for Sport

Individual player performance—and the way it is measured—has drastically changed across the lifetime of soccer analytics. Metrics like expected goals now utilize predictive analytics to quantify a player’s near-future performance (in conjunction with secondary statistics that gather a holistic view of a player’s team contributions). Monitoring and evaluating player performance is essential for trainers and coaches in helping improve player strengths and weaknesses.
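As a rough, hedged sketch of the idea behind a metric like expected goals, the snippet below fits a logistic regression to invented shot data; the features, coefficients, and data are entirely hypothetical and not a real xG model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented shot data: distance to goal (m), shot angle (degrees), header flag
rng = np.random.default_rng(7)
X = np.column_stack([
    rng.uniform(5, 35, 1000),    # distance
    rng.uniform(10, 90, 1000),   # angle
    rng.integers(0, 2, 1000),    # 1 if the shot was a header
])

# Invented outcomes: closer, wider-angle, non-header shots score more often
goal_prob = 1 / (1 + np.exp(0.15 * X[:, 0] - 0.03 * X[:, 1] + 0.8 * X[:, 2]))
y = rng.random(1000) < goal_prob

model = LogisticRegression().fit(X, y)

# "Expected goals" for a new shot = predicted probability it results in a goal
new_shot = [[12.0, 45.0, 0]]
print(model.predict_proba(new_shot)[0, 1])
```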

A major advancement has been the introduction of non-intrusive wearable devices that can monitor and collect a player’s vital responses. While there are privacy, consent, and data security concerns when it comes to wearables, they have amazing potential to not only help improve player performance on the field but, more importantly, help players prevent (and recover from) serious injuries.

Predicting player and team performance utilizing machine learning algorithms continues to become more important, and has opened up a whole new avenue in data analytics in soccer when it comes to finding talent.

Recruitment

A vitally important part of a winning strategy is bringing together the right mix of talent across players and staff. Soccer organizations are putting a huge emphasis on academic-driven analytics, especially in regard to talent scouting. Currently, there are an extraordinary number of performance analysts working to aid recruitment efforts in the U.S. MLS. Sports journalist Ben Lyttleton sums it up nicely in the quote below, taken from his article on data and decisions in soccer:

”Today, the most important hire is no longer the 30-goal-a-season striker or an imposing brick wall of a defender. Instead, there’s an arms race for the person who identifies that talent.”

Performance data for professional soccer player Daniel Pereira

Image source: TFA

The Power of Moneyball and Soccernomics

Major shifts in sports analytics occurred following the publication of the books Moneyball in 2003 and Soccernomics in 2009.

Both books expose the power of data in finding undervalued players by highlighting undervalued metrics like niche playing tactics in soccer and on-base percentage in baseball. The end result? Smaller sports organizations can avoid overspending on flashy but less-impactful players and instead focus on acquiring hidden gems with specific skills at lower costs. By embracing data-driven strategies, smaller organizations gain a competitive edge against bigger spenders, proving that efficiency and smart player selection can trump financial muscle.

Predictive analytics has started to play a huge role in this process, with organizations attempting to predict the performance of players, team compositions, and even coaches. Feature engineering is playing a pivotal role in advancing this field. How can we quantify the unquantifiable? For example, how can you measure a player’s relationship with and attitude toward their teammates and coaches? Something so subtle and nuanced has a huge effect on individual and team performance.

Summary

Soccer organizations are placing greater emphasis on utilizing data analytics in their management and recruitment decision-making processes. The inclusion of new technologies to advance the science of data collection allows analysts to capture the minutiae of player performance across many aspects of the game. There is a huge need for intuitive, creative-thinking data analysts within the realm of soccer (and sports as a whole), and their analyses will play a pivotal role in how the game of soccer continues to evolve and thrive. 

Learn Data Science at Flatiron School

Flatiron’s Data Science Bootcamp can put you on the path to a career in the field in as little as 15 weeks. Download our syllabus to see what you can learn, or take a prep course for free. You can also schedule a 10-minute call with our admissions office to learn more about the school and its program.

Additional Reading

Troy Hendrickson: From Sales to Stats Auditor for the NBA

Driven by a passion for sports and a desire to leverage data for deeper insights, Troy Hendrickson attended Flatiron School’s Data Science bootcamp in the hopes of joining his dream industry – professional sports. Read his inspiring story of transformation from coach and salesperson to Stats Auditor at the National Basketball Association (NBA)!

Before Flatiron: What were you doing and why did you decide to switch gears?

Troy’s background in sports management and sales success hinted at his potential in the data-driven world of sports analytics. However, the lack of technical skills held him back from his dream career. “I knew data science could one day lead me to work for a sports franchise,” he says, “and sales wasn’t my passion.”

During Flatiron: What surprised you most about yourself and the learning process during your time at Flatiron School?

Flatiron wasn’t just about learning new skills; it was about self-discovery. “I surprised myself with how much I enjoyed the learning process,” Troy admits. The supportive community and emphasis on practical application fueled his drive. “Flatiron’s values aligned perfectly with mine,” he says, highlighting the school’s focus on grit and a growth mindset.

After Flatiron: What are you most proud of in your new tech career?

After 277 days of dedicated job search, Troy landed his dream role at the NBA. His journey wasn’t without challenges. “Initially, I applied too often,” he reflects. But by strategically leveraging his network and continuously learning, he landed interviews and impressed hiring managers with his data-driven insights and passion for sports. “My NBA Prediction Capstone project was a clincher,” he reveals, showcasing the power of project-based learning at Flatiron.

Inspired by Troy’s story?

Flatiron School can equip you with the skills and confidence to pursue your tech dreams, no matter your background. Join a supportive community of learners and experienced instructors, and step outside your comfort zone like Troy. The future of tech awaits you!

Ready to take charge of your future? Apply Now to join other career changers like Troy in a program that sets you apart from the competition. Read more stories about successful career changes on the Flatiron School blog.

Hyperbolic Tangent Activation Function for Neural Networks

Artificial neural networks are a class of machine learning algorithms. Their creation by Warren McCulloch and Walter Pitts in 1943 was inspired by the human brain and the way that biological neurons signal one another. Neural networks are machine learning algorithms because they analyze data with known labels so they can be trained to recognize examples, such as images, that they have not seen before. For example, in the Data Science Bootcamp at Flatiron School, one learns how to use these networks to determine whether an image shows cancer cells present in a fine needle aspirate (FNA) of a breast mass.

Neural networks are composed of the following layers of nodes (artificial neurons):

  • an input layer
  • one or more hidden layers
  • an output layer

A visual representation of this is on view in the figure below. (All images in the post are from the Flatiron School curriculum unless otherwise noted.)

Visual representation of neural networks with input, hidden, and output layers.

Each node connected to another has an associated weight and threshold. If a node’s output is above the specified threshold value, then the node activates. This activation results in data (the sum of the weighted inputs) traveling from the node to the next layer of nodes. However, if the node is not activated, then it does not pass data along to the next layer. A popular subset of neural networks is deep learning models, which are neural networks that have a large number of hidden layers.

Neural Network Activation Functions

In this post, I would like to focus on the idea of activation, and in particular the hyperbolic tangent as an activation function. Simply put, the activation function decides whether a node should be activated or not. 

In mathematics, it is common practice to start with the simplest model. In this case, the most basic activation functions are linear functions such as y=3x-7 or y=-9x+2. (Yes, this is the y=mx+b that you still likely recall from algebra 1.) 

However, if the activation functions are linear for each layer, then all of the layers would be equivalent to a single layer by what are called linear transformations. It would take us too far afield to discuss linear transformations, but the upshot is that nonlinear activation functions are needed so that the neural network can meaningfully have multiple layers. The most basic nonlinear function that we can think of would be a parabola (y=x^2), which can be seen in the diagram below modeling some real data.

line graph representing quadratic regression best-fit line with augmented data points.

While there are a number of popular activation functions (e.g., Sigmoid/Logistic, ReLU, Leaky ReLU) that all Flatiron Data Science students learn, I’m going to discuss the hyperbolic tangent function for a couple of reasons.

First, it is the default activation function for recurrent layers in Keras, the industry-standard deep learning API written in Python that runs on top of TensorFlow and is taught in detail within the Flatiron School Data Science Bootcamp.

Second, the hyperbolic function is an important function even outside of machine learning and worth learning more about. It should be noted that the hyperbolic tangent is typically denoted as tanh, which to the mathematician looks incomplete since it lacks an argument such as tanh(x). That being said, tanh is the standard way to refer to this activation function, so I’ll refer to it as such.

line graph titled "tanh" with two lines - original (y) and derivative (dy)

Neural Network Hyperbolic Functions

The notation of hyperbolic tangent is pointing at an analog to trigonometric functions. We hopefully recall from trigonometry that tan(x)=sin(x)/cos(x). Similarly, tanh(x)=sinh(x)/cosh(x), where sinh(x) = (e^x-e^-x)/2 and cosh(x) = (e^x+e^-x)/2.

So we can see that hyperbolic sine and hyperbolic cosine are defined in terms of exponential functions. These functions have many properties that are analogous to trigonometric functions, which is why they have the notation that they do. For example, the derivative of tangent is secant squared and the derivative of hyperbolic tangent is hyperbolic secant squared.
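A quick sanity check of these identities with NumPy (a minimal sketch, independent of any neural network):

```python
import numpy as np

x = np.linspace(-3, 3, 7)

sinh = (np.exp(x) - np.exp(-x)) / 2
cosh = (np.exp(x) + np.exp(-x)) / 2

# tanh(x) = sinh(x) / cosh(x)
print(np.allclose(np.tanh(x), sinh / cosh))        # True

# d/dx tanh(x) = sech^2(x) = 1 - tanh^2(x)
derivative = 1 - np.tanh(x) ** 2
print(np.allclose(derivative, 1 / cosh ** 2))      # True
```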

The most famous example of a hyperbolic function is the Gateway Arch in St. Louis, MO. The arch, technically a catenary, was created with an equation that contains the hyperbolic cosine.

Image of the Gateway Arch in St. Louis, MO

(Note: This image is in the public domain)

High voltage transmission lines are also catenaries. The formula for the description of ocean waves not only uses a hyperbolic function but, like our activation function, uses tanh.

Hyperbolic Activation Functions

Hyperbolic tangent is a sigmoidal (s-shaped) function like the aforementioned logistic sigmoid function. Where the logistic sigmoid function has outputs between 0 and 1, the hyperbolic tangent has output values between -1 and 1. 

This leads to the following advantages over the logistic sigmoid function. The range of (-1, 1) means that:

  • negative inputs are mapped strongly negative, zero inputs are mapped near zero, and positive inputs are mapped strongly positive on the tanh graph
  • each layer’s output is more or less centered around 0 at the beginning of training, which often helps speed up convergence

The hyperbolic tangent is a popular activation function with many nice mathematical properties; it is often used for binary classification and in conjunction with other activation functions.
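As a minimal, hedged sketch of putting tanh to work in Keras, the toy network below uses tanh in its hidden layers; the layer sizes and input shape are arbitrary choices for illustration:

```python
import tensorflow as tf

# A tiny network whose hidden layers use the tanh activation
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30,)),              # e.g., 30 input features
    tf.keras.layers.Dense(16, activation="tanh"),
    tf.keras.layers.Dense(8, activation="tanh"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```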

Interested in Learning More About Data Science?

Discover information about possible career paths (plus average salaries), student success stories, and upcoming course start dates by visiting Flatiron’s Data Science Bootcamp page. From this page, you can also download the syllabus and gain access to course prep work to get a better understanding of what you can learn in the program, which offers full-time, part-time, and fully online enrollment opportunities.

Why Do We Need Statistics for Data Science?

I began the preceding post “Learning Mathematics and Statistics for Data Science” with the following definition: Data science is used to extract meaningful insights from large data sets. It is a multidisciplinary approach that combines elements of mathematics, statistics, artificial intelligence, and computer engineering. Previously, I described why we need mathematics for data science and in this article I’ll answer the companion question: 

Why do we need statistics for data science? 

However, before we can turn to that question, we need to talk about statistics in general.

What is Statistics?

Statistics is the study of variation of a population using a sample of the population. 

For example, suppose we want to know the average height of American adult men. It is impractical to survey all of the approximately 80 million adult men in the United States, so we’d just survey the heights of a (random) sample of them. This leads us to the two types of statistics that we need to know for data science, viz. descriptive statistics and inferential statistics.

The Two Types of Statistics for Data Science

Descriptive statistics is the branch of statistics that includes methods for organizing and summarizing data. Inferential statistics is the branch of statistics that involves generalizing from a sample to the population from which the sample was selected and assessing the reliability of such generalizations. Let’s look at some examples of each.

Descriptive Statistics

Data can be represented with either visuals or values. The following graph is a histogram of the distribution of heights of American adult men in inches.

graph chart showing the bell curve of the distribution of height of american men.

You are likely familiar with common descriptive statistics such as means, medians, and standard deviations. For example, the average height of American men is 69 inches. While there are more sophisticated descriptive statistics than these, they all serve the same purpose: to describe the data in some way.

Inferential Statistics

Inferential statistics uses descriptive statistics and probability to draw inferences regarding the data. The most common types of inferential statistics are confidence intervals and hypothesis testing. 

Confidence intervals allow us to estimate an unknown population value (e.g., the height of American men in inches). 

A hypothesis test is a method that helps decide whether the data lends sufficient evidence to support a particular hypothesis; for example, is the average height of American men greater than 69 inches?

Returning to our definition of statistics, we can see that the fundamental issue that statistics deals with, whether it is descriptive or inferential, is variation in the sample data, and using that sample data to draw conclusions about the population or populations of interest.

While statistics is used in many disciplines and applications, the way that data science uses statistics has some unique attributes. That said, what we have described forms the basis of the statistics that are used in data science. So let’s turn to that. I will note, of course, that we’re speaking in generalizations (the details are discussed in programs like the Flatiron School Data Science program). 

There are three primary goals when using statistics in data science.

  1. Regression
    • Predicting an attribute associated with an object
  2. Classification
    • Identifying which category an object belongs to
  3. Clustering
    • Automatic grouping of similar objects

Statistical Learning in Data Science

Regression, or prediction, models are the oldest of the machine learning models; in fact, they predate computers. The basic version (simple linear regression) of these models is taught in introductory statistics classes. In the case of linear regression, the idea is to find a line that best fits two-variable data.

Line graph example with sample data points
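A minimal sketch of fitting such a line with scikit-learn, using synthetic two-variable data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic two-variable data with a roughly linear relationship
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50).reshape(-1, 1)
y = 2.5 * x.ravel() + 4 + rng.normal(0, 1.5, 50)

model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
```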

For other more sophisticated models, the idea is the same. For example, it could be the case that the data can be better modeled by a function other than a line such as a quadratic curve, as seen in the below image.

Line graph showing quadratic regression best-fit line with actual data points

Classification models are used to determine which class a particular datum belongs to. The canonical example comes from the iris dataset, where the data contains the sepal length (cm), sepal width (cm), and the three classes of irises: Setosa, Versicolor, and Virginica. The popular K-nearest neighbors model classifies each iris based on the sepal measurements, as can be seen in the below image.

chart showing 3-class classifications
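A hedged sketch of that classifier in scikit-learn, using only the two sepal measurements as in the plot above; the new flower’s measurements are made up:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, :2]   # sepal length and sepal width only
y = iris.target        # 0 = Setosa, 1 = Versicolor, 2 = Virginica

knn = KNeighborsClassifier(n_neighbors=15).fit(X, y)

# Classify a new flower with sepal length 5.0 cm and sepal width 3.5 cm
print(iris.target_names[knn.predict([[5.0, 3.5]])[0]])
```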

Clustering models seem similar to classification models since the algorithms also group data together; however, there is a fundamental difference. Classification models are an example of supervised learning. In other words, the machine learning algorithm is able to train on data with known outcomes (known as labels) so it can classify new data. Clustering models are an example of unsupervised learning, where the algorithm determines how to group the data. Clustering models like K-means separate the data into groups with similar statistical properties.

K-means clustering on the digits dataset (PCA-reduced data). Centroids are marked with a white cross.

A common place where K-means is used is for customer or market segmentation of data, which is the process of clustering a customer base into distinct clusters of individuals that have similar characteristics.
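A minimal, hedged sketch of K-means for this kind of segmentation, using invented customer features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented customer data: [age, annual spend in dollars]
rng = np.random.default_rng(3)
customers = np.column_stack([
    rng.integers(18, 70, 300),
    rng.normal(2000, 600, 300),
])

# Scale features so age and spend contribute comparably, then cluster
scaled = StandardScaler().fit_transform(customers)
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)

print(np.bincount(segments))   # number of customers in each segment
```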

The reason we need statistics for data science is that statistics lets us understand the data, through both descriptive and inferential statistics. Further, in order to use the power of artificial intelligence, we need to be able to use statistical learning techniques.

Ready To Get Started In Data Science?

As a next step, we’d encourage you to Try out our Free Data Science Prep Work to see if data science is right for you.

If you realize you like it, apply today to start learning the skills you need to become a professional Data Scientist.

Not sure if you can do it? Read stories about students just like you who successfully changed careers on the Flatiron School blog.

Learning Mathematics and Statistics for Data Science

Amazon Web Services helpfully informs us that data science extracts meaningful insights from large data sets and that it is a multidisciplinary approach that combines elements of mathematics, statistics, artificial intelligence, and computer engineering. The Bureau of Labor Statistics reports that data scientist is tracking to be the fastest growing data-related occupation through 2031. 

For someone who has a desire to learn data science, it can understandably be quite daunting to learn the broad and technical skills needed to become a successful data scientist. In particular, those of us considering becoming a data scientist often have reservations about the mathematical and statistical knowledge needed. While studying data science, one uses calculus; probability and probability distributions; descriptive and inferential statistics, including linear and logistic regression; and linear algebra (among additional mathematical and statistical concepts).

Data visualization

This blog post is the first in a series that looks at the mathematics and statistics needed to succeed in data science. This inaugural post considers this fundamental query: why do we need mathematics for data science? The sequel will appraise the direct follow-up question: why do we need statistics for data science?

The remainder of this blog series will examine particular examples of mathematical or statistical concepts in use in data science.

Mathematical Models and Treating Data Science as a Science

So why do we need mathematics for data science? An obvious answer is that data is often numbers (called numeric or quantitative data), e.g., the number of siblings one has or the height (in inches) of NBA players. While this is certainly true, and speaks to the first word of the phrase “data science,” the overarching reason the student of data science must be conversant in a myriad of fields of mathematics is because data science is a science. One of the hallmarks of a science is that it can be explained using mathematical models with the germane notation.

Newton's Second Law of Motion

Second Law Of Motion

One of the most famous examples of a mathematical model comes from 1687 and it is still fundamental to the study of classical mechanics. Newton’s Second Law of Motion states that a force is equal to mass times acceleration, or in terms of a mathematical model using the appropriate notation:

Mathematical equation of Newton's second law of motion

(The “bar” here represents what we call a vector.)

The purpose with this example is not to dwell on what vectors are in particular or what physics is in general, but to highlight the importance of mathematics to science. Newton’s idea of the relationship between force and the product of mass and acceleration becomes understandable and usable by others when put into mathematical notation.

The goal of the data scientist is the same as that of the physicist, but instead of trying to understand how the universe works, the data scientist wants to understand how the data works. Among other goals, the data scientist wants to:

  • Classify which category an object belongs to
  • Predict an attribute associated with an object
  • Automatically cluster similar objects into groups

All of these can only be done with mathematical models as the foundation of the algorithms that are implemented in the programming language of the data scientist’s choice (typically Python or R).

A Toy Example

Let’s consider a toy example to close this blog post, but one that we’ll return to with a follow-up post. While any memory of what was covered in beginning algebra may be quite hazy for us, it is typically recalled that y = mx + b represents the equation of the line, where m and b are, respectively, the slope (“rise over run”) and y-coordinate of the y-intercept (“where the line crosses the vertical axis”) of a line.

Suppose we have a slope that runs 5 units in the positive x direction for every 9 units it rises in the positive y direction, i.e.,

m = 9/5 = 1.8

Further, let’s assume that the y-intercept is (0, 32).

Data science graph

Thus, the equation of this line is y = 1.8x + 32, which you can see above. This likely seems a relatively innocuous line, and perhaps even arbitrarily constructed. But let’s think about it for a moment by replacing x with some values, as can be seen in this table. We can see that if we let x equal 0, then y is equal to 32, and so on.

Data science graph

Temperature Conversion Equation

Perhaps we don’t quite see the relevance yet, so let’s rename the two variables and rewrite the slope as a fraction:

F = (9/5)C + 32

Yes, we see it now. This line models the relationship between Fahrenheit and Celsius.
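As a tiny Python sketch of the same relationship (the helper function below is ours for illustration, not part of the original discussion):

```python
def celsius_to_fahrenheit(celsius):
    """The line y = 1.8x + 32, with x in degrees Celsius and y in degrees Fahrenheit."""
    return 1.8 * celsius + 32

for c in (0, 20, 37, 100):
    print(f"{c} C = {celsius_to_fahrenheit(c)} F")
```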

While we can use the equation of a line to model a formulaic relationship between two different units of temperature, what is the moral of the story for this first post? The essential idea is that we are able to use a line to model a relationship between two variables of interest. In this case, the variables were Fahrenheit and Celsius, but we can also use a so-called regression line to model two variables that have come from data that are not perfectly linear (see below; we’ll return to this idea in a future post). Ergo, mathematics allows us to model data, which is why we need mathematics in data science.

Data points on a regression line

Taylor Swift and Data Science: An Unlikely Duo

Data is everywhere, but one thing that might be more ubiquitous than data is Taylor Swift. The recent article “Taylor’s Towering Year”—authored by Posit (formerly RStudio)—illustrates several ways in which the two are not mutually exclusive by showing the data behind her record-breaking Eras Tour. In the article, they break down the tour’s staggering ticket sales, profound effect on worldwide economies, and boost in popularity for Taylor’s opening acts. Let’s discuss how Posit accomplished this and show you a concert tour visualization of our own.

Quarto

First released in early 2021, Quarto, the tool behind the Eras Tour article, is an open-source publishing system designed to weave prose and code output into dynamic documents, presentations, dashboards, and more. Paired with a variety of ways to publish and share your content, it is an excellent platform for data storytelling. 

Deciding whether to learn R vs. Python is a well-covered topic and one often prone to heated debate. In Quarto, there’s no “Bad Blood” between the two popular programming languages: you can choose to run your project in R, Python, or both. It’s also compatible with the Julia and Observable JS languages, as well as many of the most popular integrated development environments (IDEs) used in the field of data science, like VS Code, Jupyter, and RStudio. This flexibility means data scientists can collaborate on projects using the tools of their choice.

How Quarto Generated the Eras Tour Data

Notice the “See the code in R” link in the left sidebar of Posit’s article; it takes you to a virtually identical page. The key difference is that this page allows you to see the code behind the data collection and visualizations. We won’t go line-by-line, but let’s look at the high-level steps they took to craft the “GDP of Taylor” data visualization toward the top of the article.

Data Collection

Expand the “See R code” section just above “The GDP of Taylor” visualization to see the first code chunk, where Posit starts by web scraping the Wikipedia page for nominal GDP by country. Web scraping is a technique in which you write code to visit a website and return information or data. Be sure to read a website’s terms and conditions, as well as its robots.txt file, which tells you what information you may scrape.

Data Cleaning

Since Taylor was estimated to have stimulated the economy by over $6 billion, the collected data is filtered to countries with GDPs between $4 and $10 billion for comparisons of similar magnitude. Next, Posit plots the map and GDP of each of those eight countries using the R library ggplot2. Lastly, they stitch everything together, with Taylor’s image and economic impact in the center, using the cowplot library. By starting with several discrete plots and organizing them together, they are able to create an infographic that puts the Eras Tour in shocking perspective.
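For readers who prefer Python to R, a rough sketch of the same collect-and-filter step might look like the code below; the table index and column name are placeholders you would need to confirm by inspecting the scraped tables:

```python
import pandas as pd

# Scrape the Wikipedia tables of nominal GDP by country (requires lxml installed)
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
tables = pd.read_html(url)

# Inspect the scraped tables to find the right one and its column labels
gdp = tables[0]            # placeholder index
print(gdp.columns)

# Then keep economies between $4 billion and $10 billion, e.g. (placeholder column name):
# similar = gdp[gdp["GDP (US$ million)"].between(4_000, 10_000)]
```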

This is a great example of data science in action. As data scientists we’re often asked questions or have hypotheses but are not handed a tidy dataset. Instead, we must connect to an API or find data online, automate the process of collecting it, and manipulate it into a format that will be conducive to our analysis. Data collection and cleaning are often the iceberg below the surface while visualizations and predictive models are the parts everyone can see. Without good data, it’s incredibly difficult to produce insightful analyses.

Flatiron’s Highest-Grossing Concert Tours Data Visualization

Like Posit, we collected the data from the List of highest-grossing concert tours page on Wikipedia. Instead of a static chart, we created a bar chart race—a fun way to visualize data changing over time using animation. Below we have the highest single-year tours by gross revenue from 1993 to 2023. 

A gif showing the highest-grossing musical tours by year.

The Rolling Stones and U2 tours held most of the top five spots for a majority of the past 30 years. That is, until the 2023 Eras Tour nearly doubled the $617 million grossed by the A Bigger Bang Tour—the 17-year record-holder set by the Stones in 2006. Interestingly, Taylor Swift is the first female solo artist to crack the list since Madonna’s The MDNA Tour in 2012. With the Eras Tour projected to bring in another $1 billion in 2024, Taylor Swift may take the top two spots come end of year.

This analysis was originally created in our own internal Quarto project at Flatiron School and copied over here onto our blog. Give Quarto a try and you might just tell Jupyter notebooks and RMarkdown, “We Are Never Ever Getting Back Together.”

Header image credited to Posit

Jasmine Huang: Business to Data Science

Jasmine Huang, a February 2023 Data Science graduate, spent the first decade of her career working in customer-facing roles. After attending Flatiron School, she’s finally working with what she loves – numbers and data!

Background

Jasmine Huang’s career began in business – first earning an MBA in finance and risk management, and then working in banking, real estate, and international purchasing. Despite the technical aspects of the industries, her positions were heavily customer-facing.

“Everything I did in the past was sales and customer relationship-focused,” she said, “and I was burnt out from handling people’s problems.” 

After 12 years in similar positions, she decided to follow an interest in numbers to a new career path in Data Science. 

“I always thought I could enjoy working with data and numbers since I have always been very good with Excel,” she explained. “I wanted to enhance my analytical ability.”

Flatiron School Experience

Switching careers as a full-time working mom wouldn’t be easy though – a fact Jasmine was well aware of. The time constraints of family and employment led her to an accelerated bootcamp program, designed to expedite career changes.  

“I’m a mom of a 6-year-old boy so I don’t have much free time for myself,” she said. “Joining a bootcamp for 4 months as a full-time student was the longest I could manage. If the program lasted longer than 4 months, I don’t think I could have made it work.” 

With her eyes set on graduating in 4 months, Jasmine joined Flatiron School’s Data Science Live program. In the accelerated Live course, students attend class full-time to be ready to apply to industry roles in just 15 weeks. Like most learners, transitioning into the fast-paced program initially led to some growing pains for Jasmine. 

“Learning so much in a very short time was hard,” Jasmine admitted, “but I loved everything in the program. The projects especially were realistic and useful.”

After completing the program in February 2023, Jasmine jumped directly into the job search – supported by her dedicated Flatiron School Career Coach.

“Job searching was definitely harder than the bootcamp,” she recalled. “It’s a full-time job to look for a job! My career coach was awesome and always there when I needed advice or interview practice. They helped me stay on track.”

A few weeks later, Jasmine landed her first position in data science as an Actuarial Data Analyst at Verus Specialty Insurance, an underwriting management company. 

Working In The Field

Coming up on 9 months in her new profession, Jasmine only has good things to report.

“I love my job as an Actuarial Data Analyst, it’s exactly what I dreamed of.”

She’s also had the opportunity to leverage her previous experience in her new career.

“With my background in Insurance and the new skills learned in the bootcamp, I made a project to provide a solution to solve a real business problem from a unique point of view.”

Reflecting On Her Journey

Looking back on her career path so far, Jasmine’s main takeaway is the importance of using the tools at one’s disposal to problem solve.

“The bootcamp taught me how to find answers on my own,” she said. “Knowing how to google and what to look for is the key to success!”

As for her advice to current Flatiron students, Jasmine recommends leaning on available resources during the job search. 

“Graduating from the bootcamp is easier than you think. The hard part is to find a job you like. If you follow Flatiron’s instructions and guidance and work hard, you will definitely find a job close enough to your dream job. It will all be worth it!”

Ready For A Change, Just Like Jasmine Huang?

Apply Now to join other career changers like Jasmine in a program that sets you apart from the competition. 

Need more time to be ready to apply? Try out our Free Data Science Prep Work and test-run the material we teach in the course. Or, review the Data Science Course Syllabus that will set you up for success and help launch your new career.

Read more stories about successful career changes on the Flatiron School blog.