Data science, the field that uses statistics and artificial intelligence to discover actionable insights from data, comes with a myriad of terms. For example, banks use data science for fraud detection, and streaming services use it for content recommendations. This post focuses on some of the key terms from statistics that are commonly used within data science and then concludes with a few remarks on using data science terminology correctly.
Definitions of Key Data Science Terms
Let’s look at some of the key data science terms that you need to understand.
Numeric and Categorical Data
Data can be either numeric (or quantitative) or categorical (or qualitative). Numeric data represents quantities or amounts. Categorical data represents attributes that can be used to group or label individual items. If a student is a first-generation college student who is taking 17 semester units, then the student’s educational generation is categorical and the number of units is numeric.
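To make the distinction concrete, here is a minimal Python sketch. The student records are invented for illustration and are not real data. It shows the kind of operation that suits each type: counting labels for categorical data and averaging for numeric data.

```python
from collections import Counter
from statistics import mean

# Hypothetical student records, made up for this example.
students = [
    {"generation": "first-generation", "units": 17},
    {"generation": "continuing-generation", "units": 15},
    {"generation": "first-generation", "units": 12},
]

# Categorical data: the values are labels, so we count how often each occurs.
generation_counts = Counter(s["generation"] for s in students)

# Numeric data: the values are quantities, so arithmetic such as averaging makes sense.
average_units = mean(s["units"] for s in students)

print(generation_counts)
print(average_units)
```

Averaging the "generation" column would be meaningless, which is a quick way to check whether a variable is categorical.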
Types of Statistics
The statistical terms you first encounter in data science generally fall within one of the two main branches of statistics, which serve different purposes in the analysis of data: descriptive statistics and inferential statistics.
Descriptive statistics summarize and organize characteristics of a data set. They give a snapshot of the data through numbers, charts, and graphs without making conclusions beyond the data analyzed or making predictions. Inferential statistics, by contrast, use a sample of data to draw conclusions about the larger population that the sample came from.
Descriptive Statistics: Mean, Median, Mode, Standard Deviation, and Correlation
Measures of central tendency provide a central point around which the data is distributed, and measures of variability describe the spread of the data. The two most common measures of central tendency for numeric data are the mean and the median. The most common measure of central tendency for categorical data is the mode. The mean is the average value (sum all of the values and divide by the number of observations). The median is the middle value when the data is sorted (or the average of the two middle values when the number of observations is even), and the mode is the most common value.
Note that while the mode is generally used for categorical data, numeric data can also have modes. Consider the following made-up data set that is listed in order for simplicity: 2, 3, 7, 9, 9. The mode is 9 since it is the only value that shows up more than once. The median is 7 since it is precisely the middle value, and the mean is 30/5 = 6. The most used measure of variability is the standard deviation, which can be thought of as roughly the typical distance between an observation and the mean. In the toy example noted above, the sample standard deviation (the version that divides by n - 1) is about 3.32. So, loosely speaking, a typical number in the data set is about 3.32 away from the mean.
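The toy example can be checked with Python's built-in statistics module. This is a minimal sketch. It computes both the sample standard deviation (dividing by n - 1) and the population standard deviation (dividing by n), since software packages differ in which one they report by default.

```python
import statistics

data = [2, 3, 7, 9, 9]

mean_value = statistics.mean(data)      # sum of values divided by count: 30 / 5
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most common value

# Sample standard deviation: square root of (sum of squared deviations / (n - 1)).
sample_sd = statistics.stdev(data)

# Population standard deviation: square root of (sum of squared deviations / n).
population_sd = statistics.pstdev(data)

print(mean_value, median_value, mode_value)
print(round(sample_sd, 3), round(population_sd, 3))
```

Here the squared deviations from the mean sum to 44, so the sample standard deviation is the square root of 44/4 = 11 and the population standard deviation is the square root of 44/5 = 8.8.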
All of the aforementioned descriptive statistics are for univariate data (i.e., data with only one variable). More often in data science, we look at data that is multivariate. For instance, one could have two variables: the height and weight of NBA players. A descriptive statistic that describes the relationship between these variables is called the correlation. The correlation is a value between -1 and 1 that represents the strength and direction of the linear relationship. Values near 1 or -1 indicate a strong positive or negative relationship, and values near 0 indicate little or no linear relationship.
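As an illustration, the sketch below computes the Pearson correlation coefficient, the most common form of correlation, from its definition. The function name pearson_correlation and the height and weight values are made up for this example and are not real NBA data.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sum of products of paired deviations from each mean.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)

    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable has no variation")

    return co_deviation / math.sqrt(ss_x * ss_y)


# Hypothetical heights (inches) and weights (pounds), invented for illustration.
heights_in = [75, 77, 78, 80, 83]
weights_lb = [190, 205, 215, 225, 250]

r = pearson_correlation(heights_in, weights_lb)
print(round(r, 3))
```

In this made-up sample the taller players are also the heavier ones, so the result is close to 1, indicating a strong positive linear relationship.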
Inferential Statistics: Confidence Intervals and Hypothesis Tests
Now let’s turn to some key terms from inferential statistics that are used in data science. There are two main types of inferential statistics: confidence intervals and hypothesis tests. Confidence intervals give a range of plausible values for an unknown population value. Hypothesis tests determine whether a data set differs significantly from an assumed value for the population at a chosen level of confidence.
For example, a confidence interval estimating the average (mean) height of NBA players could be (75 inches, 81 inches). For a hypothesis test, we could instead claim that the average height of NBA players is 78 inches and then test whether our data differs substantially from that value. If our data set has a sample mean of 74 inches, the result is likely to be statistically significant because the sample mean is so far from the assumed population mean of 78 inches. If instead our data set has a sample mean of 77 inches, the result is less likely to be statistically significant, since the sample mean and the assumed population mean are close. In both cases, the outcome also depends on the sample size and on how much the heights in the sample vary.
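The sketch below illustrates both ideas on two made-up samples, one with a mean of 77 inches and one with a mean of 74 inches. To stay within Python's standard library, it uses a normal (z) approximation. For small samples like these, a t distribution would usually be preferred, so treat the numbers as illustrative only. The function name z_interval_and_test and all data values are assumptions for this example.

```python
import math
from statistics import NormalDist, mean, stdev


def z_interval_and_test(sample, claimed_mean, confidence=0.95):
    """Return (ci_low, ci_high, p_value) using a normal approximation."""
    n = len(sample)
    sample_mean = mean(sample)
    # Standard error: estimated variability of the sample mean.
    standard_error = stdev(sample) / math.sqrt(n)

    # Confidence interval: sample mean plus or minus a critical value times the standard error.
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci_low = sample_mean - z_critical * standard_error
    ci_high = sample_mean + z_critical * standard_error

    # Two-sided test: how surprising is this sample mean if the claim were true?
    z_score = (sample_mean - claimed_mean) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))
    return ci_low, ci_high, p_value


# Hypothetical heights in inches, invented for illustration (not real NBA data).
deviations = [-4, -3, -2, -2, -1, -1, 0, 0, 1, 1, 2, 2, 3, 4]
sample_near = [77 + d for d in deviations]  # sample mean of 77
sample_far = [74 + d for d in deviations]   # sample mean of 74

for name, sample in [("mean 77", sample_near), ("mean 74", sample_far)]:
    low, high, p = z_interval_and_test(sample, claimed_mean=78)
    print(name, (round(low, 2), round(high, 2)), "significant" if p < 0.05 else "not significant")
```

With these made-up numbers, the sample centered at 77 is not significantly different from 78 at the 95% level, while the sample centered at 74 is. A sample with a different size or spread could lead to different conclusions for the same sample means.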
For a much more technical overview of statistical significance, confidence intervals, and hypothesis testing, please see our post “Rejecting the Null Hypothesis Using Confidence Intervals.”
How to Use Data Science Terms Wisely
Time now for an anecdote. A friend of mine—let’s call him Yinzer—was giving a presentation to his boss. He was tasked with presenting descriptive statistics on the company’s data. He included in his presentation a descriptive statistic called the kurtosis since that value was produced by the software. Yinzer’s boss asked him, “What is kurtosis?” Yinzer didn’t know and was unable to answer the question.
The moral of the story is: only use data science terms, such as the mean, median, standard deviation, correlation, and hypothesis testing discussed above, if you are confident that you can explain them.
Some Additional Tips for Using Data Science Terminology
Here are some additional tips for using data science terminology if you are a beginner in the field:
Focus on understanding, not memorizing: Don’t try to memorize every term you encounter. Instead, focus on grasping the underlying concepts and how they relate to each other. This will allow you to learn new terms organically as you progress.
Practice with real data: The best way to solidify your understanding is to apply it. Find beginner-friendly datasets online and use them to practice basic data cleaning, analysis, and visualization. This will expose you to terminology in a practical setting.
Engage with the data science community: Join online forums, attend meetups, or connect with other data science beginners. Discussing concepts and terminology with others can solidify your understanding and expose you to new terms in a collaborative environment.