Navigating the fast-changing currents of the tech sector can challenge even the strongest sailor. The winds change, the ocean swells, and suddenly a ship that seemed completely seaworthy is wrecked on the rocks of irrelevance. New technologies emerge from the storm every day, promising grand sights just over the horizon. Nowhere is this more true than with the two racing yachts of applied data: data science and artificial intelligence (AI).
These two apparent competitors have traded off the lead several times, at least in the imaginations of the tech press and blogosphere. Five years ago, “data scientist” was still the sexiest career of the 21st century. Now, the headline writers seem to think that data science will be rendered obsolete by the advent of AI. Someone who hadn’t been paying any attention at all might even think at this point that “data science” and “AI” are interchangeable. The reality, though, is that AI is really just an application of data science – a set of tools and methods for making data useful.
What is Artificial Intelligence?
Before going any further, I’d like to make clear that for the purposes of this article, “AI” refers to the generative models that have captured the public imagination over the last two years – applications like OpenAI’s ChatGPT and DALL-E. The applications described as “AI” in the news are algorithms trained for specific tasks on specific datasets. If an algorithm has been trained to play chess, it will play chess exactly as well as it has been trained to. It will not be able to carry on a human-sounding conversation about chess, or even explain why it makes its moves. That said, if you haven’t played around with ChatGPT or DALL-E or any of these generative models, you should. They’re really a blast!
The Modern Flood Of Data
The world is drowning in data. Estimates put daily data generation globally at 328.77 quintillion bytes. (A quintillion is a 1 followed by 18 zeros!) That’s roughly 329 million 1 TB iPhone 13 Pros, the largest configuration sold, or close to one for every person in the United States. It is practically beyond conception. That volume represents everything: every video, picture, email, and spreadsheet. Every like, favorite, and follow is recorded and stored, but so is every action taken by an Internet of Things (IoT) device like your doorbell video camera or your fitness tracker.
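For anyone who wants to check the phone comparison, it is a single division. The short Python sketch below assumes that a "maxed-out" phone holds 1 TB and that a terabyte is counted as 10^12 bytes; both are my assumptions for the sake of the arithmetic, not figures from the estimate itself.

```python
# Daily data generation quoted above: 328.77 quintillion bytes.
BYTES_PER_DAY = 328.77e18

# Assumption: one maxed-out phone stores 1 TB, counted as 10**12 bytes.
BYTES_PER_PHONE = 10**12

phones_per_day = BYTES_PER_DAY / BYTES_PER_PHONE
print(f"{phones_per_day / 1e6:.2f} million phones filled per day")
```

Counting a terabyte as 2^40 bytes instead would lower the figure somewhat, but it would still be in the hundreds of millions.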
Data sitting in a data warehouse doesn’t do anyone any good, though. Someone has to make sense of it and make it useful. That is what data science does. Using code, statistics, and machine learning, data science is the field concerned with extracting knowledge and insight from the floods of data running around the world.
What Does A Data Scientist Do?
We can think of the data scientist as a detective. Clad in a deerstalker cap and Inverness cape or, more likely, a hoodie and noise-canceling headphones, they meticulously sift through the mountains of data looking for gems of insight and understanding. The data scientist uses their analytical skills and abilities, and old-fashioned human curiosity and intuition, augmented by machine learning algorithms and statistical analysis, to pose interesting questions of data. They develop inferential models that can help companies better understand and serve their customers. And they build predictive models that ensure customers find what they are looking for before they know they’re looking for it!
In broad strokes, a data scientist’s job is to make sense of data, to make it useful in some context. And while data science is a tremendous amount of fun to do on its own, data scientists earn our salaries by being useful to businesses or other enterprises.
Data Science In Action
Let’s take a look at a hypothetical project involving online shopping and customer service. This example will take us through a data science workflow, even up to the construction of a generative model that could help a retailer improve its customer experience. (It should go without saying that not every data scientist can do everything described in this example. This is a rather fanciful scenario to illustrate the vast breadth of work that falls under the heading of data science.)
Imagine a large online retailer called Congo. As a retail platform, Congo wants customers to make as many purchases as possible on its platform, in part by providing a positive customer experience.
Asking A Question
A data scientist, charged with improving customer service, begins their work with a question. The question could be simple, like “What do people contact customer support about most often?” It could be more complex, like “How can we improve customer satisfaction while reducing the amount of time a representative spends on a ticket?” The important thing is to have a clear question that, in theory, can be answered with data.
Collecting Data
Along with selling products to customers, Congo collects information about what people are doing on its website. Every interaction, from the first log-in to the final purchase, is tracked and recorded. This goes for interactions with customer support as well as regular interactions. The data scientist may have access to all of this information, but will usually make their first forays into answering their question on a portion of the data, a dataset. This dataset is pulled from whatever database, data warehouse, or data lake Congo uses, and refined using SQL or some other query language.
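To make that step a little more concrete, here is a minimal sketch of carving a dataset out of a larger store with SQL. It uses Python's built-in sqlite3 module and an in-memory database as a stand-in for Congo's warehouse. The table name, columns, and rows are all invented for illustration; a real warehouse would look quite different.

```python
import sqlite3

# In-memory stand-in for Congo's data warehouse (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE interactions (
        customer_id INTEGER,
        channel     TEXT,   -- 'web' or 'support'
        opened_at   TEXT,   -- ISO 8601 date
        message     TEXT
    )
    """
)

# A handful of made-up rows so the query has something to return.
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?, ?)",
    [
        (1, "web", "2024-01-03", "viewed product page"),
        (1, "support", "2024-01-04", "My order arrived damaged"),
        (2, "support", "2023-12-20", "Where is my refund?"),
        (3, "support", "2024-01-09", "How do I change my address?"),
        (3, "web", "2024-01-10", "completed purchase"),
    ],
)

# Refine the full table down to the dataset of interest:
# support interactions opened on or after a chosen date.
query = """
    SELECT customer_id, opened_at, message
    FROM interactions
    WHERE channel = ?
      AND opened_at >= ?
    ORDER BY opened_at
"""
support_dataset = conn.execute(query, ("support", "2024-01-01")).fetchall()

for row in support_dataset:
    print(row)
```

The placeholders (`?`) keep the filter values out of the SQL string itself, which is a good habit even in throwaway exploratory queries.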
Initial Analysis
Once they have data in hand, the data scientist can begin analyzing it for patterns. The specifics of how they do this depend on their tech stack, but also on the question they are asking. The two main stacks (‘stack’ is tech-speak for a person’s or organization’s preferred collection of programming tools and frameworks) for data science at the moment build on the R and Python programming languages. Other languages, like Rust and Julia, are growing in popularity, but have not yet gained major holds in the field.
The initial stages of answering a question tend to look the same. A data scientist needs to get a handle on what is actually in their dataset, and they almost invariably start with descriptive analysis and data visualization. Descriptive analysis is built on the basic calculations of things like mean, median, and mode, as well as variance and standard deviation. Unfortunately, numbers alone don’t tell a complete story, so the data scientist will also likely produce a number of visualizations: scatter plots, histograms, and heatmaps for correlation coefficients.
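As a rough illustration of descriptive analysis, the sketch below summarizes a small, made-up list of ticket closing times using Python's standard statistics module, then prints a crude text histogram in place of a proper chart. The numbers are hypothetical, and in practice a data scientist would more likely reach for a plotting library.

```python
import statistics
from collections import Counter

# Hypothetical hours-to-close for ten support tickets.
hours_to_close = [2, 3, 3, 4, 5, 5, 5, 8, 12, 30]

summary = {
    "mean": statistics.mean(hours_to_close),
    "median": statistics.median(hours_to_close),
    "mode": statistics.mode(hours_to_close),
    "variance": statistics.variance(hours_to_close),  # sample variance
    "std_dev": statistics.stdev(hours_to_close),      # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")

# A text-only histogram: one '#' per ticket at each closing time.
for hours, count in sorted(Counter(hours_to_close).items()):
    print(f"{hours:>2} h | {'#' * count}")
```

With these invented numbers, the mean (7.7 hours) sits well above the median (5 hours) because a single 30-hour ticket drags it upward. That is exactly the kind of thing a histogram makes visible at a glance.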
Natural Language Processing
Since the questions our data scientist is attempting to answer involve things like problems submitted to customer support, they have to use an area of data science called natural language processing (NLP) in order to get the data into a format that can be analyzed. Computers, for now, do not really know what to do with natural language. If you were to give a regular scripting language, or even Excel, a plain sentence like “The quick brown fox jumps over the lazy dog” and ask it to sum the words, you would get an error. NLP lets data scientists get around this problem, among other things, by creating computer-friendly representations of the words, called tokens.
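Here is a small sketch of that idea in Python. It first shows the error you get from trying to sum the words of a sentence, then turns the sentence into tokens and integer IDs that a computer can actually work with. The tokenizer is deliberately simplistic (lowercase, then split on anything that isn't a letter or apostrophe) and is my own illustration; the tokenizers used in real NLP libraries are considerably more sophisticated.

```python
import re
from collections import Counter

sentence = "The quick brown fox jumps over the lazy dog"

# Plain arithmetic on words fails: Python cannot add a number to a string.
try:
    sum(sentence.split())
except TypeError as err:
    print("sum() failed:", err)

# A deliberately simple tokenizer: lowercase the text, then keep runs of
# letters and apostrophes.
tokens = re.findall(r"[a-z']+", sentence.lower())

# Give each distinct token an integer ID so it can be counted and compared.
vocabulary = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocabulary[token] for token in tokens]
counts = Counter(tokens)

print(tokens)
print(token_ids)
print(counts.most_common(1))
```

Once the sentence is a list of IDs and counts, the usual numerical tools apply: "the" shows up twice here, and every other word once.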
If the question is a simple one, like “What is the most common complaint on a Tuesday?” or “How long does the average ticket take to get closed?” the data scientist’s work is likely to end with this initial analysis. (In point of fact, a question this simple is more likely to go to a business analyst or data analyst than to a data scientist, but a good data scientist has to be familiar with every stage of the process.)
Remember that the data scientist’s ultimate goal in this scenario is to improve Congo’s customer service. A more complicated question, like “How can we improve time-to-close on our customer support tickets?” requires more advanced tools and techniques than simple means and standard deviations. When looking for answers to a question like this, the data scientist is likely to turn to machine learning.
Using Machine Learning To Answer Complex Questions
Machine learning (ML) is a set of algorithms and processes that enable computers to ‘learn’ by recognizing patterns in data and using those patterns to identify new examples of that pattern. Keeping with our customer service example: if we showed a computer thousands of complaints (the data) and told it how to look at those complaints (the algorithm), it would pick out all the things that are similar among those complaints and use those similarities to identify complaints in new customer service requests. This particular example of machine learning is called “classification,” which is sorting things into groups based on common attributes. Depending on how it is implemented, it could also involve an NLP task called “sentiment analysis,” which evaluates the emotional content of text.
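To show what "learning from examples" can look like, here is a toy complaint classifier written from scratch in Python. It is a sketch of one classic technique, multinomial naive Bayes with add-one (Laplace) smoothing, and not necessarily what a team like Congo's would deploy. The training tickets, labels, and function names are all invented for illustration, and eight examples is far too few for real use.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical labeled tickets: the "thousands of complaints" in miniature.
training_tickets = [
    ("My order arrived broken and I am furious", "complaint"),
    ("This is the worst service I have ever had", "complaint"),
    ("The package is late again and nobody answers", "complaint"),
    ("I was charged twice and I want my money back", "complaint"),
    ("How do I change my delivery address", "other"),
    ("Can you tell me when the sale starts", "other"),
    ("Please update the email on my account", "other"),
    ("What are your opening hours for pickup", "other"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior: how common is this label overall?
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # skip words never seen during training
            # Add-one smoothing keeps unseen word/label pairs above zero.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training_tickets)
print(classify("My package arrived broken and late", model))
print(classify("How do I update my account", model))
```

On this tiny dataset the first message is labeled a complaint and the second is not, because words like "broken" and "late" only ever appeared in the complaint examples. The "algorithm" in the paragraph above corresponds to the counting and scoring rules; the "data" is the list of labeled tickets.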
Depending on how far our data scientist wanted to go, an ML-based filter identifying customer service tickets that are simply complaints could be sufficient for the task. With the filter in place, the customer service team could create a protocol for dealing with messages flagged as ‘complaint’ and improve their clearance time that way.
But let’s say that our data scientist is not satisfied with good enough. With their new ML-enabled filter, the data scientist could process every customer support ticket Congo has ever received, effectively creating a new dataset of complaints. They could build a new algorithm on the complaints data to identify which complaints were resolved successfully, based on customer feedback for the interaction, and use the results of that analysis to inform a response protocol for customer service representatives to use.
The Role Of AI Assistants
Having analyzed at length what does and does not make a customer complaint interaction successful, the data scientist may even find it worthwhile to create an AI assistant just for handling complaints. Something like ChatGPT (GPT stands for ‘Generative Pre-Trained Transformer’) requires a mind-boggling volume of data to train effectively. A recent estimate put its training data at around 600GB. For comparison, all of the English-language text on Wikipedia comes to about 25GB. But with sufficient data, and a good model selected, a data scientist could build a proof-of-concept assistant without too much difficulty. Of course, scaling it to support an entire customer service team would require the services of data engineers, software developers, and a host of other technical specialists. It may even turn out to be too expensive to implement on a straight cost-benefit analysis, but it is absolutely within the realm of possibility.
The Future Of Artificial Intelligence vs. Data Science
Data science and AI exist in an elegant symbiosis. AI requires enormous amounts of data to produce its wondrous creations. It uses algorithms to learn the patterns contained in trillions of words written by actual humans. All of that data requires processing, selection, and analysis by data scientists, who also have to observe the AI’s behavior and monitor it for flaws. Data scientists, furthermore, develop the algorithms and models that help drive improvements to AI. Although it might seem like the advent of generative models has rendered the data scientist obsolete, or at least endangered, the reality is that a good data scientist will learn how to use the strengths of the generative models to help with their own work.
About Charlie Rice
Charlie Rice is a Data Science Instructor with the Flatiron School. A former journalist, he got interested in data science when an editor showed him a news story written by a computer. Since making the switch eight years ago, he’s been involved in blockchain research, FinTech development, and helping to develop the forthcoming CompTIA Data Science certification.