Understanding Data: A Beginner’s Guide to Data Types and Structures


Explore the essentials of data types and structures in this beginner’s guide. Learn about integers, strings, floats, lists, dictionaries and more.


Coding a function, an app, or a website is an act of creation no different from writing a short story or painting a picture. From very simple tools we create something where there never was something. Painters have pigments. Writers have words. Coders have data types.

Data types govern just about every aspect of coding. Each type represents a specific kind of thing stored in a computer’s memory, and has different ways of being used in writing code. They also range in complexity from the humble integer to the sophisticated dictionary. This article will lay out the basic data types, using Python as the base language, and will also discuss some more advanced data types that are relevant to data analysts and data scientists.

Computers, even with the wondrous innovations of generative AI in the last few years, are still, just as their name suggests, calculators. The earliest computers were people who performed simple and complex calculations faster than the average person. (All the women who aided Alan Turing in cracking codes at Bletchley Park officially held the title of computers.)

A hierarchical map of all the data types in Python. This article only discusses the most common ones: integers, floats, strings, lists, and dictionaries.
Source: Wikipedia

Simple Numbers

It’s appropriate, then, that the first data type we discuss is the humble integer. Integers are the whole numbers, written with the digits 0 to 9 in endless combinations: 0, 42, 999,999,999,999 and beyond, along with their negative counterparts. Different programming languages handle integers differently.

Python supports integers, usually denoted as int, as “arbitrary precision.” This means an int can have as many digits as the computer has memory for. Java, on the other hand, recognizes int as the set of 32-bit integers ranging from -2,147,483,648 to 2,147,483,647. (A bit is the smallest unit of computer information, representing the logical state of on or off, 1 or 0. A 32-bit integer can take 2^32 different values.)
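To see what arbitrary precision means in practice, here is a minimal sketch. The variable names are made up for illustration, and the Java limit is written out by hand for comparison.

```python
# The largest value a 32-bit signed integer (such as Java's int) can hold.
java_int_max = 2**31 - 1
print(java_int_max)               # 2147483647

# Python ints simply keep growing; adding 1 does not overflow.
beyond_32_bit = java_int_max + 1
print(beyond_32_bit)              # 2147483648

# A value far larger than any 64-bit integer is still a plain int.
big = 2**100
print(big)
print(type(big))                  # <class 'int'>
```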

The next step up the complexity ladder brings us to floating point numbers, or more commonly, floats. Floating point numbers approximate real numbers, including those with a fractional part such as 0.5 or 99.999, and again include negative numbers. Unlike Python’s integers, floats do not grow with the computer’s memory: Python implements them as 64-bit numbers, which gives roughly 15 to 17 significant decimal digits of precision.
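Because floats are approximations, some results look slightly off. The sketch below shows the classic case and one common way to compare floats safely. It assumes the usual 64-bit float found on typical platforms.

```python
import math
import sys

# 0.1 and 0.2 cannot be stored exactly in binary, so the sum is slightly off.
total = 0.1 + 0.2
print(total)                      # 0.30000000000000004
print(total == 0.3)               # False

# math.isclose compares floats within a small tolerance instead.
print(math.isclose(total, 0.3))   # True

# Number of decimal digits a float can reliably represent (15 on typical platforms).
print(sys.float_info.dig)
```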

With these two numeric data types, we can perform calculations. All the arithmetic operations are available with Python, along with exponentiation, rounding, and modulo operation. (Denoted %, modulo returns the remainder of division. For example, 3 % 2 returns 1.) It is also possible to convert a float to an int, and vice versa.
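The operations mentioned here can be tried out directly. This is a short sketch with arbitrary example values, not an exhaustive list of what Python offers.

```python
# Basic arithmetic on ints and floats.
total = 7 + 2            # 9
difference = 7 - 2       # 5
product = 7 * 2          # 14
quotient = 7 / 2         # 3.5 (division always returns a float)
whole_part = 7 // 2      # 3   (floor division)
power = 7 ** 2           # 49  (exponentiation)
remainder = 3 % 2        # 1   (modulo)

# Rounding to two decimal places.
rounded = round(3.14159, 2)   # 3.14

# Converting between the two numeric types.
as_int = int(3.99)       # 3   (the fractional part is dropped, not rounded)
as_float = float(3)      # 3.0

print(quotient, whole_part, power, remainder, rounded, as_int, as_float)
```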

From Numbers to Letters and Words

Although “data” and “numbers” are practically synonymous in popular understanding, data types also exist for letters and words. These are called strings or string literals.

“Abcdefg” is a string.

So is:

“When in the course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another.”

For that matter so is:

“Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur vitae lobortis enim.”

Computer languages don’t care whether the string in question is a letter, a word, a sentence, or complete gobbledygook. They only care, really, whether the thing being stored is a number or not a number, and specifically whether it is a number the computer can perform arithmetic on. 

An aside: to humans, “1234” and 1234 are roughly the same. We humans can infer from context whether something is a calculable number, like on a receipt, or a non-calculable number, like a ZIP code. “1234” is a string. 1234 is an integer. We call those quotation marks “delimiters” and they serve the same role for the computer that context serves for us humans.  

Strings come with their own methods that allow us to do any number of manipulations. We can capitalize words, reverse them, and slice them up in a multitude of ways. Just as with floats and ints, it’s also possible to tell the computer to treat a number as if it were a string and do those same manipulations on it. This is handy when, for example, you encounter a street address in data analysis, or if you need to clean up financial numbers that someone left a currency mark on.
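A few of these manipulations are sketched below. The example values, including the price with a currency mark, are invented for illustration.

```python
word = "python"
print(word.capitalize())   # Python
print(word.upper())        # PYTHON
print(word[::-1])          # nohtyp  (reversed with a slice)
print(word[0:3])           # pyt     (first three characters)

# Treating a number as a string: + now joins text instead of adding.
zip_code = str(1234)
print(zip_code + "5")      # 12345
print(1234 + 5)            # 1239

# Cleaning up a financial number that someone left a currency mark on.
raw_price = "$1,234.50"
price = float(raw_price.replace("$", "").replace(",", ""))
print(price)               # 1234.5
```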

Encoding numbers and letters is a valuable tool, but we also need ways to check that things are what they claim to be. This is the role of the boolean data type, usually shortened to bool. This data type takes on one of two values: True or False, although it is also often represented as 1 or 0, respectively. Bools are what the comparison operators (<, >, !=, ==) return when they evaluate the truth value of some expression, like 2 < 3 (True) or “Thomas” == “Tom” (False). The boolean operators proper (and, or, not) then combine those truth values.
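The comparisons above can be run as written. The sketch below also shows, as a side note, that Python treats True and False as the integers 1 and 0. The variable names are illustrative only.

```python
is_smaller = 2 < 3
same_name = "Thomas" == "Tom"
print(is_smaller)                  # True
print(same_name)                   # False
print(2 != 3)                      # True

# In Python, bool is a kind of int: True behaves like 1 and False like 0.
print(True == 1, False == 0)       # True True
print(True + True)                 # 2

# and, or, not combine truth values.
both = is_smaller and same_name
either = is_smaller or same_name
print(both, either, not same_name) # False True True
```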

Ints, floats, and strings are the most basic data types that you will find across computing languages. And they are very useful for storing singular things: a number, a word. (Strictly speaking, an entire library could be encoded as a single string.)

These individual data types are great if we only want to encode simple, individual things, or make one set of calculations. But the real value of (digital) computers is their ability to do lots of calculations very quickly. So we need some additional data types to collect the simpler ones into groups.

From Individuals to Collections

The first of these collection data types is the list. This is simply an ordered collection of objects of any data type. In Python, lists are set off by brackets ([]). (“Ordered” here means that each element keeps the position it was given, not that the elements are sorted high-to-low or low-to-high.)

For example, the below is a list:

[1, 3.7, 2, 3.4, 4, 6.74, 5.0]

This is also a list:

[“John”, “Mary”, “Sue”, “Alphonse”]

Even this is a list:

[1, “John”, 2.2, “Mary”, 3] 

It’s important to note that within a list, each element (or item) of the list still operates according to the rules of its data type. But know that the list as a whole also has its own methods. So it’s possible to remove elements from a list, add things to a list, sort the list, as well as do any of the manipulations the individual data types support on the appropriate elements (e.g., calculations on a float or an integer, capitalize a string).
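Here is a brief sketch of both ideas, using the example lists from above: each element follows the rules of its own type, while the list has methods of its own. The value 8 is an arbitrary addition for the demo.

```python
mixed = [1, "John", 2.2, "Mary", 3]

# Each element is reached by its position and behaves like its own type.
print(mixed[0])            # 1
print(mixed[1].upper())    # JOHN  (a string method)
print(mixed[2] * 2)        # 4.4   (float arithmetic)

numbers = [1, 3.7, 2, 3.4, 4, 6.74, 5.0]

# The list itself can be changed and sorted.
numbers.append(8)          # add an element to the end
numbers.remove(3.7)        # remove the first element equal to 3.7
numbers.sort()             # sort in place, low to high
print(numbers)             # [1, 2, 3.4, 4, 5.0, 6.74, 8]
```

Note that sorting works here because every element is a number. Calling sort() on the mixed list would raise an error, since Python cannot compare a string with a number.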

The last and probably most complicated of the basic data types is the dictionary, usually abbreviated dict. Some languages refer to this as a map. Dictionaries are collections of key-value pairs, set off by braces ({}). Within each pair a colon separates the key from its value, and commas separate one pair from the next. (Since Python 3.7, a dictionary also remembers the order in which its pairs were inserted.) They enable a program to use one value to get another value. So, for example, a dictionary for a financial application might contain a stock ticker and that ticker’s last closing price, like so:

{“AAPL”: 167.83,
 “GOOG”: 152.62,
 “META”: 485.58}

In this example, “AAPL” is a key and 167.83 is its value. A program that needed the price of Apple’s stock could then get that value by looking up the key “AAPL.” And as with lists, the individual items of the dictionary, both keys and values, retain the attributes of their own data types.
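The lookup described here can be sketched with the same tickers and prices. The name closing_prices and the ticker “XYZ” are hypothetical, chosen only for the demo.

```python
closing_prices = {"AAPL": 167.83, "GOOG": 152.62, "META": 485.58}

# Use the key to get its value.
apple_price = closing_prices["AAPL"]
print(apple_price)                      # 167.83

# Keys and values keep the behavior of their own types.
print("AAPL".lower())                   # aapl
print(round(apple_price))               # 168

# .get returns None (or a default) instead of raising an error for a missing key.
print(closing_prices.get("XYZ"))        # None
print("GOOG" in closing_prices)         # True

# Walk through every key-value pair.
for ticker, price in closing_prices.items():
    print(ticker, price)
```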

These pieces form the basics of data types in just about every major scripting language in use today. With these (relatively) simple tools you can code up anything from the simplest function to a full spreadsheet program or a Generative Pre-trained Transformer (GPT). 

Extending Data Types into Data Analysis

If we want to extend the data types that we have into even more complicated forms, we can get out of basic Python and into libraries like NumPy and pandas. These bring additional data types that expand Python’s basic capabilities into more robust data analysis and linear algebra, two essential functions for machine learning and AI.

First we can look at the NumPy array. This handy data type allows us to set up matrices, which on the page look like lists, or lists of lists. However, arrays are less memory intensive than lists and allow more advanced calculation, so they are much better suited to working with large datasets.
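As a small illustration, the sketch below builds a 2-by-2 matrix from a list of lists and runs a few calculations on it. It assumes NumPy is installed, and the values are arbitrary.

```python
import numpy as np

# A list of lists becomes a two-dimensional array (a matrix).
matrix = np.array([[1, 2], [3, 4]])
print(matrix.shape)        # (2, 2)

# Arithmetic applies to every element at once, with no loop needed.
doubled = matrix * 2
print(doubled)

# Linear algebra operations such as matrix multiplication and transpose.
squared = matrix @ matrix
print(squared)
print(matrix.T)
```

Multiplying a plain Python list by 2 would repeat the list instead, which is one reason arrays are the better fit for numeric work.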

If we combine a bunch of arrays as labeled columns, we wind up with a pandas DataFrame. For a data analyst or machine learning engineer working in Python, this is probably the most common tool. It can hold and handle all the other data types we have discussed. The pandas DataFrame is, in effect, a more powerful and efficient version of the Excel spreadsheet. It can handle all the calculations that you need for exploratory data analysis and data visualization.
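To make the spreadsheet comparison concrete, here is a minimal sketch that loads the stock prices from the dictionary example into a DataFrame. It assumes pandas is installed. The column names and the 160 cutoff are invented for illustration.

```python
import pandas as pd

# Each column holds one data type: strings in one, floats in the other.
prices = pd.DataFrame({
    "ticker": ["AAPL", "GOOG", "META"],
    "close": [167.83, 152.62, 485.58],
})
print(prices)
print(prices.dtypes)

# A calculation over a whole column.
average_close = prices["close"].mean()
print(average_close)

# Filter rows, much like a spreadsheet filter.
above_160 = prices[prices["close"] > 160]
print(above_160)
```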

Data types in Python, or any programming language, form the basic manipulable unit. Everything a language uses to store information for future use is a data type of some kind, and each type has specific things it can store, and rules governing what it can do.

Learn Data Types and Structures in Flatiron’s Data Science Bootcamp

Ready to gain a deeper understanding of data types and structures to develop real-world data science skills? Our Data Science Bootcamp can help take you from basic concepts to in-demand applications. Learn how to transform data into actionable insights—all in a focused and immersive program. Apply today and launch your data science career!

Disclaimer: The information in this blog is current as of April 11, 2024. Current policies, offerings, procedures, and programs may differ.

About Charlie Rice

Charlie Rice is a senior data science instructor at Flatiron School. He has been teaching for nearly a decade professionally, and consults on FinTech and data analysis in his free time.
