DataFrames

What is a DataFrame?

A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.

Every DataFrame contains a blueprint, known as a schema, that defines the name and data type of each column. Spark DataFrames can contain universal data types like StringType and IntegerType, as well as data types that are specific to Spark, such as StructType. Missing or incomplete values are stored as null values in the DataFrame.
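To make the idea of a schema concrete, here is a minimal plain-Python sketch. It does not use the Spark API: the `Column` class, the `SCHEMA` list, and the `conforms` helper are hypothetical names invented for this illustration, and Python's `str`, `int`, and `None` stand in for StringType, IntegerType, and null.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One schema entry: a column name and the type its values must have."""
    name: str
    dtype: type


# Hypothetical two-column schema, loosely mirroring StringType and IntegerType.
SCHEMA = [Column("name", str), Column("age", int)]


def conforms(row: dict, schema: list) -> bool:
    """Return True if the row has exactly the schema's columns and every
    value is either None (a null) or an instance of the declared type."""
    if set(row) != {col.name for col in schema}:
        return False
    return all(
        row[col.name] is None or isinstance(row[col.name], col.dtype)
        for col in schema
    )


rows = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": None},  # missing value stored as a null
]
bad_row = {"name": "Linus", "age": "unknown"}  # wrong type for "age"

print([conforms(r, SCHEMA) for r in rows])
print(conforms(bad_row, SCHEMA))
```

The point of the sketch is only that a schema fixes column names and types up front, while still allowing a null wherever a value is missing.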

The spreadsheet analogy has one important limit. A spreadsheet sits on one computer in one specific location, whereas a Spark DataFrame can span thousands of computers. This is what makes it possible to run analytics on big data using distributed computing clusters.

The reason for spreading the data across more than one computer is intuitive: either the data is too large to fit on one machine, or a computation over it would simply take too long on one machine.
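The split-then-combine idea behind distributed processing can be sketched in plain Python. This is a toy model under stated assumptions, not how Spark is implemented: worker threads stand in for separate machines, and the function names (`split_into_partitions`, `partial_sum_and_count`, `distributed_mean`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal the rows round-robin into num_partitions smaller lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_sum_and_count(partition):
    """Work done independently on one partition; nulls (None) are skipped."""
    values = [v for v in partition if v is not None]
    return sum(values), len(values)


def distributed_mean(rows, num_partitions=4):
    """Compute a mean by combining small per-partition results."""
    partitions = split_into_partitions(rows, num_partitions)
    # Each thread plays the role of one machine in a cluster (an assumption
    # made for illustration only).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum_and_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


ages = [36, None, 45, 29, 52, 41]
mean_age = distributed_mean(ages, num_partitions=3)
print(mean_age)
```

Each worker only ever sees its own slice of the data, and only the small (sum, count) pairs are brought together at the end, which is why the same pattern can scale to data that does not fit on a single machine.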

The concept of a DataFrame is common across many different languages and frameworks. DataFrames are the main data type used in pandas, the popular Python data analysis library, and DataFrames are also used in R, Scala, and other languages.
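As a small illustration of the same concept in pandas, the sketch below builds a DataFrame from made-up sample data (the column names and values are assumptions for the example), inspects the column types, and handles a missing value.

```python
import pandas as pd

# Made-up sample data; None marks a missing value.
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "age": [36, None, 29],
})

# Each column has a name and a data type, much like a schema.
print(df.dtypes)

# Count missing values in the "age" column.
n_missing = int(df["age"].isna().sum())

# Filter rows with a condition; rows with a missing age do not match.
over_30 = df[df["age"] > 30]["name"].tolist()

print(n_missing, over_30)
```

Unlike a Spark DataFrame, a pandas DataFrame lives in the memory of a single machine, so it suits data that fits comfortably on one computer.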
