Pandas Dataframe - Databricks

Pandas Dataframe

Source Databricks

Pandas is an open-source, BSD-licensed library for the Python programming language that provides fast, flexible data structures and data analysis tools. This easy-to-use data manipulation tool was originally written by Wes McKinney. It is built on the NumPy package, and its key data structure is the DataFrame.

Pandas provides two main data structures:

  • Pandas DataFrame (2-dimensional)
  • Pandas Series (1-dimensional)
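As a minimal sketch of the two structures, the example below builds one Series and one DataFrame. The variable names and weather values are made up for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
temperatures = pd.Series(
    [21.5, 19.0, 23.1], index=["mon", "tue", "wed"], name="temp_c"
)

# A DataFrame is a two-dimensional labeled table; each column is a Series.
weather = pd.DataFrame(
    {"temp_c": [21.5, 19.0, 23.1], "rain_mm": [0.0, 4.2, 0.0]},
    index=["mon", "tue", "wed"],
)

print(temperatures.ndim)  # 1
print(weather.ndim)       # 2

# Selecting a single column of a DataFrame returns a Series.
temp_column = weather["temp_c"]
print(type(temp_column).__name__)  # Series
```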

Pandas reads data from sources such as CSV or TSV files or a SQL database and turns it into a Python object with rows and columns, known as a DataFrame. These objects are quite similar to the tables found in spreadsheet and statistical software (e.g., Excel or SPSS).
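The loading step might look roughly like the sketch below. To keep it self-contained, inline strings stand in for CSV and TSV files and an in-memory SQLite database stands in for a real SQL database; the `people` table and its contents are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# Inline text stands in for files on disk (hypothetical sample data).
csv_text = "name,age\nAda,36\nGrace,45\n"
tsv_text = "name\tage\nAda\t36\nGrace\t45\n"

# read_csv handles both formats; TSV only needs a different separator.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# An in-memory SQLite database stands in for a real SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)", [("Ada", 36), ("Grace", 45)]
)
from_sql = pd.read_sql_query("SELECT name, age FROM people", conn)
conn.close()

print(from_csv)
```

With real files, the `io.StringIO(...)` argument would be replaced by a file path.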

Much like a spreadsheet in Excel, a Pandas DataFrame lets you store and manipulate tabular data in rows of observations and columns of variables, and extract valuable information from the dataset.

What is a DataFrame?

A Pandas DataFrame is a way to represent and work with tabular data. It can be seen as a table that organizes data into rows and columns, making it a two-dimensional data structure.

A DataFrame can be created either from scratch or from other data structures, such as NumPy arrays.

Here are the main types of inputs accepted by a DataFrame:

  • Dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • Structured or record ndarray
  • A Series
  • Another DataFrame
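The input types listed above can each be passed to the `pd.DataFrame` constructor. The following is a small sketch with made-up values, one line per input type.

```python
import numpy as np
import pandas as pd

# Dict of lists (dicts of 1-D ndarrays, dicts, or Series work the same way).
from_dict = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# 2-D ndarray; column labels are supplied separately.
from_2d = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# Structured ndarray; the field names become the column labels.
records = np.array(
    [(1, 2.5), (2, 3.5)], dtype=[("id", "i4"), ("score", "f8")]
)
from_records = pd.DataFrame(records)

# A Series becomes a single-column DataFrame named after the Series.
from_series = pd.DataFrame(pd.Series([10, 20, 30], name="value"))

# Another DataFrame; copy=True avoids sharing data with the original.
from_frame = pd.DataFrame(from_dict, copy=True)

print(from_records)
```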

Advantages of using Pandas DataFrames:

  • Easily load data from different databases and data formats
  • Intuitively merge and join data sets on a common key to get a complete view
  • Segment records within a DataFrame
  • Use label-based slicing, flexible indexing, and subsetting on large data sets
  • Quickly aggregate and summarize data to get meaningful statistics using the built-in functions of Pandas DataFrames
  • Define your own Python functions for specific computational tasks and apply them to your DataFrame records
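Several of these operations can be sketched on a pair of small tables. The `orders` and `customers` data, the column names, and the `with_tax` helper are all hypothetical and chosen only for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 10, 12],
    "amount": [25.0, 40.0, 15.0, 60.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "region": ["east", "west", "east"],
})

# Merge two data sets on a common key.
merged = orders.merge(customers, on="customer_id", how="left")

# Segment records with a boolean mask.
large = merged[merged["amount"] >= 40.0]

# Label-based slicing: .loc slices include both endpoints.
by_order = merged.set_index("order_id")
middle = by_order.loc[2:3, ["amount", "region"]]

# Aggregate and summarize with built-in functions.
totals = merged.groupby("region")["amount"].sum()

# Apply a user-defined function to each value in a column.
def with_tax(amount, rate=0.1):
    """Return the amount with a flat tax added (hypothetical 10% rate)."""
    return round(amount * (1 + rate), 2)

merged["amount_with_tax"] = merged["amount"].apply(with_tax)

print(totals)
```

For simple arithmetic like this, a vectorized expression such as `merged["amount"] * 1.1` is usually faster than `apply`; `apply` is shown here because it works with arbitrary Python functions.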