Introduction to Data Analysis for Aspiring Data Scientists

Overview

Join us for a four-part learning series: Introduction to Data Analysis for Aspiring Data Scientists. This self-paced online workshop series is for anyone and everyone interested in learning about data analysis. No previous programming experience required.

Each workshop page contains the session video recording, transcripts, speaker info, and a GitHub link to access the notebooks and resources. We suggest you start with Part One, Introduction to Python, and continue from there in order because each workshop builds upon the last.

If you’d like to follow along, please sign up for your Community Edition account or download the Delta Lake library.


Introduction to Python

In this workshop, we will show you the simple steps needed to program in Python using a notebook environment on the free Databricks Community Edition. This workshop covers the major foundational concepts you need to start coding in Python, with a focus on data analysis. No prior programming knowledge is required.
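
To give a feel for what foundational Python with a data analysis focus can look like, here is a minimal sketch using only the standard library. It is not taken from the workshop notebooks; the sample temperatures and the names `mean`, `average`, and `warm_days` are made up for illustration.

```python
# Hypothetical sample data: daily temperatures in degrees Celsius.
temperatures = [21.5, 19.0, 23.4, 22.1, 18.7]


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Compute the average, then keep only the days warmer than average.
average = mean(temperatures)
warm_days = [t for t in temperatures if t > average]

print(f"Average: {average:.2f}")
print(f"Days above average: {len(warm_days)}")
```

You can paste this into a single notebook cell and run it as is. It touches variables, lists, functions, a list comprehension, and formatted output.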

Data Analysis with Pandas

This workshop is on pandas, a powerful open-source Python package for data analysis and manipulation. In this workshop, you will learn how to read data, compute summary statistics, check data distributions, conduct basic data cleaning and transformation, and plot simple visualizations. Although no prep work is required, we do recommend basic Python knowledge. To learn about Python, watch Part One, Introduction to Python.
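
As a rough preview of the kind of steps listed above, the sketch below reads a small dataset, computes summary statistics, and does basic cleaning and transformation with pandas. It is not from the workshop materials; the inline CSV and the column names `city`, `temperature_c`, and `temperature_f` are hypothetical, and plotting is left out to keep the example short.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """city,temperature_c
Austin,30.5
Boston,
Chicago,18.0
Denver,22.5
"""

# Read the data; the empty Boston value is parsed as a missing value (NaN).
df = pd.read_csv(io.StringIO(csv_text))

# Summary statistics (count, mean, std, min, quartiles, max) for one column.
summary = df["temperature_c"].describe()

# Basic cleaning: count missing values, then drop the rows that have them.
missing_count = int(df["temperature_c"].isna().sum())
cleaned = df.dropna(subset=["temperature_c"]).copy()

# Simple transformation: add a Fahrenheit column derived from Celsius.
cleaned["temperature_f"] = cleaned["temperature_c"] * 9 / 5 + 32

print(summary)
print(cleaned)
```

With a real file, you would pass its path to `pd.read_csv` instead of the `io.StringIO` wrapper.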

Introduction to Apache Spark

This workshop covers the fundamentals of Apache Spark, a widely used big data processing engine. In this workshop, you will learn how to ingest data with Spark, analyze the Spark UI, and gain a better understanding of distributed computing. No prior knowledge of Spark is required, but Python experience is highly recommended.