Craig is a Software Engineer on the Data Science Platform Engineering team at Sam’s Club. He builds data pipelines with Apache Airflow and develops cloud-based platforms that data scientists and engineering teams use to analyze data, apply machine learning models, and build production applications. His Airflow pipelines have efficiently scaled the movement of over half a trillion rows of on-premises data to the Azure cloud platform. He has extensive experience with Hadoop, Spark, machine learning, cloud platforms, and Airflow. Craig graduated from the University of Oklahoma with a Bachelor’s Degree in Mathematics and is based in the Sam’s Club offices in Bentonville, Arkansas.
April 23, 2019 05:00 PM PT
At Sam’s Club, we have a long history of using Apache Spark and Hadoop. Projects from all parts of the company use Apache Spark, from fraud detection to product recommendations. Because of the scale of our business, with billions of transactions and trillions of events, it is often essential to use big data technologies. Until recently, all of this work ran on several large on-premises Hadoop clusters.
As part of our transition to the public cloud, we needed to build out an enterprise-scale data platform. Azure Databricks is a key component of this platform, giving our data scientists, engineers, and business users the ability to easily work with the company’s data. We will discuss the architecture considerations that led us to use multiple Databricks workspaces and external Azure Blob Storage.
We will also discuss how we move massive amounts of data to Azure on a daily basis with Airflow. Further, we will discuss the self-service tools that we created to help users get their data to Azure and to help us manage the platform. Finally, we will discuss our security considerations and how they shaped our architecture.