Rapid PySpark Custom Processing on Time Series Big Data in Databricks
OVERVIEW
| EXPERIENCE | In Person |
|---|---|
| TYPE | Breakout |
| TRACK | Data Science and Machine Learning |
| INDUSTRY | Health and Life Sciences, Retail and CPG - Food |
| TECHNOLOGIES | Apache Spark, Delta Lake |
| SKILL LEVEL | Advanced |
| DURATION | 40 min |
Sleep Number smart beds are equipped with sensors underneath each leg that use the weight in the bed to generate personalized sleeper insights. The raw readings are inherently noisy due to sleeper movement and position, so a careful quality assessment is needed to select stable, low-entropy segments. To compute this at a granular level, a custom user-defined function for entropy is applied to rolling windows over the time series. The initial Pandas implementation did not suffice due to memory and runtime constraints, so the operation was reimplemented with PySpark on Databricks. Efficient and brute-force methods were examined at varying data sizes and cluster configurations. The recommended PySpark method processed 50 million records in roughly 0.3 seconds in Databricks, performing convoluted custom calculations on rolling windows of time series big data in effectively constant time irrespective of data size.
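As a rough illustration of the rolling-window approach described above, the sketch below applies a Shannon-entropy UDF over a per-bed rolling window using a Spark SQL window and `collect_list`. The schema, column names, binning, and window size are hypothetical, not the production Sleep Number pipeline.

```python
import math

from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("rolling-entropy-sketch").getOrCreate()

# Hypothetical schema: one pressure reading per bed per timestamp.
df = spark.createDataFrame(
    [("bed1", i, float(v)) for i, v in enumerate([50, 51, 50, 80, 49, 50, 52, 51])],
    schema="bed_id string, ts long, reading double",
)

@F.udf(T.DoubleType())
def shannon_entropy(values):
    """Shannon entropy of one window of readings, binned to integers."""
    if not values:
        return None
    counts = {}
    for v in values:
        b = round(v)  # crude binning; a real pipeline would tune this
        counts[b] = counts.get(b, 0) + 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Rolling window: the current row plus the 3 preceding rows, per bed.
w = Window.partitionBy("bed_id").orderBy("ts").rowsBetween(-3, 0)

result = (
    df.withColumn("window_vals", F.collect_list("reading").over(w))
      .withColumn("entropy", shannon_entropy("window_vals"))
      .drop("window_vals")
)
result.show()
```

Materializing each window as an array with `collect_list` is the straightforward brute-force formulation; for large windows, a vectorized `pandas_udf` or a pre-aggregated representation typically scales better, which is the kind of trade-off the session examines across data sizes and cluster configurations.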
SESSION SPEAKERS
Megha Rajam Rao
Research Scientist, Sleep Number
Gary Garcia Molina
Senior Principal Scientist, Sleep Number