An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud

We have deployed a hybrid cloud storage solution that leverages compute in the public cloud along with our specialized hardware storage. We will discuss the tradeoffs of hybrid cloud storage, which workloads are best suited for this model, the pipeline we have deployed, the challenges we faced, and the best practices we have learned. Spark provides a flexible compute environment that can be used alongside today's cloud compute providers.

However, in the read-heavy workloads that dominate much of analysis and machine learning today, storage costs scale poorly under these same cloud storage models. Hybrid cloud offers an alternative: amortized storage costs over a dedicated link, combined with elastic compute in the cloud. We currently run an end-to-end data science stack with multiple production workloads on this setup: a Spark-based ETL that transforms the real-time log data we ingest from our devices in the field into databases; a scale-out general regular expression search over log files that gives our support engineers real-time access to search for pathologies across our customer base; and a Spark-based machine learning system for time series analysis that predicts various customer metrics.
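
To make the log-search workload more concrete, here is a minimal single-machine sketch of the kind of per-partition pattern scan that a scale-out regular expression search would distribute across workers. It is not the production system described in this talk: it uses only the Python standard library, a thread pool stands in for cluster executors, and the function names (`scan_partition`, `search_logs`), the file names, and the log lines are all hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# A "partition" here is (source name, lines); in a real cluster this would be
# one slice of a much larger distributed dataset. Hypothetical structure.
Partition = Tuple[str, List[str]]
Hit = Tuple[str, int, str]  # (source name, 1-based line number, matching line)


def scan_partition(partition: Partition, pattern: re.Pattern) -> List[Hit]:
    """Return every line in one partition that matches the compiled pattern."""
    source, lines = partition
    return [
        (source, number, line)
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def search_logs(partitions: List[Partition], regex: str, workers: int = 4) -> List[Hit]:
    """Scan all partitions concurrently and concatenate the hits in input order."""
    pattern = re.compile(regex)  # compile once, reuse for every partition
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = pool.map(lambda p: scan_partition(p, pattern), partitions)
        return [hit for hits in per_partition for hit in hits]


# Hypothetical sample data standing in for device logs.
logs: List[Partition] = [
    ("array-01.log", ["INFO boot ok", "ERROR disk 3 latency 950ms"]),
    ("array-02.log", ["WARN fan speed high", "ERROR disk 7 latency 1200ms", "INFO sync done"]),
]

hits = search_logs(logs, r"ERROR disk \d+ latency \d{3,}ms")
for source, number, line in hits:
    print(f"{source}:{number}: {line}")
```

The design point the sketch illustrates is that each partition is scanned independently, so the work scales out by adding workers; only the matching lines, typically a small fraction of the input, need to travel back to the caller.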

Session hashtag: #HWCSAIS12



About Farhan Abrol

I lead development and product on META, the machine learning engine behind our industry-leading intelligence engine, which collects millions of points of sensor data from our fleet and helps predict performance and capacity for planning and alerting. Previously, I built a distributed sequencing system for guaranteeing data resiliency in a flash-based storage system, and dabbled in research in variational inference and Bayesian modeling.