While processing more data through an existing set of ETL or ML/AI pipelines is easy with Spark, dealing with an ever-expanding or constantly changing set of pipelines can be quite challenging, all the more so when there are complex interdependencies. Workflow-based job orchestration offers some help in the case of relatively static flows but fails miserably when it comes to supporting fast-paced data production such as data science experimentation (feature exploration, model tuning, …), ad hoc analytics and root cause analysis.
This talk will introduce three patterns for large-scale data production in fast-paced environments: just-in-time dependency resolution (JDR), configuration-addressed production (CAP) and automated lifecycle management (ALM). It includes ETL and ML/AI demos as well as open-source code you can use in your projects. These patterns have been production-tested in Swoop’s petabyte-scale environment, where they have significantly increased human productivity and processing flexibility while reducing costs by more than 10x.
By adopting these patterns you’ll get the benefits typically associated with rigidly planned and highly coordinated data production quickly and efficiently, without endless meetings or even a workflow server. You will be able to transparently ensure result accuracy even in the face of hundreds of constantly changing inputs, eliminate duplicate computation within and across clusters and automate lifecycle management.
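As a rough illustration of how configuration-addressed production could eliminate duplicate computation, the sketch below derives a stable address from a job's configuration and reuses any result already stored at that address. This is only one plausible reading of the pattern, not the code presented in the talk: the names `config_address` and `Producer`, the in-memory dictionary standing in for shared storage, and the sample configurations are all hypothetical.

```python
import hashlib
import json


def config_address(config: dict) -> str:
    """Derive a stable address from a configuration (hypothetical scheme)."""
    # Canonical JSON: sorted keys and fixed separators, so that two
    # equivalent configurations always serialize to the same bytes.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class Producer:
    """Toy in-memory store standing in for shared storage (an assumption)."""

    def __init__(self):
        self.store = {}
        self.computations = 0

    def produce(self, config: dict, compute):
        """Return the result for this configuration, computing it at most once."""
        address = config_address(config)
        if address not in self.store:
            self.computations += 1
            self.store[address] = compute(config)
        return self.store[address]


def describe(config: dict) -> str:
    # Placeholder for real work such as running a Spark job.
    return f"{config['source']} rows with clicks >= {config['min_clicks']}"


producer = Producer()

# Hypothetical configurations: cfg_b equals cfg_a with a different key order.
cfg_a = {"source": "events", "min_clicks": 3}
cfg_b = {"min_clicks": 3, "source": "events"}
cfg_c = {"source": "events", "min_clicks": 5}

result_a = producer.produce(cfg_a, describe)
result_b = producer.produce(cfg_b, describe)  # reused, not recomputed
result_c = producer.produce(cfg_c, describe)  # new configuration, computed
```

Because the address depends only on the configuration, any change to an input parameter yields a new address and a fresh computation, while identical requests from different users or clusters would resolve to the same stored result.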
Session hashtag: #SAISDev1
Sim Simeonov is an entrepreneur, investor and startup mentor. He is the founding CTO of Swoop and IPM.ai, startups that use privacy-preserving AI to improve patient outcomes and marketing effectiveness in life sciences and healthcare. Previously, Sim was the founding CTO of Evidon (CrownPeak) and Thing Labs (AOL) and a founding investor in Veracode (Broadcom). In his VC days, Sim was an EIR at General Catalyst Partners and a technology partner at Polaris Partners, where he helped start five companies the firms invested in, three of which have already been acquired. Before his days as an investor, Sim was vice president of emerging technologies and chief architect at Macromedia (now Adobe). Earlier, he was a founding member and chief architect at Allaire, one of the first Internet platform companies, whose flagship product, ColdFusion, ran thousands of sites such as Priceline and MySpace.