Since the invention of SQL and relational databases, data production has been about specifying how data should be transformed through queries. While Apache Spark can certainly be used as a general distributed SQL-like query engine, the power and granularity of Spark’s APIs allow for a fundamentally different, and far more productive, approach. This session will introduce the principles of goal-based data production with examples ranging from ETL to exploratory data analysis to feature engineering for machine learning.
Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to the smart data warehouse running on top of Spark. This separation not only substantially increases productivity but also significantly expands the audience that can work directly with Spark, from developers and data scientists to technical business users. With specific data and architecture patterns and live demos, this session will demonstrate how easy it is for any company to create its own smart data warehouse with Spark 2.x and gain the benefits of goal-based data production.
Session hashtag: #SFexp10
Sim Simeonov is the founding CTO of Swoop, a startup that brings the power of search advertising to content. Previously, Sim was the founding CTO at Ghostery, the platform for safe & fast digital experiences, and at Thing Labs, a social media startup acquired by AOL. Earlier, Sim was vice president of emerging technologies and chief architect at Macromedia (now Adobe) and chief architect at Allaire, one of the first Internet platform companies. He blogs at blog.simeonov.com, tweets as @simeons, and lives in the Greater Boston area with his wife, son, and an adopted dog named Tye.