Sim Simeonov is the founding CTO of Swoop, a startup that brings the power of search advertising to content. Previously, Sim was the founding CTO at Ghostery, the platform for safe & fast digital experiences, and Thing Labs, a social media startup acquired by AOL. Earlier, Sim was vice president of emerging technologies and chief architect at Macromedia (now Adobe) and chief architect at Allaire, one of the first Internet platform companies. He blogs at blog.simeonov.com, tweets as @simeons and lives in the Greater Boston area with his wife, son and an adopted dog named Tye.
The General Data Protection Regulation (GDPR), which came into effect on May 25, 2018, establishes strict guidelines for managing personal and sensitive data, backed by stiff penalties. GDPR's requirements have forced some companies to shut down services and others to flee the EU market altogether. GDPR's goal--to give consumers control over their data and thus increase consumer trust in the digital ecosystem--is laudable. However, there is a growing feeling that GDPR has dampened innovation in machine learning & AI applied to personal and/or sensitive data. After all, ML & AI are hungry for rich, detailed data, and sanitizing data to improve privacy typically involves redacting or fuzzing inputs, which multiple studies have shown can seriously degrade model quality and predictive power. While this tradeoff is real for some privacy-safe modeling techniques, it does not hold in general. The root cause of the problem is twofold. First, most data scientists have never learned how to produce great models with great privacy. Second, most companies lack the systems to make privacy-safe machine learning & AI easy. This talk will challenge the implicit assumption that more privacy means worse predictions. Using practical examples from production environments involving personal and sensitive data, the speakers will introduce a wide range of techniques--from simple hashing to advanced embeddings--for high-accuracy, privacy-safe model development. Key topics include pseudonymous ID generation, semantic scrubbing, structure-preserving data fuzzing, task-specific vs. task-independent sanitization and ensuring downstream privacy in multi-party collaborations. Special attention will be given to Spark-based production environments. Session hashtag: #SAISDD13
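To make "pseudonymous ID generation" concrete, here is a minimal sketch of one common approach: mapping raw identifiers to stable pseudonyms with a keyed hash. The talk does not specify its exact construction; the function name, key handling, and choice of HMAC-SHA256 below are illustrative assumptions, not the speakers' implementation.

```python
import hashlib
import hmac

# Hypothetical secret key for illustration only; in production it would live
# in a secrets manager and never be checked into code.
SECRET_KEY = b"replace-with-a-real-secret"

def pseudonymous_id(raw_user_id: str, key: bytes = SECRET_KEY) -> str:
    """Map a raw identifier to a stable pseudonym via a keyed hash (HMAC-SHA256).

    Unlike a plain hash, the keyed construction resists dictionary attacks:
    without the key, guessed raw IDs cannot be re-hashed to recover pseudonyms.
    """
    return hmac.new(key, raw_user_id.encode("utf-8"), hashlib.sha256).hexdigest()

# The same input always maps to the same pseudonym, so joins and
# aggregations still work on the sanitized data.
assert pseudonymous_id("user-12345") == pseudonymous_id("user-12345")
assert pseudonymous_id("user-12345") != pseudonymous_id("user-67890")
```

Because the mapping is deterministic per key, downstream pipelines can join and count on the pseudonyms without ever seeing the raw identifiers; rotating the key severs linkability across datasets.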
While processing more data through an existing set of ETL or ML/AI pipelines is easy with Spark, dealing with an ever-expanding and/or changing set of pipelines can be quite challenging, all the more so when there are complex inter-dependencies. Workflow-based job orchestration offers some help in the case of relatively static flows but fails miserably when it comes to supporting fast-paced data production such as data science experimentation (feature exploration, model tuning, ...), ad hoc analytics and root cause analysis. This talk will introduce three patterns for large-scale data production in fast-paced environments--just-in-time dependency resolution (JDR), configuration-addressed production (CAP) and automated lifecycle management (ALM)--with ETL & ML/AI demos as well as open-source code you can use in your projects. These patterns have been production-tested in Swoop's petabyte-scale environment, where they have significantly increased human productivity and processing flexibility while reducing costs by more than 10x. By adopting these patterns you'll get the benefits typically associated with rigidly-planned and highly-coordinated data production quickly & efficiently, without endless meetings or even a workflow server. You will be able to transparently ensure result accuracy even in the face of hundreds of constantly-changing inputs, eliminate duplicate computation within and across clusters and automate lifecycle management. Session hashtag: #SAISDev1
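The core idea behind configuration-addressed production (CAP) can be sketched in a few lines: derive the output's storage address from a canonical hash of the job's configuration, so that logically identical work is computed once and reused everywhere. The abstract does not publish Swoop's implementation; the names and the in-memory "object store" below are illustrative assumptions.

```python
import hashlib
import json

def config_address(config: dict) -> str:
    """Derive a stable storage address from a job's configuration.

    The config is canonicalized (sorted keys, fixed separators) so that
    logically identical configurations hash to the same address.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

_store: dict = {}  # stands in for a shared object store (e.g., cloud storage)

def produce(config: dict, compute):
    """Run compute(config) only if no output exists at the config's address."""
    addr = config_address(config)
    if addr not in _store:
        _store[addr] = compute(config)
    return _store[addr]
```

With this scheme, two data scientists running the same experiment configuration, even on different clusters sharing the store, transparently reuse one result, which is how duplicate computation gets eliminated without coordination meetings or a workflow server.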
Big data never stops and neither should your Spark jobs. They should not stop when they see invalid input data. They should not stop when there are bugs in your code. They should not stop because of I/O-related problems. They should not stop because the data is too big. Bulletproof jobs not only keep working but also make it easy to identify and address the common problems encountered in large-scale production Spark processing: from data quality to code quality to operational issues to rising data volumes over time. In this session you will learn three key principles for bulletproofing your Spark jobs, together with the architecture and system patterns that enable them. The first principle is idempotence. Exemplified by Spark 2.0 Idempotent Append operations, it enables 10x easier failure management. The second principle is row-level structured logging. Exemplified by Spark Records, it enables 100x (yes, one hundred times) faster root cause analysis. The third principle is invariant query structure. It is exemplified by Resilient Partitioned Tables, which allow for flexible management of large-scale data over long periods of time, including late arrival handling, reprocessing of existing data to deal with bugs or data quality issues, repartitioning already written data, etc. These patterns have been successfully used in production at Swoop in the demanding world of petabyte-scale online advertising.
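The idempotence principle can be illustrated with a toy model of an idempotent append: each write carries a deterministic batch id, and re-running the batch replaces the earlier attempt instead of duplicating rows, so a failed job can simply be retried. This is a minimal sketch of the general pattern, not the actual Spark 2.0 operation the talk describes; the dict-backed "table" stands in for partitioned storage.

```python
def idempotent_append(table: dict, batch_id: str, rows: list) -> None:
    """Append a batch of rows under a deterministic batch id.

    Writing the same batch id again (e.g., on retry after a failure)
    replaces the earlier attempt rather than duplicating its rows.
    """
    table[batch_id] = list(rows)

def read_table(table: dict) -> list:
    """Read all rows across all committed batches."""
    return [row for batch in table.values() for row in batch]

# A retry after a mid-job failure is safe: the second write supersedes
# the first instead of doubling the data.
events = {}
idempotent_append(events, "2018-06-05/attempt", [{"ad": "a1"}, {"ad": "a2"}])
idempotent_append(events, "2018-06-05/attempt", [{"ad": "a1"}, {"ad": "a2"}])
assert len(read_table(events)) == 2
```

The payoff is operational: failure management reduces to "run it again," because duplicate-on-retry, the classic hazard of plain appends, is structurally impossible.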
Since the invention of SQL and relational databases, data production has been about specifying how data should be transformed through queries. While Apache Spark can certainly be used as a general distributed SQL-like query engine, the power and granularity of Spark's APIs allows for a fundamentally different, and far more productive, approach. This session will introduce the principles of goal-based data production with examples ranging from ETL, to exploratory data analysis, to feature engineering for machine learning. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to the smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns and live demos, this session will demonstrate how easy it is for any company to create its own smart data warehouse with Spark 2.x and gain the benefits of goal-based data production. Session hashtag: #SFexp10
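The WHAT-vs-HOW distinction can be sketched as a tiny goal resolver: the user declares the output columns they want, and the system recursively works out which derivations to run and in what order. The registry, column names, and formulas below are hypothetical illustrations of the idea, not the smart data warehouse the session demonstrates.

```python
# Hypothetical registry of derived columns: name -> (dependencies, compute fn).
# A smart data warehouse would hold many such definitions and plan across them.
DERIVATIONS = {
    "revenue": (("price", "quantity"), lambda r: r["price"] * r["quantity"]),
    "margin":  (("revenue", "cost"),   lambda r: r["revenue"] - r["cost"]),
}

def resolve(row: dict, goals: list) -> dict:
    """Produce the goal columns, recursively computing dependencies first.

    The caller states WHAT it wants ("margin"); the resolver figures out
    HOW: margin needs revenue, revenue needs price and quantity.
    """
    def ensure(col):
        if col in row:
            return  # already present in the source data or computed earlier
        deps, fn = DERIVATIONS[col]
        for dep in deps:
            ensure(dep)
        row[col] = fn(row)
    for goal in goals:
        ensure(goal)
    return {goal: row[goal] for goal in goals}

# Asking for "margin" transparently triggers the "revenue" computation.
assert resolve({"price": 10, "quantity": 3, "cost": 25}, ["margin"]) == {"margin": 5}
```

Because the user never spells out the transformation chain, the same goal declaration keeps working when upstream derivations change, which is what lets non-developers work directly against the warehouse.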
Best-suited for current Spark contributors and Spark package creators, the conversation will focus on how the open-source community can help Spark grow outside of the Apache project, which has strict criteria about what is in & out of scope.
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark's APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production. Session hashtag: #EUent1