Sergii Mikhtoniuk

Founder, Kamu Data Inc.

Sergii is a Software Architect and polyglot engineer with experience ranging from hardware design, compiler development, and computer graphics to highly responsive, scalable distributed systems and data pipelines. In his role at Activision Blizzard he is responsible for the technical direction of the online platform that powers many of the world’s most popular video games. He is also a long-time personal analytics and open data enthusiast and the founder of Kamu, a company that is helping the world make sense of its growing supply of data.

Past sessions

Summit Europe 2020: Building a Distributed Collaborative Data Pipeline with Apache Spark

November 17, 2020 04:00 PM PT

The COVID-19 pandemic has spotlighted, as never before, the many shortcomings of the world's data management workflows. The lack of established ways to exchange and access data was widely recognized as a contributing factor in our poor response to the pandemic. On multiple occasions we have witnessed how poor practices around reproducibility and provenance completely sidetracked major vaccine research efforts, prompting many calls to action from the scientific and medical communities to address these problems.

Breaking down silos, reproducibility, and provenance are all complex problems that will not disappear overnight; solving them requires a continuous process of incremental improvements. Unfortunately, we believe that our workflows are not suited even for that. Modern data science encourages routine copying of data, with every transformation step producing data that is disjoint from its source. It is hard to tell where most data comes from or how it was altered, and there is no practical way to verify that no malicious or accidental alterations were made. All of our common data workflows contradict the essential prerequisites for collaboration and trust, meaning that even when results are shared they often cannot be easily reused.

This talk is the result of two years of R&D work in taking a completely different perspective on data pipeline design. We demonstrate what happens when the prerequisites for collaboration, such as repeatability, verifiability, and provenance, are chosen as core properties of the system. We present a new open standard for decentralized and trusted data transformation and exchange that leverages the latest advancements in modern data processing frameworks like Apache Spark and Apache Flink to create a truly global data pipeline. We also present a prototype tool that implements this standard and show how its core ideas can scale from a laptop to a data center, and onward to a worldwide data processing network that encourages reuse and collaboration.
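To make the verifiability and provenance properties more concrete, below is a minimal, purely illustrative sketch (in Scala, given the talk's Spark focus) of how a hash-linked metadata chain can make a pipeline tamper-evident. The MetadataBlock structure, its fields, and the hashing scheme are hypothetical and are not taken from the Open Data Fabric specification.

```scala
import java.security.MessageDigest

// Illustrative only: each block commits to the previous block, the transformation
// query, and the resulting data slice, so altering any upstream step changes every
// downstream hash and tampering becomes detectable without trusting the publisher.
object MetadataChainSketch extends App {

  def sha256(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString

  final case class MetadataBlock(prevHash: String, queryDigest: String, outputDigest: String) {
    // The block hash chains everything together.
    def hash: String = sha256(s"$prevHash|$queryDigest|$outputDigest")
  }

  val genesis = MetadataBlock("0" * 64, sha256("SELECT ... FROM source"), sha256("data-slice-0"))
  val step1   = MetadataBlock(genesis.hash, sha256("SELECT ... FROM genesis"), sha256("data-slice-1"))

  // Anyone re-running the declared queries can recompute and compare these hashes.
  println(step1.hash)
}
```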

What you will learn:
- Shortcomings of modern data management workflows and tools
- The important role of the temporal dimension in data
- How the latest data modeling techniques in OLTP, OLAP, and stream processing converge
- How bitemporal data modeling ideas apply to data streams (see the sketch after this list)
- How combining these ideas satisfies all preconditions for trust and collaboration
- A summary of the proposed "Open Data Fabric" protocol for decentralized exchange and transformation of data
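As a rough illustration of the bitemporal bullet above (a hypothetical sketch, not material from the talk itself): each record carries both an event time, when the fact was true in the real world, and a system time, when the record entered the dataset. Corrections are appended as new records rather than edited in place, which is what makes reproducible "as-of" queries possible.

```scala
import java.time.Instant

// Hypothetical example of bitemporal records in a data stream.
object BitemporalSketch extends App {
  // eventTime: when the observation refers to; systemTime: when it entered the dataset
  final case class CaseCount(region: String, eventTime: Instant, systemTime: Instant, count: Long)

  val records = Seq(
    CaseCount("BC", Instant.parse("2020-03-01T00:00:00Z"), Instant.parse("2020-03-02T09:00:00Z"), 12L),
    // A retroactive correction for the same event time, published three days later:
    CaseCount("BC", Instant.parse("2020-03-01T00:00:00Z"), Instant.parse("2020-03-05T09:00:00Z"), 27L)
  )

  // "As-of" query: reproduce exactly what a consumer would have seen at system time t,
  // taking the latest correction known at that point; this is what makes results repeatable.
  def asOf(t: Instant): Map[(String, Instant), Long] =
    records
      .filter(r => !r.systemTime.isAfter(t))
      .groupBy(r => (r.region, r.eventTime))
      .map { case (key, rs) => key -> rs.maxBy(_.systemTime.toEpochMilli).count }

  println(asOf(Instant.parse("2020-03-03T00:00:00Z"))) // sees the original value, 12
  println(asOf(Instant.parse("2020-03-06T00:00:00Z"))) // sees the corrected value, 27
}
```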

Speaker: Sergii Mikhtoniuk