Bartosz is a data engineer who enjoys working with Apache Spark and cloud data services. By day he works as a data engineering consultant at OCTO Technology. By night, he shares his data engineering findings on waitingforcode.com and becomedataengineer.com.
November 18, 2020 04:00 PM PT
If you want to extend Apache Spark and think you will need to maintain a separate code base in your own fork, you're wrong. You can customize different components of the framework, such as file commit protocols or state and checkpoint stores.
After the talk, you should be aware of the available customization strategies and be able to implement them on your own.
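As a concrete illustration, several of these components can be swapped in through configuration alone. The keys below exist in Spark's SQL configuration; the `com.example.*` class names are placeholders for your own implementations, not classes from the talk:

```
# spark-defaults.conf sketch — configuration keys are real Spark settings,
# com.example.* classes stand in for hypothetical custom implementations
spark.sql.sources.commitProtocolClass                com.example.MyCommitProtocol
spark.sql.streaming.stateStore.providerClass         com.example.MyStateStoreProvider
spark.sql.streaming.checkpointFileManagerClass       com.example.MyCheckpointFileManager
```

Each custom class must extend the corresponding Spark abstraction (for instance, a state store provider implements `StateStoreProvider`), which is what makes a fork unnecessary.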
Speaker: Bartosz Konieczny
October 15, 2019 05:00 PM PT
Analyzing sessions can provide a lot of useful feedback about what works and what does not. But implementing them is not easy because of the data issues and operational costs you will meet sooner or later. In this talk I will present two approaches to computing sessions with Apache Spark and AWS services. The first uses batch processing and therefore Spark SQL, whereas the second uses streaming and the Structured Streaming module.
During the talk I will cover different problems you may encounter when creating sessions, such as late data, incomplete datasets, duplicated data, reprocessing, and fault tolerance. I will show how Apache Spark features and AWS services (EMR, S3) can help solve them. After the talk you should be aware of the problems you may encounter in session pipelines, understand how to address them with Apache Spark features such as watermarks, the state store, and checkpoints, and know how to integrate your code with a cloud provider.
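To make the core idea concrete, the batch-side logic can be sketched in plain Python: events are grouped into a session as long as consecutive events fall within an inactivity gap. The 30-minute threshold is an assumed example, not a value from the talk, and a real pipeline would express this with Spark SQL window functions or Structured Streaming state instead:

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=30)):
    """Group event timestamps into sessions using an inactivity gap.

    A new session starts whenever the time since the previous event
    exceeds `gap`; otherwise the event joins the current session.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # within the gap: same session
        else:
            sessions.append([ts])     # gap exceeded (or first event): new session
    return sessions

events = [
    datetime(2019, 10, 15, 10, 0),
    datetime(2019, 10, 15, 10, 10),
    datetime(2019, 10, 15, 11, 0),
]
print(len(sessionize(events)))  # → 2: the 50-minute silence splits the activity
```

Late or duplicated events are exactly what complicates this simple picture: a duplicate inflates a session, and a late event may bridge two sessions that were already emitted, which is why watermarks and state stores matter in the streaming variant.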