A software engineer for over 10 years, I moved into the data world in 2016 as a data engineer. Over my career I have worked in different business domains (supply chain management system, insurance start-up, video-sharing platform), and recently I discovered the world of broadcasting as a data engineer at Canal+. I’m also a data and Apache Spark enthusiast and a fan of the learning tests method. I share all my Apache Spark experiences on the www.waitingforcode.com blog.
Analyzing sessions can bring a lot of useful feedback about what works and what does not. But implementing session pipelines is not easy because of the data issues and operational costs that you will run into sooner or later. In this talk I will present two approaches to computing sessions with Apache Spark and AWS services. The first one uses batch processing, and therefore Spark SQL, whereas the second uses streaming with the Structured Streaming module.
During the talk I will cover different problems you may encounter when creating sessions, like late data, incomplete datasets, duplicated data, reprocessing, and fault tolerance. I will try to solve them and show how Apache Spark features and AWS services (EMR, S3) can help. After the talk you should be aware of the problems you may encounter with session pipelines and understand how to address them with Apache Spark features like watermarks, the state store, and checkpoints, as well as how to integrate your code with a cloud provider.