Gustav is a developer focusing on big data and data science. Currently he is working in a research project setting standards for how to work with fleet telematics data at Scania. Gustav has over 20 years of experience in creating IT-solutions – always in a role that has included hands-on programming.
At Scania we have around +400 000 connected vehicles, trucks and buses, roaming the Earth and continuously collecting and sending us GPS coordinates globally. To create value based on this data we need to build scalable data pipelines enabling actionable insights and applications for our company and our customers. Building scalable big data pipelines that achieve this require constant improvement work and transformation. We will bring a talk on how we've architected a continuous delivery Spark streaming pipeline that we iteratively improve on by refining our code, our algorithms and the data deliverables. The pipeline runs on our data lake using Spark Streaming, stateful and stateless, and the data products are being pushed to Apache Kafka. In the build pipeline we use Jenkins, Artifactory and Ansible to test, deploy and run. We will present the technical architecture of a pipeline and the major changes it have gone through, what triggered the changes, their respective implementation and adaptation and what it taught us.
Today we deploy new generations of our pipeline with just a mouse click, but it has been a journey getting here. We believe that people with general awareness of the challenges of big data, the possibilities of the streaming paradigm and the need for CI/CD will find this talk enlightening. Topics will highlight: