Evolution is Continuous, and so are Big Data and Streaming Pipelines

At Scania we have more than 400,000 connected vehicles, trucks and buses, roaming the Earth and continuously sending us GPS coordinates from around the globe. To create value from this data we need to build scalable data pipelines that enable actionable insights and applications for our company and our customers. Building big data pipelines that achieve this requires constant improvement work and transformation. In this talk we describe how we have architected a continuous-delivery Spark streaming pipeline that we iteratively improve by refining our code, our algorithms and the data deliverables. The pipeline runs on our data lake using Spark Streaming, both stateful and stateless, and the data products are pushed to Apache Kafka. In the build pipeline we use Jenkins, Artifactory and Ansible to test, deploy and run. We will present the technical architecture of the pipeline, the major changes it has gone through, what triggered those changes, how each was implemented and adopted, and what it taught us.

Today we deploy new generations of our pipeline with just a mouse click, but it has been a journey getting here. We believe that people with a general awareness of the challenges of big data, the possibilities of the streaming paradigm and the need for CI/CD will find this talk enlightening. Topics highlighted include:

  • What each deployed generation of our pipeline has taught us about Spark Streaming and Kafka.
  • How we utilise the power of DevOps tools.
  • How we enable and ensure the delivery of quality assured data products at scale.

Video Transcript

– Hi everyone, my name is Sarah and I’m coming to you today from Scania Headquarters here in Sweden together with my partner in crime, Gustav.

We’re gonna run a talk today called Evolution is Continuous, and so are Big Data and Streaming Pipelines. So we have planned an agenda.

(laughs) Thank you.

So the thing here today is that I’m gonna start us off by telling you a little bit about us, about Scania, and some of the key challenges that we are facing. Then I’m gonna hand over to Gustav, who is going to give us a deep dive into how our streaming pipelines look today, and how we deploy them with just the click of a button.

450,000+ connected vehicles

So what is Scania then? Scania is a world-leading provider of transport solutions. This includes a factory side where we provide trucks and buses for heavy transport applications, but we also deliver an extensive product-related service offering. We are a global company, we operate on all continents of the world, and just to give you some visual to that statement, in the presentation background you can see all of our connected vehicles, each represented by one point on the map. The purpose that we at Scania live by is that we want to drive the shift towards a sustainable transport system, meaning that we want to develop safe, smart, energy-efficient transport solutions that are better for people and the planet. And of course technology is super important to us, all the way from improving our core processes to creating new business opportunities, and this is really where connectivity acts as a key enabler for us, because it allows for the control and coordination of entire systems to target enhanced efficiency as well as reducing CO2 emissions significantly.

All right, so hitting these targets is very much about making information and data accessible throughout the entire organization as well as to our customers. So one of the key things that we need to fix is that we need to process these large volumes of connected-product data, because transforming this data into information and visualizing it is how our consumers can actually utilize it to make informed decisions. So this is the mission of our team: creating actionable insights from data provided by our connected fleet. Just to highlight some of the use cases, to give you some context into what our team is trying to solve.

Can we optimize our transport flow?

One use case where we use live information based on our geospatial data is the optimization of transport flows. So the question that we need to answer for our customers and consumers is how arrival time and duration are related for a given transport hub.

So as we can see here in the presentation, the small image represents the geospatial data for a very small bounding box and a very limited time window, but even that is still a very large dataset and super complex to understand by just looking at it. So it doesn’t really give you any hint about how to optimize this area. We need to transform this raw geospatial data into something that we today call stop applications, which gives us the capability to build dashboards that we can use to answer questions about when vehicles stop, how they stop, and where the stops happen, because in this way our customers can look into how the execution of a transport relates to their schedule and plan. Transport is very much about having a vehicle up and running, moving from A to B, and not standing still. And as we can see in the box that I’ve highlighted, before lunch we actually have some deviations, with vehicles standing still for longer than transports going in and out of the same hub during the rest of the hours of the day.

So I told you that one of the things that we at Scania hold dear to our heart is that we want to drive the shift towards a sustainable future, and one application that we built and are showing here is a recommendation system for our customers on how to turn their diesel truck fleet into an electric one. By examining the historical transports that we saw previously, we can derive heuristics for the placement of charging infrastructure, and having determined a potential placement of charging infrastructure we can run battery simulations to establish the needed battery configuration, and this in turn lets us create a cost- and CO2-optimized commercial plan. So it is very much about helping our customers to be sustainable while keeping their fleet running in the same operations they are doing today.

Okay, so now I’ve tried to shed some light on what data we are using and what we want to solve. However we decide to approach our team’s mission, the truth is that there doesn’t really exist one final solution that serves all our current and future needs, because our solutions need to evolve with the challenges they aim to face, and the fact is that it’s not only about what we build, but very much about how we build it. Delivering insights through incremental improvements lies at the core of what we do, because working with these infinitely growing datasets, continuously providing new, valuable insights, and still making sure that our own capability to improve doesn’t fall behind is only possible when we’re able to work in cycles. Meaning that we need to be able to develop things, collaborate with other stakeholders and consumers, see what we can improve, and keep that in tight loops. We’ve been trying to do this for the last couple of years, and today we’ll share some of the learnings about pipelines that we have.

All right, so the first thing that we learned pretty early on was that we needed the ability for our solutions to scale with the data, meaning that we want the ability to scale out when the data scales up, and for that, batch jobs simply won’t do it. So the pipelines that we put in production today typically consist of a series of processing steps, where we have implemented each step as a separate Spark job, and when we utilize the streaming pattern we are using either Spark Streaming or Spark Structured Streaming, both reading from Kafka and writing to Kafka.
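To make this concrete, a minimal sketch of one such processing step is shown below; the topic names, bootstrap servers, checkpoint path, and pass-through transformation are placeholders rather than Scania’s actual code.

```scala
// A minimal sketch of one processing step written as its own Spark Structured Streaming
// job, reading from one Kafka topic and writing to the next. All names are placeholders.
import org.apache.spark.sql.SparkSession

object GpsProcessingStep {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gps-processing-step").getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // placeholder
      .option("subscribe", "gps-raw")                  // upstream topic (placeholder)
      .load()

    // Placeholder transformation: in a real step this is where the business logic lives.
    val processed = raw.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    processed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")             // placeholder
      .option("topic", "gps-processed")                            // downstream topic (placeholder)
      .option("checkpointLocation", "/checkpoints/gps-processing") // required by the Kafka sink
      .start()
      .awaitTermination()
  }
}
```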

And having these types of processes as a series of streaming jobs typically gives us a transparent, modular, and scalable solution. It’s not a black box that you have to work with; we have rather built a machine where we have the possibility to monitor the entire pipeline’s progress, pause the entire pipeline, rerun it if we want to, or even stop it to replace some parts, while at the same time we can work with transparency at a level where we can set the sizes of our transformations in relation to resources as well as customer needs.

So configuration and deployment of data pipelines is a complex task, and while running a single Spark application can be as simple as running a spark-submit, what we need is a reproducible strategy for managing all the credentials, the input, output, and checkpoint paths, and doing that for multiple environments. Running these as a series of processing components in a manual setting, and creating multiple data products for different consumers, would quickly become a game of herding cats. So we really need a simple strategy to make sure that we manage the inputs, the outputs, and the checkpoints of each component, and do that for different environments such as development, test, and production per component. Because when we develop and subsequently test changes in our components, we don’t want to interrupt anything that we are already delivering to our consumers; rather, we would like to write the outputs of test components to test topics, where we are entirely free to improve on what we deliver.
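One way to picture such a strategy (the configuration library, file names, and keys below are assumptions for illustration, not necessarily what the team uses) is to have each component resolve its topics and checkpoint path from an environment-specific configuration file.

```scala
// A hedged sketch: resolving environment-specific inputs, outputs and checkpoint paths
// from configuration files such as dev.conf, test.conf and prod.conf on the classpath.
import com.typesafe.config.ConfigFactory

final case class Endpoints(inputTopic: String, outputTopic: String, checkpointPath: String)

object PipelineConfig {
  def load(env: String): Endpoints = {
    val conf = ConfigFactory.load(env) // loads e.g. dev.conf for env = "dev"
    Endpoints(
      inputTopic     = conf.getString("kafka.input-topic"),
      outputTopic    = conf.getString("kafka.output-topic"),
      checkpointPath = conf.getString("spark.checkpoint-path")
    )
  }
}

// Usage: val endpoints = PipelineConfig.load(args(0)) // "dev", "test" or "prod"
```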

All right, and one super cool thing about working with Spark streaming is that it comes with an internal process for fault tolerance, ensuring that the Kafka offsets, and for stateful applications the state, are written to HDFS. But one problem with this fault-tolerance system as we use it today is that when we have an improvement that we want to deploy, the updated component loses the ability to pick up those states. So what we need to do is to ensure that our updated components can continue the processing where the previous component actually left off. It’s very much about building the ability to work like a pit stop: we can take a component in, improve it, and send our champion back out into the race, and we neither have to restart from the very beginning nor lose anything in between switches.
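One common pattern for this kind of hand-over, sketched below without claiming it is Scania’s exact solution, is to record the last processed Kafka offsets from the query progress, and then start the improved component with a fresh checkpoint directory and startingOffsets pointing at the recorded position; state for stateful operators would need its own hand-over strategy on top of this.

```scala
// Sketch: record the last processed Kafka offsets so an updated component can resume there.
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class OffsetRecorder extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit =
    event.progress.sources.foreach { source =>
      // endOffset is a JSON string such as {"gps-raw":{"0":12345,"1":67890}}
      persistOffsets(source.description, source.endOffset)
    }

  // Hypothetical helper: write the offsets to HDFS, a database or a compacted Kafka topic.
  private def persistOffsets(source: String, offsetsJson: String): Unit = ()
}

// Register with: spark.streams.addListener(new OffsetRecorder())
// The next generation of the component can then resume with:
//   spark.readStream.format("kafka")
//     .option("subscribe", "gps-raw")
//     .option("startingOffsets", """{"gps-raw":{"0":12345,"1":67890}}""")
//     .load()
```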

One really recurring need in our team is that we have to decide what to improve and how to improve it, and communicate the results of those improvements to the rest of the stakeholders and the other teams we’re working with. This is where unit testing and integration testing of entire pipelines are key for producing quality metrics that can guide these improvements, but they also very much work as the facilitator for communicating with our stakeholders about what we actually did improve.
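As an illustration of the kind of unit test involved, the sketch below exercises a hypothetical StopDetector.detectStops transformation; the function, schema, and threshold semantics are invented for the example and stand in for the real pipeline logic.

```scala
// A hedged illustration of a unit test guarding a single transformation.
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class StopDetectorSpec extends AnyFunSuite {
  private lazy val spark = SparkSession.builder()
    .master("local[2]")
    .appName("stop-detector-test")
    .getOrCreate()

  test("a vehicle that stays put longer than the threshold yields exactly one stop") {
    import spark.implicits._
    val positions = Seq(
      ("truck-1", "2020-06-01T10:00:00Z", 59.3293, 18.0686),
      ("truck-1", "2020-06-01T10:30:00Z", 59.3293, 18.0686)
    ).toDF("vehicleId", "timestamp", "lat", "lon")

    val stops = StopDetector.detectStops(positions) // hypothetical transformation under test
    assert(stops.count() === 1)
  }
}
```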

So just as customized packaging typically takes place somewhere after final assembly in a factory, we also need customizable post-processing to ship our data products to consumers, ’cause with multiple customers that come with varying questions, we most typically use case-specific post-processing, because it allows us to utilize the same data product to answer multiple business questions. So in the two use cases that I presented earlier, remember? The one with the transport optimization and the other one recommending electric trucks and the conversion to those, we actually utilize the same output of the same streaming pipeline, but in order to do that we have use-case-specific post-processing pipelines that take those final steps to actionable insights.

– Yeah, thank you, Sarah. Let’s continue with the basics. By our definition, a component is self-contained runnable code. Self-contained in the sense that it automates its own setup with an environment-specific configuration. Given a component name, there is also an established mechanism for fetching the component, and in our case this mechanism is implemented by a binary repository. Self-contained, automates its own setup, so just fancy automation then? Nice. No, it’s more than that. These are the properties that enable us to place various programs under the same abstraction. It gives us our basic building block, just like Lego has the Lego brick.

Our components are either Spark jobs or Spark streaming jobs, but the abstraction, and the technique we use to uphold the abstraction, can be used for any type of program. To be more specific, our components are packages containing the Spark program, a mechanism for installing, configuring, and running the program, and configuration inventories, that is, sets of configuration parameters specific to each environment. We use Ansible for automating the tasks needed for running a particular component.
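The configuration inventories can be pictured as a small set of per-environment variable files; the following is an invented example of what they could contain (variable names and values are illustrative, not the actual inventories).

```yaml
# inventories/dev/group_vars/all.yml
---
kafka_bootstrap_servers: kafka-dev:9092
input_topic: gps-raw-dev
output_topic: stops-dev
checkpoint_path: /checkpoints/dev/stop-detection

# inventories/prod/group_vars/all.yml
---
kafka_bootstrap_servers: kafka-prod:9092
input_topic: gps-raw
output_topic: stops
checkpoint_path: /checkpoints/prod/stop-detection
```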

Let’s have a closer look at Ansible and an Ansible playbook. It is actually rather self-explanatory. We need to copy the program to a place where we can do the spark-submit, we need to set up authentication, and finally we need to start the program, in this case with a spark-submit. As you can see, this can be expressed in an easy-to-read, declarative way using Ansible.
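A simplified playbook along those lines might look like the following; the host group, module arguments, paths, and Kerberos details are assumptions for illustration rather than the actual playbook.

```yaml
- hosts: edge_nodes
  tasks:
    - name: Copy the Spark application to the submit host
      copy:
        src: "stop-detection-{{ component_version }}.jar"
        dest: "/opt/pipeline/stop-detection.jar"

    - name: Set up authentication (e.g. obtain a Kerberos ticket)
      command: kinit -kt /etc/security/keytabs/pipeline.keytab pipeline@EXAMPLE.COM

    - name: Start the streaming job with spark-submit
      command: >
        spark-submit --master yarn
        --class com.example.StopDetection
        /opt/pipeline/stop-detection.jar
        {{ input_topic }} {{ output_topic }} {{ checkpoint_path }}
```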

Component

Now, we have talked a bit about running the component, and we need a process for creating the component as well. A component is built from a repository by Jenkins. This means that whenever a component is updated by a code commit, Jenkins will check out the component, build it, test it, and publish it to Artifactory. On a side note, we have an established structure for our projects, enabling us to generate the Jenkins jobs from code.

With a component published in the binary repository the next level of testing begins. We verify that we can run the component in a production-like environment and we also make sure that it’s compatible with the other components in the processing pipeline.

This leads us to our next abstraction, the release. By our definition, a release is a compatible set of components. Compatible in the sense that they are tested together and that the outcome is according to expectation. In other words, a release is a set of components that are verified to be compatible with each other.

This verification is performed by the integration test. The integration test fetches and runs the components in a defined order and then verifies the intermediate and the final results.

In this way, we can verify that a particular combination of component versions is compatible and that it generates the expected result.

Here is an example of a release, and as you can see, it can be implemented as a simple JSON file.
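The slide itself is not reproduced here, but such a release file could look something like this, with invented component names and version numbers.

```json
{
  "release": "geospatial-pipeline-2020.06.1",
  "components": {
    "gps-ingest": "1.4.2",
    "gps-enrichment": "2.1.0",
    "stop-detection": "3.0.1",
    "kafka-publisher": "1.2.0"
  }
}
```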

Release

Just like with a component, there is an established mechanism for fetching a release and this is also implemented by the binary repository.

We have now explained what we mean by component and release, and they build the foundation for our final abstraction, the pipeline specification.

The pipeline specification is a version-handled specification of all active pipelines. What release they use, additional configuration, and order of execution. With this, the production system has all needed information for running the pipelines, regardless of orchestration mechanism.

The pipeline specification describes both what and how. It combines the release with an execution schema. Also, the pipeline specification allows for different versions and types of pipelines to be defined. So why do we think this is so cool then? Because it allows us to dynamically deploy pipelines just by updating the pipeline specification.
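The exact format is not shown in the talk, but as an illustration (field names and values are assumptions), a pipeline specification combining a release with an execution schema could look something like this.

```yaml
pipelines:
  - name: stop-detection-streaming
    environment: prod
    release: geospatial-pipeline-2020.06.1
    execution:
      type: streaming
      order:
        - gps-ingest
        - gps-enrichment
        - stop-detection
        - kafka-publisher

  - name: stop-detection-streaming-candidate
    environment: test
    release: geospatial-pipeline-2020.07.0-rc1
    execution:
      type: streaming
      order:
        - gps-ingest
        - gps-enrichment
        - stop-detection
        - kafka-publisher
```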

Just like with Lego and the Lego bricks, we can combine our components and swiftly update or create new processing pipelines.

As you can see, amazingly simple, but still very powerful. With the pipeline specification, we have a clear interface to the production environment. In our case the pipeline specification is used to generate our Airflow jobs, which are responsible for the execution of the pipelines in production.

Instead of deploying on-premises using Airflow, we could use the same specification and, for example, generate a CloudFormation stack for deploying and running our pipelines in AWS.

Putting it together

So let’s put component, release, and pipeline specification in the context of how our team works. Imagine our team getting a request for adding a picture column to one of our data products. We make the change locally on one of our developer desktops. When we’re happy with the change, we push it to Git. Our Jenkins server picks up the change, pulls the updated repo, compiles the code, runs the unit tests, and if they are successful, packages the component and publishes it to Artifactory.

This will in turn trigger an integration test pipeline that will test the latest versions of the components together. If the integration tests are also successful, a release will be published specifying the components, just as we saw an example of in our previous slides. When we have a pipeline that is ready for production we add the new pipeline to our specification, specifying the newly created release, the execution schema, and additional configurations.

The pipeline specification is finally used to generate the Airflow jobs for running the pipeline in production. This enables us to take responsibility for delivering what our customers need within the timeframe they need it, while at the same time ensuring that we are accountable for the code we produce and deploy.

The final pipeline doesn’t exist. When customers get answers to their questions, their questions change. We need develop, test, deploy loops. Think end to end. Put things in the hands of customers and their use case to get feedback on what should be improved. We should avoid premature design decisions. Trade-offs must be made in the context of customer needs and priorities. Value creation is not a one-off effort. It’s a continuum, and our processes need to allow for that.

It has been interesting for us to summarize our experiences, and we hope you will have use for them in your current or future endeavors. Feel free to reach out to us with any ideas, thoughts, or reflections.

About Gustav Rånby

Scania CV

Gustav is a developer focusing on big data and data science. Currently he is working in a research project setting standards for how to work with fleet telematics data at Scania. Gustav has over 20 years of experience in creating IT solutions, always in a role that has included hands-on programming.

About Sarah Hantosi Albertsson

Scania CV

Sarah works as a development engineer in the Connected Intelligence team at Scania's R&D section, a team that creates innovative and scalable data products to ensure actionable insights. Together with her team she finds the technical architecture and implementation to realise business opportunities, whether that's implementing big data pipelines, tweaking Kafka throughput, optimising algorithms utilising Spark, or building data visualisations for logistics optimisation. With a degree in Cognitive Science, she is at her best where she can combine her interest in people, service design and technology, and where cross-functional teams and working end-to-end are part of her daily routine.