Ming Yuan is the technical architect of Cloud and Analytics at Zurich North America. He has extensive experience in designing applications and data architectures and implementing big data applications. At Zurich NA, Ming is responsible for supporting machine learning and predictive analytics initiatives through managed cloud architecture and strategy.
Zurich North America is one of the largest providers of insurance solutions and services in the world, with customers representing a wide range of industries from agriculture to construction, including more than 90 percent of the Fortune 500. Data science is at the heart of Zurich's business, with a team of 70 data scientists working on everything from optimizing claims-handling processes to protecting against the next risk to revamping the suite of data and analytics for its customers.
In this presentation, we will discuss how Zurich North America implements a scalable, self-service data science ecosystem built around Databricks to optimize and scale activities across the data science project lifecycle, and integrates the Azure data lake with analytical tools to streamline machine learning and predictive analytics efforts.
- Hi, good afternoon. Thank you for attending our session. My name is Ming Yuan. I am a technical architect working for Zurich North America. And today, Dave and I are going to share our journey of building an analytical ecosystem on the cloud.
So now I'll turn it over to Dave. - [Dave] Hi, everyone. I'm Dave Carlson. Currently, I'm a Solutions Architect at Databricks. Formerly, and that's what we're talking about today, I managed the LXE engineering team at Zurich North America. So today, we'll be talking about some of the technical capabilities that have been built out at Zurich and how they support the end-to-end analytics lifecycle, and we'll go into some detail around our metastore as well as some deployment capabilities that have been built out.
So for those that are not familiar with Zurich North America, we're one of the largest commercial property and casualty insurance companies in the US, and we're part of the larger global Zurich Insurance Group. The area that I worked in was called Data Analytics, and one of the things we wanted to make sure we could do was bring data-driven decision making to our business. At Zurich, we had three main focus areas. Customers: through our data insights, we can help our customers better understand and manage their risk, because the best loss is the one that doesn't happen, right? Underwriting: we wanted to support and enhance the risk selection process and help our underwriters and business leads with their program structuring and pricing decisions, both at an account level and at a portfolio level. And last but not least, claims: we wanted to improve outcomes in the claims-handling process, so that for claims that could develop into more severe claims without proper oversight, we get the right claims professional, maybe a practitioner, assigned early.
And how do we do that? To the right, you'll see what many people have seen some iteration of: the analytics lifecycle, from ideation to the actual model build, development, execution, and model monitoring. So what are some of the key capabilities we needed? As data scientists, you need some fairly technical capabilities, but at the same time, we're not developers. So we want to make sure that things that need to be taken care of, but that we don't necessarily want to be aware of on a day-to-day basis, like security and scalability, are taken care of for us. One key capability is data discovery: I need to know what I can analyze, and I can't analyze what I can't find. So we needed to make sure that, across the tens of thousands of tables we potentially had access to, we could easily find and discover the relevant data needed for analysis. And once we discovered it, we needed to be able to integrate it in a way that we could work with it and build it into something we could actually deploy in a data pipeline. Previously, we did the analysis in proprietary analysis tools, so we'd have to extract all the business requirements and re-implement them to deploy, and that could take up to two years back in the day, because work done in proprietary tools had to be refactored into other deployment languages. And collaboration: not all of our data scientists and analysts are co-located in the same office; we're not all in the same city, country, or timezone. So we needed a tool set that could encourage collaboration on projects across multi-person teams, and not only multi-person teams, but also multi-persona teams.
So we might have business analysts, data analysts, data engineers, data scientists, and actuaries all wanting to contribute and understand the source of the data, what transformations are being done and whether they are being done appropriately, and to be able to try out new features and integrate them quickly. So that was one of our requirements as we were building out this ecosystem of tools and our analytics landscape. And last, but definitely not least, business impact: we wanted to dramatically improve our speed to market. We wanted to make sure that our data discovery tools integrated with our integration tools, that the common data science frameworks our data scientists want to use, in Python and R, could easily connect to all of those, and that we could chain them in our deployment pipelines.
And related to that is multiple types of implementation. Sometimes we might be building something to integrate into a calling application, so we might have to build a REST API. Other times it was fine for the business to know the next day, or maybe once a week; but if it's a triage or workflow type of thing, we definitely need to know in that instant. And batch was generally easier and quicker to build and deploy than real-time integrations. Also, scalability, and scalability can mean different things to different people. From an analysis standpoint, we needed tools that allowed us to scale up or scale out the analysis; the days of data sets that fit and work on your local desktop or laptop are gone, that generally isn't the case in an enterprise setting. So we needed to make sure we could easily scale up and scale out our compute. From a people perspective, we definitely preferred server-side, web-based tools, so you don't have to worry about your client environment, which in an enterprise setting might be locked down, and so that our tools could scale out to a fairly large user base. And then deployments: we needed tools where we could take the work done by our engineers and data scientists and quickly move it to a deployment ecosystem. And now I'm going to hand it over to Ming, and Ming is going to go into detail around some of our data discovery and integration capabilities. - [Ming] Thanks, Dave. So let's talk about the data foundation. To us, the data foundation doesn't just mean the bytes on disk.
Data scientists need a powerful platform where data can be stored, transformed, and analyzed. On the slide, here are a few use cases of the data platform. First of all, the platform should allow data scientists to apply mathematical algorithms to the data and to generate insights. Those insights would in turn help Zurich and our customers to make better business decisions and to drive business impact.
After being curated, that data can serve many purposes. The platform should also provide an avenue for business users to explore data, for data analysts to find hidden patterns, and for reporting analysts to create BI reports.
The definition of analyzable data has expanded from the traditional corporate tabular format to all kinds of structured, semi-structured, or even unstructured data. The platform should allow users to mash them up together and make sense of them. Other use cases include optimizing operational cost and supporting security and governance processes. In response to those use cases, Zurich has implemented a data lake on the Azure cloud. The lake concept has been around for a while. Built on cloud services, our lake is not only a simple storage place but also offers plenty of processing power.
We divided the physical storage into separate layers. Just like water, data flows through those layers and gets purified for the consumer. The layered design aims to meet different needs from different groups.
Each layer is a dedicated network location, and the data files are organized into a folder structure by subject area, source system, and timestamp. Spark clusters are the computation engine pushing data through those layers.
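As an illustration of that folder convention, here is a minimal sketch in Python. The layer names and the example subject area and source system are assumptions for illustration, not Zurich's actual naming.

```python
from datetime import datetime

# Hypothetical layer names; the talk only says storage is split into
# layers that data flows through on its way to the consumer.
LAYERS = ("raw", "refined", "curated")

def lake_path(layer, subject_area, source_system, ts):
    """Build a folder path organized by subject area, source system,
    and timestamp, as described in the talk."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"/{layer}/{subject_area}/{source_system}/{ts:%Y/%m/%d}"

# Example: a claims extract landing in the raw layer
print(lake_path("raw", "claims", "policy_admin", datetime(2020, 6, 1)))
# /raw/claims/policy_admin/2020/06/01
```

A convention like this makes it cheap for both the ETL framework and human users to locate any day's files for a given source without consulting a registry.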
The cloud provides multiple options to economically implement a data lake architecture. We store data files on Azure ADLS and use the Databricks PaaS offering. ETL pipelines are managed by an in-house developed framework. Currently, our data lake platform has more than 900 terabytes of data and performs more than 40,000 transformation tasks each day, and those numbers are still growing.
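To give a feel for what managing tens of thousands of transformation tasks involves, here is a toy sketch of the kind of dependency ordering an in-house ETL framework might perform. The task names and structure are illustrative only, not Zurich's actual framework.

```python
def run_order(tasks):
    """Return task names in dependency order (a simple topological sort).
    `tasks` maps each task name to the list of tasks it depends on."""
    done, order = set(), []
    def visit(name, seen=()):
        if name in done:
            return
        if name in seen:
            raise ValueError(f"dependency cycle at {name}")
        for dep in tasks[name]:
            visit(dep, seen + (name,))
        done.add(name)
        order.append(name)
    for name in tasks:
        visit(name)
    return order

# Hypothetical mini-pipeline moving claims data through the lake layers.
pipeline = {
    "land_raw_claims": [],
    "cleanse_claims": ["land_raw_claims"],
    "build_claims_mart": ["cleanse_claims"],
}
print(run_order(pipeline))
# ['land_raw_claims', 'cleanse_claims', 'build_claims_mart']
```

At 40,000 tasks a day, having the framework (rather than people) derive the execution order from declared dependencies is what keeps the pipelines maintainable.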
The next topic is metadata. Metadata is a shorthand representation of the data. It makes finding and working with the data of interest easier. Given the size and complexity of our data lake, a platform supporting metadata management became critical.
We defined an admin role to collect and manage metadata entries on the platform. At the other end of the platform are the users. They use the platform to search, view, use, and share metadata. We expect our metadata management platform to facilitate all of those functions.
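The two roles described here can be sketched as a minimal catalog in Python. The field names and functions are hypothetical illustrations of the admin (register) and user (search) sides, not the data model of any real catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataEntry:
    name: str            # e.g. a table or file-set name
    source_system: str
    description: str = ""
    tags: list = field(default_factory=list)

catalog = []

def register(entry):
    """Admin role: collect and manage metadata entries."""
    catalog.append(entry)

def search(term):
    """User role: find entries by name, description, or tag."""
    term = term.lower()
    return [e.name for e in catalog
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in t.lower() for t in e.tags)]

register(MetadataEntry("claims_fact", "datalake",
                       "Curated claims transactions", ["claims"]))
register(MetadataEntry("policy_dim", "on_prem_dw", "Policy master data"))
print(search("claims"))  # ['claims_fact']
```

A real platform adds lineage, endorsements, and source synchronization on top, but the core search/curate split is the same.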
After carefully evaluating several data catalog products, we chose Alation as the metadata management platform. Using the built-in connectors, we quickly integrated Alation with the data lake plus 18 on-prem data repositories. The synchronization with source systems is automated and managed by Alation, so that metadata changes in those sources can be rapidly reflected. Currently, our Alation platform has more than 2 million metadata records covering 1.5 petabytes of data.
Alation's UI is easy to use. Metadata information is well organized in web pages.
We have enabled the social collaboration features such as Google-like search, endorsements, object watching, and alerting. A built-in SQL editor allows data scientists to quickly run ad hoc SQL queries against any source system. All of these front-end functions make the data catalog a self-service portal for the entire data ecosystem. Now, I'm turning it over to Dave to talk about the next platform.
- [Dave] Thanks Ming. So, as Ming talked about our data discovery, we wanted to make sure that as we were looking through our data catalog, and our data scientists and analysts found additional information about a given data source, they could contribute that back, but also then be able to pull that into an overall ETL pipeline for training as well as for scoring. And the way that we did that, we leveraged multiple tools, which have a flow we'll go over in a bit. But we wanted to make sure that we eliminated a lot of the silos that we had historically, where our data analysts might want to use a SQL-based tool, or people who really wanted to use R would use RStudio, or people who maybe had a smaller data set and wanted to use Python might just be using local Jupyter. What would happen is, on larger projects, we would have different portions being worked on and analyzed separately. They might copy over the resulting data sets and integrate them together, but on some projects we lost that full end-to-end view, at least from an ease-of-integration standpoint. So, to the extent that we had to stitch this into a common pipeline, we had to once again refactor all of that into a common pipeline, which really slowed things down and had a tremendous impact on our speed to market. So, we wanted to make sure that we could really help those multiple personas that we talked about before work together and integrate into an overall pipeline.
So this is just a really high-level view and flow of that. Hopefully one thing that comes across is the integration. We needed to easily facilitate the integration of various data sources; as Ming said, we had a cloud data lake, but we still had a lot of on-prem historic data sources, maybe Netezza, maybe Microsoft SQL Server, Postgres, or what have you. And we didn't want our data scientists wasting cycles on tracking down not only the right drivers, but also the right credentials. I'm sure this hasn't only happened at Zurich; everyone's done it, where they accidentally put secrets into a code repository or something like that. We wanted to make sure that the right way was the easy way. So we leveraged Dataiku as our front-end integration tool. That allowed people who wanted to work in a codeless or code-light type environment, or even SQL or R, to push the scale of computation out to a Spark layer, and allowed us to build that out in a way that captured all those business rules. We were also able to take that, bundle it, test it, and push it out to production. Our timelines went down from tens of months to a matter of months, which, compared to what we were used to, was huge for our business. It also helped us support the multiple types of deployment we discussed, where to get the initial business impact you might do a simpler batch deployment first, and then work on the more robust, real-time type of deployment solution. And as far as other options for deployment, Ming is going to be covering our use of containerization and CI/CD processes.
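The "right way is the easy way" point about credentials can be sketched in a few lines. This is not Zurich's or Dataiku's actual mechanism; it simply shows the pattern of resolving connection details from a managed store (environment variables stand in for a secret manager here) so that nothing sensitive lands in a notebook or repository.

```python
import os

def get_connection_config(source):
    """Resolve connection settings for a named source from the
    environment, e.g. NETEZZA_HOST / NETEZZA_USER / NETEZZA_PASSWORD."""
    prefix = f"{source.upper()}_"
    cfg = {k[len(prefix):].lower(): v
           for k, v in os.environ.items() if k.startswith(prefix)}
    missing = {"host", "user", "password"} - cfg.keys()
    if missing:
        raise KeyError(f"missing secrets for {source}: {sorted(missing)}")
    return cfg

# Illustrative values only; in practice these are injected by the platform.
os.environ.update({"NETEZZA_HOST": "nz.example.com",
                   "NETEZZA_USER": "svc_analytics",
                   "NETEZZA_PASSWORD": "***"})
print(get_connection_config("netezza")["host"])  # nz.example.com
```

When fetching a credential is this easy, nobody is tempted to hard-code one, which is exactly the incentive the talk describes.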
- [Ming] Thanks Dave. Let's switch gears to model deployment with containers. Containerization is a lightweight virtualization mechanism. A container can be easily created from a manifest file and precisely encapsulates the application with its own runtime environment.
Containers help model applications reduce the system footprint and resolve potential conflicts between machine learning libraries. The container platform also naturally supports elasticity: instances can be rapidly provisioned or de-provisioned to adapt to workload changes. This is really useful for economically maintaining the right system capacity. Containers are portable packages and can run virtually anywhere, so we can easily migrate a containerized application from development to QA to production.
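As an example of the manifest file mentioned above, here is a hypothetical Dockerfile for a containerized scoring application. The base image, file names, and versions are illustrative only; pinning the ML libraries inside the image is what resolves the library-conflict problem described in the talk.

```dockerfile
# Illustrative manifest for a model scoring service, not Zurich's actual build.
FROM python:3.8-slim
WORKDIR /app

# Pin machine learning libraries per model, so two models with
# conflicting dependency versions can coexist as separate images.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Bundle the trained model artifact and the service code together.
COPY model/ ./model/
COPY app.py .

EXPOSE 8080
CMD ["python", "app.py"]
```

Because everything the model needs is inside the image, the same artifact moves unchanged from development to QA to production.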
Along with containers, we also adopted microservices. This term refers to an architectural style: a set of small services communicating through lightweight mechanisms. Each service runs in its own process and can be independently deployed and tested. Containerization and cloud services naturally support this architecture. We see microservices as a good fit for implementing scoring engine APIs. As Dave mentioned before, the diagram on the slide is our reference architecture, where we tied microservices, containers, and the cloud together.
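A scoring engine API of the kind described here can be sketched with nothing but the Python standard library. The model is a stub (a hypothetical linear score with made-up feature weights), and the service shape is an illustration, not Zurich's actual implementation.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def score(features):
    """Stub model: a linear score over two hypothetical claim features."""
    weights = {"claim_age_days": 0.01, "reserve_amount": 0.00001}
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

class ScoringHandler(BaseHTTPRequestHandler):
    """POST a JSON feature dict, get back {"score": ...}."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        result = {"score": score(json.loads(body))}
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(result).encode())

# To run the service in a container (see the manifest above):
#     HTTPServer(("", 8080), ScoringHandler).serve_forever()
```

Because the service owns its own process and speaks plain HTTP/JSON, it can be deployed, scaled, and retired independently of whatever application calls it, which is the microservice property the talk highlights.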
DevOps is a methodology where the development team and the ops team work closely from design to development to production support. This collaborative approach promotes productivity, so that IT teams build, test, and release software faster and more reliably.
While working on our data platforms, we have successfully introduced DevOps principles to the field of data and analytics. The data scientists and the IT teams start collaborating as soon as a new project kicks off. Push-button deployment enables us to iteratively release new versions, and business users can also participate in the process: they can test and give feedback right after each release. These short feedback loops help not only the data scientists to adjust model definitions, but also the business users to understand and predict the business impact.
To support DevOps processes, we integrated a set of tools with our platform. The diagram depicts a sample pipeline commonly used across our projects. When a new modeling project begins, such pipelines, along with system components, are created. Data scientists, developers, and engineers can then use those tools to collaborate during the course of the whole model development lifecycle. To summarize, we started our journey by streamlining the phases in the model development lifecycle and defining the main use cases at each phase.
Those use cases drove the requirements to build the analytical ecosystem. We evaluated many analytics tools on the market. We found each of them has its own strengths, priorities, and vision. We selectively adopted the platforms that meet our requirements, align with our processes, and match our skill set. More importantly, we paid great attention to integrating the individual products into a holistic ecosystem.
Data science, AI, and machine learning are fields evolving really quickly. We are open to riding the new waves and continuously improving the technical landscape and our current processes.