Modernize Your Data Warehouse and Data Lake to Databricks Delta with Informatica

Modern enterprises depend on trusted data for AI, analytics, and data science to drive deeper insights and business value. With intelligent and automated data management, you can take advantage of Databricks Delta to gain the efficiency, cost savings, and scale you need to succeed. In this session we discuss how customers can modernize their on-premises data lakes and data warehouses to a modern architecture centered on Databricks Delta, one that not only increases developer productivity but also enables data governance for data science and analytics.


 


Video Transcript

My name is Rodrigo Sanchez Bredee, and what I wanna talk to you about today is how to modernize, very quickly, your on-premise or fully managed data lakes, and potentially even data warehouses, into Databricks Delta by leveraging Informatica.

Let's start by talking about the motivation for all this. I have experience building machine learning products, reinforcement learning algorithms, in the smart thermostat industry, and it is a hard problem. Building out that code and finding the right model is in itself a really difficult problem. However, most of the effort actually doesn't go into the machine learning code itself. Historically, what we've seen is that only the largest companies are really able to create these successful projects, because it turns out that a lot of the difficulty comes in procuring the data, making sure you have the right data sets, and making sure you have the right infrastructure built around your model to find that successful implementation. So it's about collecting, verifying, and preparing the data, and configuring and managing that cloud-scale infrastructure. That's really where the heavy lifting is. The difficulty of procuring the data, forming the data, and making sure it's ready for the machine learning project means that data scientists spend up to 80% of their time finding, preparing, and wrangling the data into the right shape, and far less time modeling. Of course the percentages vary by industry, but generally speaking it is a huge time sink for data scientists, and this is something that we're trying to help you automate and make a lot faster, because we've been in the data management business for a long time and we know how to do this very well at scale.

The other thing, of course, is that the volumes and types of data, the variety of data, are increasing exponentially, which doesn't mean that the on-premise data is going away; if anything, it's growing as well. So you need a hybrid approach that allows you to deal with the ingestion of data from on-premise applications, streaming sources, cloud sources, weblogs, et cetera, and you need to be able to do this at scale and very, very quickly.

And once you start getting into scale, you're gonna run into the problem of risk management. You wanna make sure that you're provisioning data that is properly governed, that it is the right data, that it is fit for use, and that you can show how you determined it is fit for use. So you need data lake governance through cataloging and lineage, along with process governance. How do you decide what gets moved into Delta? How do you decide who has access to the data in Delta? These are non-trivial questions that you need to be able to answer. You need to build your data scientists' trust in the data that you're provisioning. And finally, today a lot of data engineers wind up spending a lot of time maintaining a lot of code for what are essentially repetitive data pipelines for ingestion and processing.
So you wanna automate a lot of that: essentially use a visual, zero-coding approach that is metadata driven and allows you to understand the health of your platform very, very easily, rather than spending a lot of time reinventing the wheel and coding everything by hand. And finally, you also wanna use the advantages of the cloud, right? You don't wanna spend a lot of time capacity planning for essentially fixed resources on an on-premise cluster. You wanna leverage the limitless capacity for compute and storage that you have available in the cloud, and scale up and down, out or in, depending on the workload, without spending a lot of time negotiating resources or doing a lot of capacity planning.

So today what I wanna get into is a little more detail about these three legs of the stool that we've built together with Databricks. I'm gonna try to take you down from the high level, 30,000 feet, to 10,000 feet, to ground level, on some of the things that we did together. As I said before, we think of it as a three-legged stool: we help you manage the data science life cycle by making the data easy to discover, we help you with ingestion, and we help you build these data engineering pipelines in a very efficient way.

So let's talk about ingestion first. When you start your data science project, your Delta Lake is empty: you need to fill it, you need to manage it, you need to govern the processes around it. Informatica can help you easily ingest data from various cloud and on-premise sources, whether they are applications, databases, files, or streaming sources, and move that data into the Delta Lake. We've really simplified the UI: it's a very easy configuration, a couple of steps, maybe two or three screens, and you're up and running, moving data at scale into Delta.
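
To ground what "moving data at scale into Delta" means underneath the simplified UI, here is a minimal hand-coded sketch of a single-table ingestion in PySpark on Databricks. This is not Informatica-generated code; the JDBC connection details, secret scope, and Delta path are hypothetical placeholders.

```python
# Hypothetical hand-coded ingestion job: copy one relational table into a Delta table.
# On Databricks, `spark` and `dbutils` are provided; connection details are placeholders.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://onprem-db:5432/sales")      # hypothetical source
    .option("dbtable", "public.orders")
    .option("user", dbutils.secrets.get("ingest-scope", "db-user"))      # assumed secret scope
    .option("password", dbutils.secrets.get("ingest-scope", "db-pass"))
    .load()
)

(
    orders.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")        # tolerate additive schema changes on reload
    .save("/mnt/deltalake/raw/orders")    # hypothetical Delta Lake path
)
```

The mass-ingestion tooling described in the talk effectively generates and schedules jobs like this per source, which is the repetitive work it is meant to take off the data engineer's plate.
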

We've tried to really give you all the power of the Informatica platform in a very easy-to-use UI for ingestion. Just get up and running and get data moving into Databricks quickly. But that's not enough. Once you have the data in the lake, you need to transform it, enrich it, and cleanse it, and you need a way to do all these things in a repeatable way that doesn't result in a lot of manual work. So you wanna automate as much as you can. What we do is give you this UI, this ability to create all these data flows in a very simple drag-and-drop format without having to write any Scala or SQL code, and then just submit it and push it down into Databricks to run on either AWS or Azure.

This metadata-driven approach supports better manageability of your logic. It gives you the ability to upgrade to new versions of Databricks Spark without having to change your code or your APIs. It is DevOps ready, it is multicloud ready, and it also supports operationalizing your models and algorithms: things such as models that you may have built in Python can be embedded into this framework and run as part of your data engineering pipelines. So it's a truly agnostic and multicloud approach.

But all of this doesn't work if you don't have the right level of governance and discoverability. So we support this data science lifecycle with our enterprise data catalog for true end-to-end lineage, so that data analysts, data stewards, and data engineers are able to trace the data lineage from the original source all the way down to the Delta table, to the data engineering pipeline, and to the predictions coming out of the machine learning code. You have this Holy Grail of end-to-end visibility into how the data is moving into Delta, how it's being transformed in Delta, and who's doing what, so that when the next round of iteration and improvement in all these data pipelines comes around, you don't have to spend a lot of time rediscovering it; it's all very easy to access and view. And our catalog itself is embedded with AI, with our engine that we call CLAIRE, which provides advanced data discovery, lineage, profiling, and data curation. It integrates with data governance and self-service data preparation right from the catalog. So what we ultimately want to do is give you the ability, without any code, with minimal operations, and with no limits on the data, to very quickly get your machine learning, data science, and analytics projects to be successful at scale.

And I wanna show you some specific examples, so let me take you down one more level. This is just a screenshot of what's possible. You see here on the upper left what we call a mapping, which is our representation of a data flow. It's a very simple drag-and-drop interface where you move all these elements onto a canvas and then create the connections between these different operations, which we call transformations. But what I wanted to illustrate here is that if you had to hand code the mapping you see on the upper left, you would spend a lot of time writing the code that you see on the right, whether you wanted to use Scala or Java for Spark, or even SQL; you can see it's not trivial.
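
For a sense of scale, here is a deliberately small, hypothetical PySpark equivalent of the kind of logic a mapping expresses: a filter, a join, and an aggregation written by hand. The real generated code referenced on the slide is Scala and considerably longer; the table paths and column names below are made up for illustration.

```python
# Hypothetical hand-coded equivalent of a small visual mapping:
# read two Delta tables, filter, join, aggregate, and write the result.
from pyspark.sql import functions as F

orders = spark.read.format("delta").load("/mnt/deltalake/raw/orders")        # hypothetical paths
customers = spark.read.format("delta").load("/mnt/deltalake/raw/customers")

result = (
    orders
    .where(F.col("status") == "SHIPPED")                     # filter transformation
    .join(customers, on="customer_id", how="inner")          # joiner transformation
    .groupBy("region")                                        # aggregator transformation
    .agg(F.sum("amount").alias("total_amount"),
         F.countDistinct("order_id").alias("order_count"))
)

result.write.format("delta").mode("overwrite").save("/mnt/deltalake/curated/sales_by_region")
```

Multiply this by dozens of sources, transformations, and engine upgrades, and the appeal of a generated, metadata-driven mapping becomes clearer.
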
However, if you want to make sure that the logic is sound and understand what this code does, in a visual designer it becomes very, very easy. And more importantly, as I said before, as the engine underneath evolves and you have revisions, we take care of that. We take care of taking that mapping, converting it to the latest and greatest version of the Databricks engine, and submitting it as a Scala job under the covers, so that you get the best of both worlds: very fast, very robust delivery of business logic, plus the best-in-class characteristics of the Databricks engine. And more importantly, we also help you with the operation of it. So this is an example of how you can create the workflow and orchestrate everything, from invoking the Databricks service, setting up the cluster, and running the jobs, to verifying that the data made its way across correctly and, at the end, killing the cluster so that you don't incur costs for unused capacity. So we help you both with the design as well as the operationalization.

And as I said at the beginning, one of the other interesting elements that we have is the ability to do mass ingestion. You can ingest data from various sources, including relational tables, into the Delta Lake with minimal interfaces. All you have to do is point to the source, describe any minimal transformation you want (maybe some filters, maybe some tagging), and load the data, including the ability to accept changes to the schema as you move the data into Delta. You can also move massive amounts of data in the form of files, or from streaming sources. What we wanna do is give you the ability to do this very quickly, but also with the necessary orchestration.

We're also big believers in streaming; actually, before this role I was a product manager for our streaming products. Using the exact same interface I mentioned before for building your data pipelines, you can also build streaming pipelines. We were one of the first out of the gate to support Spark Structured Streaming, way back in Spark 2.3.1, and we keep supporting it. We have the ability to support windowing based on event time, and you can define a watermark for late data handling. So we've taken all these great characteristics of Structured Streaming and made them available in the same easy-to-use interface, so that you can build both your batch and your streaming products in a single interface and make them work for you. And this is also where you can operationalize your machine learning models, by embedding them inside a data pipeline, so the same place you use to build your data engineering pipelines is the place you use to operationalize your machine learning models.
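
Coming back to the orchestration workflow described a moment ago (invoke the Databricks service, set up the cluster, run the jobs, verify, kill the cluster), here is a rough hand-rolled sketch of those steps against the Databricks REST Clusters and Jobs APIs. The workspace URL, token, runtime version, node type, and notebook path are placeholders, and a production workflow would add error handling and wait for the cluster to reach a running state before submitting work.

```python
# Hypothetical orchestration sketch using the Databricks REST API (Clusters + Jobs).
# Host, token, runtime version, node type, and notebook path are placeholders.
import time
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# 1. Set up a cluster.
cluster_id = requests.post(
    f"{HOST}/api/2.0/clusters/create", headers=HEADERS,
    json={"cluster_name": "ingest-run", "spark_version": "7.3.x-scala2.12",
          "node_type_id": "i3.xlarge", "num_workers": 2},
).json()["cluster_id"]
# (A real workflow would poll until the cluster reaches the RUNNING state here.)

# 2. Run the job on that cluster.
run_id = requests.post(
    f"{HOST}/api/2.1/jobs/runs/submit", headers=HEADERS,
    json={"run_name": "load-orders",
          "existing_cluster_id": cluster_id,
          "notebook_task": {"notebook_path": "/pipelines/load_orders"}},
).json()["run_id"]

# 3. Verify the run finished before moving on (simplified polling, no result checks).
while True:
    state = requests.get(f"{HOST}/api/2.1/jobs/runs/get",
                         headers=HEADERS, params={"run_id": run_id}).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        break
    time.sleep(30)

# 4. Kill the cluster so you don't pay for idle capacity.
requests.post(f"{HOST}/api/2.0/clusters/delete", headers=HEADERS,
              json={"cluster_id": cluster_id})
```

The workflow designer described in the talk manages the equivalent of this create-run-verify-terminate loop visually; the sketch is only meant to show how much plumbing even the simple case involves when written by hand.
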

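
And for the streaming support mentioned above, event-time windowing with a watermark for late data maps directly onto Spark Structured Streaming primitives. This is roughly what the equivalent hand-written job looks like in PySpark; the Kafka broker, topic, message schema, and Delta paths are hypothetical.

```python
# Minimal Structured Streaming sketch: event-time window with a watermark for late data,
# writing results to a Delta table. Broker, topic, schema, and paths are hypothetical.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "sensor-readings")             # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

aggregated = (
    events
    .withWatermark("event_time", "10 minutes")                    # tolerate late events
    .groupBy(F.window("event_time", "5 minutes"), "device_id")    # event-time window
    .agg(F.avg("reading").alias("avg_reading"))
)

(
    aggregated.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/deltalake/_checkpoints/sensor_agg")
    .start("/mnt/deltalake/curated/sensor_agg")
)
```
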
And the third leg of the stool, as I mentioned before, is our catalog. With our catalog you're able to create a map; I like to think of it as a Google Maps of how your data is flowing into Delta, how it's flowing inside Delta, and what the impact of all these changes is as you're moving the data, so that when you're a data scientist trying to find the right data set to use in your work every day, it's a lot easier to understand the validity and the fitness for use of the data by looking at the catalog.

So if I wanted to paint a picture of the ideal architecture here: Informatica's vision is to deliver the most comprehensive, end-to-end, best-of-breed solutions for the complete set of modern data integration and big data problems. From ingestion to preparation, to cataloging, to securing, governing, and accessing the data, Informatica delivers all the critical capabilities for modern data integration and Delta Lakes. This end-to-end solution completely supports Databricks, and we work hand in hand with the Databricks team to make sure that as they bring innovations to market, you can leverage them using Informatica products. Make no mistake, trying to do all this on your own is just not productive, repeatable, or scalable. You really do need a comprehensive end-to-end solution, powered by unified metadata management and a machine-learning-based brain, that allows you and your teams to get out of complex, manual, siloed efforts of integrating data and start putting all your data assets to use immediately.

So just to round it out, what have we learned today? Well, just like Databricks is a single platform for doing data science and data engineering, Informatica is a single vendor for the three legs of integration: ingestion, data transformation, and governance. And if you want to talk about what you need in order to build successful AI and ML projects, and really fulfill the promise of those on-premise data lakes by modernizing them into Delta: as we talked about, you need to be able to find and discover the data, using a catalog and integrated governance of that Delta Lake. You need to be able to accelerate the movement into that data lake; you need to be very nimble, very agile, and as you find more sources, very quickly move them into the lake. You need to be able to prepare and enrich the data from all the available sources before you start to model, and make sure that data is in the right form. You need to create your data engineering pipelines in a way that's easily scalable and very repeatable, so again, you spend most of your time on the machine learning models and less of your time on the data pipelines. You also wanna increase the productivity of your data engineers by using this no-code, no-ops, metadata-driven approach. And finally, you wanna go serverless, and this is where our partnership with Databricks is so important to us, because with Databricks you can take advantage of the increased flexibility, performance, and no limits on data that they offer.

And we do have examples that are public, such as Takeda Pharmaceuticals. They had a lot of functional data silos because they had multiple platforms from the merger between Takeda and Shire. They had static Hadoop-based clusters that limited their data volume and required manual, time-consuming management, which was slowing down their innovation and their speed to implementation, because it was really hard to drive collaboration. What they also wanted to do was lower the high cost of resources, because their compute did not scale up and down in response to the load.
They also had a high cost from repeat purchases of data sets, because again, the data was scattered all over the place. The integration of Informatica and Databricks with Delta Lake on AWS resulted in 75% faster cluster creation and about 10x faster execution on benchmark queries, from 28 minutes in some cases down to six minutes, so an order of magnitude faster. On average, speeds improved by 30%, and productivity also increased by 30 to 50%, just from having faster availability and the ability to execute a lot faster. In the end that all results in significant performance improvements and significant cost savings, from reusing all the data pipelines they had built using Informatica abstractions in their on-premise data lake and moving them into Databricks Delta.

So I wanna finish off by saying that we are working hand in hand with Databricks as they offer these migration solutions to move your on-premise data lakes, and some customers are even talking about moving their on-premise EDW, their Enterprise Data Warehouse, into Delta. It's a very strong technology offering, and what we wanna do is help you accelerate those efforts, but also reduce the risk associated with those efforts. You kinda get the best of both worlds when you do that. So I hope this was useful and interesting to you. We'd love to get your feedback, of course; please rate the session. I look forward to hearing more, and I hope you enjoy the rest of the sessions here at Spark + AI Summit. Thank you very much.


 
About Rodrigo Sanchez Bredee

Informatica

Currently the Sr. Director of Strategic Ecosystems, Big Data, he is responsible for driving strategic alignment between Informatica and Databricks, among others. Previously he was Sr. Director, Product Management - IoT at Informatica, where he was responsible for setting the direction for its real-time and streaming analytics capabilities. Before Informatica he led the product management function responsible for the award-winning Home Energy Management solution from EcoFactor, Inc. His experience also includes stints at HP WebOS and Amazon's Lab126, as well as in the optical sensor, hard disk drive, and semiconductor equipment industries. He holds two M.Sc. degrees from MIT, in Management and in Engineering, and a B.Sc. in Mechanical & Electrical Engineering from Monterrey Tech.