Intro to Delta Lake

May 26, 2021 03:15 PM (PT)

Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.

In this session watch:
Himanshu Raja, Senior Product Manager, Databricks
Brenner Heintz, Technical Product Marketing Manager, Databricks
Barbara Eckman, Senior Principal Software Architect, Comcast

 

Transcript

Himanshu Raja: Hello, everyone. I’m super excited to be here and talk to you about Delta Lake and why it is the right platform for data analytics in your organization. In today’s session, I’ll cover the challenges of building data analytics stacks, why Lakehouse is the only future-proof solution, and why Delta Lake is the best foundation for your Lakehouse. Brenner will then jump into the more exciting part of the session and show you some well-established capabilities of Delta Lake, such as schema enforcement, time travel, upserts, and performance optimizations such as Z-ordering. I’m also excited to have Barbara Eckman, who is a senior principal software architect at Comcast, joining us to talk about how Comcast is using Delta Lake to do analytics at scale. After this session, you will have enough context and links to the supporting material to get started and build your first Delta Lake.
So let’s start by taking the pulse of the situation today. Why does it matter to have the right architecture? Every company is feeling the pull to become a data company, because when large amounts of data are applied to even the simplest of business problems, the improvement in the quality of insights and use cases is exponential. And our entire focus is on helping customers apply data to their toughest problems. Let me tell you the story of two such companies, NHS and Nationwide. NHS in the UK, using Databricks, is able to ingest and secure over 4 billion rows of clinical GP data. They analyze hospital data to provide a virus spread prediction model to optimally allocate resources, right down to ICU bed and staffing demands at individual locations. NHS is also using Databricks to support contact tracing efforts for over 60 million UK citizens, ingesting COVID antigen and antibody test result datasets and returning insights back to the NHS Test and Trace service with patient contact details for contact tracing.
Nationwide, one of the largest insurance providers in the US, saw that explosive growth in data availability and increasing market competition were challenging them to provide better pricing to their customers. With hundreds of millions of insurance records to analyze for downstream ML, Nationwide realized that their legacy batch analysis process was slow and inaccurate, providing limited insights to predict the frequency and severity of claims. With Databricks, they have been able to employ deep learning models at scale to provide more accurate pricing predictions, resulting in more revenue from claims. Because of this potential, it’s not surprising that according to a report published by MIT Sloan Management Review, 83% of CEOs say that AI is a strategic priority, or that Gartner predicts AI will generate almost $4 trillion in business value in only a couple of years. But it is very hard to get it right. Gartner says 85% of big data projects will fail.
VentureBeat published a report that said 87% of data science projects never make it into production. So while some companies are having success, most still struggle. Why is that? Why is it so hard to replicate what big tech companies are able to do to generate massive value out of their data? To understand this, let’s take a step back and understand what data analytics in most companies looks like today. It all starts with a data warehouse, which, hard as it is to believe, will soon celebrate its 40th birthday. Data warehouses came around in the eighties and were purpose-built for BI and reporting. Over time, they have become essential, and today every enterprise on the planet has many of them. However, they weren’t built for modern data use cases. They have no support for data like video or audio or text, data sets that are crucial for modern use cases.
The data had to be very structured, queryable only with SQL. As a result, there is no viable support for data science or machine learning. In addition, there is no support for real-time streaming. Data warehouses are great for batch processing, but they either do not support streaming or it can be cost prohibitive. The so-called modern cloud data warehouses have tried to address some of these challenges, thanks to the near-infinite scalability of resources provided by cloud service providers, such as EC2 and S3 on AWS and VMs and storage accounts on Azure. However, the cloud data warehouses have done that with limited success, with a huge sticker shock for many customers down the road. But most importantly, because cloud data warehouses are closed and proprietary systems, they force you to lock your data in, so you cannot easily move it around. So today, the result of all that is that most organizations first store all their data in blob stores and then move subsets of it into data warehouses.
So then the thinking was that potentially Data Lakes could be the answer to all these problems. Data Lakes came around about 10 years ago, and they were great because they could indeed handle all your data, and they were therefore good for data science and machine learning use cases. And regardless of use case, a Data Lake serves as a great starting point for most enterprises today. Unfortunately, they are not able to support the data warehousing and BI use cases. Data Lakes are actually more complex to set up than a data warehouse. A warehouse supports a lot of familiar semantics like ACID transactions; with Data Lakes, you are just dealing with files. So those abstractions and nice things are not provided for you. You really have to build them yourself, and they are complex to set up. And even after you do set it up, the performance is not great.
You are just dealing with files. In most cases, customers end up with lots of small files, and even the simplest query will require you to list all the files. And then lastly, when it comes to reliability, they are not great either. We now actually have a lot more data in the lake than ever in the warehouse, but is the data reliable? Can I actually guarantee that the schema is going to stay the same? How easy is it for an analyst to merge a bunch of different schemas together? Not very easy. And as a result of all these problems, they have turned into these unreliable data swamps where you have all the data, but it’s very difficult to make any sense of it. So understandably, in the absence of a better alternative, what we are seeing with most organizations is a strategy of coexistence.
We have two paradigms coexisting today: keep the most recent data in the data warehouse and the rest of the data in the Data Lake. But ETL data processing and streaming, for which Spark is widely considered the industry standard, have to be done outside the data warehouse, and the data in the data warehouse also has to be unloaded for ML models to train on. And what do we do about the predictions then? It’s precisely to answer all these challenges that Databricks has introduced the world’s first Lakehouse, one platform to unify all your data, analytics, and AI. So let’s do a deep dive on all the components of a Lakehouse and how it addresses the challenges we just talked about. So what do we need for our Lakehouse? We need a transaction layer, so you know that when you write data, it either fully succeeds or fully fails, and things stay consistent.
So we need some transaction layer on top of the files. That structured transaction layer is Delta Lake, with ACID semantics. To support the different types of use cases, we need it to be fast, really, really fast, and it needs to handle a lot of data writes, rewrites, and reads. So that’s the Delta engine, which is a high-performance query engine that Databricks has created in order to support different types of use cases, whether it be SQL analytics or data science or ETL or BI reporting or streaming; all of that runs on top of the engine to make it really, really fast.
So let me introduce Delta Lake at a high level to you first, and then we’ll do a deep dive into its capabilities. Delta Lake is an open, reliable, performant, and secure data storage and management layer for your Data Lake that enables you to create a true single source of truth. Since it’s built upon Apache Spark, we are able to build high-performance data pipelines to clean your data from raw ingestion to business-level aggregates. And given the open format, it allows you to avoid unnecessary data duplication and proprietary lock-in. Ultimately, it provides the reliability, performance, and security you need to solve your downstream data use cases. So now let’s do a deep dive on Delta Lake and its capabilities.
Delta Lake ensures reliability with ACID transactions. Delta applies an all-or-nothing ACID transaction approach to guarantee that any operation you do on your Data Lake either fully succeeds or gets aborted so it can be rerun. Because of this, data teams can now run multiple streams of data that modify their Data Lake concurrently, which cannot be done with regular Data Lakes. Because of the ability to work in parallel and spend less time fixing errors, customers have seen a 50% faster time to insight with Delta Lake. The next capability I would like to talk about is schema enforcement. Delta Lake uses schema validation on write, which means that all new writes to a table are checked for compatibility with the target table’s schema. If the schema is not compatible, Delta Lake cancels the transaction altogether, which means that no data is written, and raises an exception to let the users know about the mismatch.
And now we recently launched Auto Loader, which is a set-it-and-forget-it way to load data into your Delta Lake. Auto Loader can not only enforce schema, but it can also evolve the schema and alert you about the changes, all automatically. The third key capability is unified batch and streaming. Delta is able to handle both batch and streaming data, with direct integration with Spark Structured Streaming for low-latency updates, including the ability to concurrently write batch and streaming data to the same table. Not only does this result in a simpler system architecture, it also results in a shorter time from data ingestion to query results. Another key capability of Delta Lake is time travel. Delta Lake provides snapshots of data, enabling developers to access and refer to earlier versions of data for audits, rollbacks, or to reproduce experiments. Delta Lake supports Scala, Java, and SQL APIs to merge, update, and delete datasets.
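To make that concrete, here is a minimal sketch of an Auto Loader stream feeding a Delta table; it assumes a Databricks runtime where the cloudFiles source is available, and the paths and table name are purely illustrative:

    # Incrementally ingest newly arriving JSON files into a Delta table.
    # Assumes a Databricks cluster; all paths and names below are placeholders.
    (spark.readStream
        .format("cloudFiles")                                        # Auto Loader source
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/tmp/schemas/loans")   # enables schema inference/evolution
        .load("/tmp/landing/loans/")
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/tmp/checkpoints/loans")
        .option("mergeSchema", "true")                               # let the table schema evolve
        .toTable("loans_bronze"))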
Performance matters. There are two aspects to performance: how fast you read and write data, and how well you organize the data in the storage layer. If you do these two things well, you can be assured that your data analytics platform will be really, really performant. So let’s first talk about how Delta organizes the data in the storage layer optimally. We have a lot of capabilities, such as auto optimize, optimized writes, and auto compaction, but the whole goal of these features is to automatically compact small files into fewer larger files, either synchronously or asynchronously. And this really helps when those files are being read for queries.
Then there’s Z-ordering. Delta Lake on Databricks automatically structures data files along multiple dimensions for fast query performance by using the Z-order algorithm to co-locate related data. What that enables is better data skipping. Delta Lake also maintains file statistics for better data skipping, so that only the data subsets relevant to the query are read instead of the entire table. This partition pruning avoids unnecessary processing of the data that is not relevant to the query. So that’s about how well the data is organized.
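As a rough illustration of those layout features on Databricks, a sketch along these lines could be used; the table and column names here are made up for the example:

    # Turn on optimized writes and auto compaction via Databricks table properties.
    spark.sql("""
      ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
      )
    """)

    # Compact existing small files and co-locate related data for better data skipping.
    spark.sql("OPTIMIZE events ZORDER BY (event_type, device_id)")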
Now let’s talk about how fast the data is read and written. The core of that is the Delta engine. On Databricks, compute clusters are powered by the Delta engine. The Delta engine has three key pieces: Photon, the query optimizer, and the caching layer. Photon is a native vectorized engine, fully compatible with Apache Spark, built to accelerate all structured and semi-structured workloads by more than 20x over what Spark 2.4 can achieve. The second piece is the query optimizer. The query optimizer extends Spark’s cost-based optimizer and adaptive query execution with advanced statistics to provide up to 18x faster query performance for data warehousing workloads when you compare it to Spark 2.0. And the third piece is the cache. The Delta engine automatically caches I/O data and transcodes it into a more CPU-efficient format to take advantage of NVMe SSDs, providing up to 5x faster performance for table scans than Spark 3.0. It also includes a second cache for query results to instantly provide results for any subsequent runs. This improves performance for repeated queries, like dashboards, where the underlying tables are not changed frequently.
The third thing I would like to talk about is security and compliance at scale. Delta Lake reduces risk by enabling you to quickly and accurately update data in your Data Lake to comply with regulations like GDPR and maintain better data governance through audit logging. It enables practitioners to securely experiment with reliable data and act on insights faster, in accordance with governance and regulatory standards. Because Delta Lake does not require you to move your data into proprietary systems, you are able to maintain your organization’s security standards with Delta Lake. With Databricks’ extensive ecosystem of partners, customers can enable a variety of security and governance functionality based on their individual needs.
One key feature I would like to talk about here is time travel. Delta automatically versions the big data that you store in your Data Lake and enables you to access any historical version of that data. This temporal data management simplifies your data pipelines by making it easy to audit, roll back data in case of accidental bad writes or deletes, and reproduce experiments and reports. Your organization can finally standardize on a clean, centralized, and versioned big data repository in your own cloud storage for your analytics. Then there’s open and agile. Delta Lake is an open format that works with open source ecosystems, avoiding vendor lock-in and enabling an entire community and ecosystem of tools.
All data in Delta Lake is stored in an open Parquet format, allowing data to be read by any compatible reader. Developers can use Delta Lake with their existing data pipelines with minimal changes, and it’s fully compatible with Apache Spark, the most commonly used big data processing engine. Delta Lake supports SQL DML out of the box to enable customers to migrate SQL workloads to Delta simply and easily. SQL engines that already integrate with Delta Lake range from Apache Hive, Spark, Redshift Spectrum, Azure Synapse, Presto, AWS Athena, and Snowflake to Starburst Enterprise Presto.
What are customers using Delta Lake for? We have seen our customers leverage Delta Lake for a number of use cases, primary among them: improving ETL pipelines, unifying batch and streaming, BI on their Data Lake, and meeting regulatory needs. On improving ETL pipelines, simplifying data pipelines through streamlined development, improved data reliability, and cloud-scale production operations are some of the benefits that our customers have achieved. With direct integration to Apache Spark Structured Streaming, our customers can now run both batch and streaming operations on one architecture. And with the very fast performance of Delta Lake with the Delta engine and SQL Analytics, it enables customers to directly run analytics on top of their Delta Lake with the most complete and fresh set of data. And of course, customers are able to meet compliance standards like GDPR and CCPA and keep a record of historical data changes.
So who are these customers? Databricks today has half of the Fortune 500 as customers, and this is just a small sample of the thousands of customers that you see in every industry. In healthcare and life sciences, we have customers like AstraZeneca and Amgen. In manufacturing, we have customers like Daimler. In media and entertainment, Riot Games, and today we’ll have Comcast talking about their use case. So customers all across industry sectors are adopting Delta Lake. On Databricks alone, Delta Lake has been deployed by more than 3,000 customers in production, becoming an indispensable pillar in data and AI architectures. On the Databricks platform, more than 5.5 exabytes of data is processed weekly, and 75% of the data scanned on the Databricks platform is on Delta Lake.
So let’s talk about a use case from just one of these customers and see how exactly they are using Delta Lake. Barbara will later be talking about the Comcast use case, so I would love to talk about Starbucks and their use case. Starbucks wanted to do demand forecasting and personalize the experience when you go on the Starbucks app. Their architecture struggled to handle petabytes of data ingested for downstream ML and analytics, and they needed a scalable platform to support multiple use cases across the organization. With Azure Databricks and Delta Lake, their data engineers are able to build pipelines that support both batch and real-time workloads on the same platform.
This has enabled their data science teams to blend various data sets to train new models that improve the customer experience. Most importantly, data processing performance has improved dramatically, allowing them to deploy environments and deliver insights in minutes. So they have been able to enable data and AI at scale, ensuring the democratization of data, creating a single source of truth, which is their Data Lake, and easing the data teams into collaborating with each other. What was the result? Today, Starbucks has petabyte-scale data pipelines, they are doing 5,200x faster data processing, and it takes less than 15 minutes to deploy ML models.
Our innovation velocity has increased manifold in the last year or so. We are investing very heavily in making Delta Lake the state-of-the-art solution for Data Lakes. Just recently, we launched features such as the ability to read and write semi-structured data without the need to flatten nested files, change data feed, and schema inference and evolution, and we’ll keep on innovating and bringing new features to market. So let me summarize the key benefits of Delta Lake: improve analytics and data science by scaling data insights through your organization, by enabling teams to collaborate and ensuring they are working on reliable data, and by improving decision-making speed; reduce infrastructure and maintenance cost with the best price performance; and most importantly, enable a multi-cloud, secure analytics platform built on an open format. Ultimately, all these benefits make Delta Lake the perfect foundation for your Lakehouse. With that, let me hand it off to Brenner to show you some of these capabilities. Brenner, the stage is all yours.

Brenner Heintz: All right. Thank you, Himanshu. So as you’ve heard, Delta Lake is the foundation of a Lakehouse architecture. It’s an open source storage layer for data lakes that brings ACID transactions to Apache Spark and big data workloads. And as we’re about to see, it offers unified batch and streaming, schema enforcement and evolution, time travel, and many other features that make it really easy to manage your Lakehouse. So let’s go ahead and jump right in. But before we do, I just want to make a quick plug for the Delta Lake cheat sheet. You can download it at dbricks.co/cheatsheat, and it’s a simple, double-sided one-pager that includes all of the most common commands that you’re likely to use with Delta Lake. Several of us have helped put this together, and I think you’ll find it to be a great quick reference guide and very helpful as you continue to learn and use Delta Lake more and more.
So let’s go ahead and do just that. First, we want to know how to convert your data to Delta Lake format. When we create a table, instead of saying parquet, we would simply say Delta as part of the CREATE TABLE command. However, as of Databricks Runtime 8.0, we no longer have to say the USING DELTA part; Delta Lake is now the default format on Databricks when you’re creating tables. So it’s really easy to use, you no longer even have to specify it, and it just makes it really simple to get started and get all of the performance benefits of Delta Lake right away. So let’s look at actually getting some data into Delta Lake format. With Python, what we can do is read our data into a Spark DataFrame and then write it out in Delta Lake format.
And we can save that as a managed table called loans_delta that we’re going to use throughout this demo. Next, we could also use SQL with the CREATE TABLE syntax, and in this case, we are pulling from some parquet files. The third way to get started with Delta Lake is to use the CONVERT TO DELTA command and to then point that command to where your existing parquet files are located. So this is really helpful if you already have big, huge data warehouses full of parquet files and you simply want to convert them to Delta Lake format; the CONVERT TO DELTA command makes that easy. So now that our data is in Delta Lake format, there were initially 14,705 records in this table, but I’ve already kicked off a couple of streaming reads and writes to our table that are all happening concurrently.
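A rough sketch of those three approaches, assuming an existing DataFrame df and an illustrative parquet path (the names and paths are placeholders, not the exact ones from the demo):

    # 1) Python: write a DataFrame out as a managed Delta table.
    df.write.format("delta").mode("overwrite").saveAsTable("loans_delta")

    # 2) SQL: create a Delta table from existing parquet files.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS loans_delta
      USING DELTA
      AS SELECT * FROM parquet.`/tmp/loans_parquet/`
    """)

    # 3) Convert a parquet directory to Delta format in place.
    spark.sql("CONVERT TO DELTA parquet.`/tmp/loans_parquet/`")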
And so now we have over 277,000 records in our table. And as you’ll see this chart here is showing what’s happening with our table. So we’ve created two brand new streams that are both streaming into our Delta Lake table at once, stream A and stream B. And those go along with the initial 14,705 records that we started with in our table. We’re showing you this very same table and this next chart here, but showing you what’s happening over time. Again, those initial 14,705 batch records that we started with were the very first part of our table. But then as soon as we started streaming, we’ve been streaming 500 records per second per stream, ever since then. And so, as you can see, as time goes on, more and more records are getting upended to our table. And finally, in addition to the streaming queries that we’re running on our data set up here, we can also run simple batch queries just for good measure.
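Hypothetically, a pair of concurrent operations like the ones in the demo might look roughly like this; the rate source, column names, and paths are stand-ins, not the demo’s actual code:

    # Stream synthetic records into the Delta table at 500 rows per second.
    # The columns produced must match the table's schema; names here are illustrative.
    (spark.readStream.format("rate").option("rowsPerSecond", 500).load()
        .selectExpr("value AS loan_id", "CAST(rand() * 10000 AS INT) AS funded_amnt")
        .writeStream.format("delta")
        .option("checkpointLocation", "/tmp/checkpoints/stream_a")
        .toTable("loans_delta"))

    # Concurrently, read the same table as a stream (display() is Databricks-specific).
    display(spark.readStream.table("loans_delta"))

    # ...and run a plain batch query against it at the same time.
    display(spark.sql("SELECT COUNT(*) FROM loans_delta"))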
So all of this is really to show you that Delta Lake is able to handle batch and streaming and is able to essentially unify batch and streaming in a single table. And that eliminates a lot of the complexity that comes along with integrating batch and streaming data, where in the past you would need to build a Lambda architecture. With Delta Lake, all of it simply works out of the box. So I’m going to go ahead and stop all of our streams here, and next, I want to talk about ACID transactions. So none of this would be possible, the ability to read and write in multiple streams to and from your Delta Lake tables all at once, none of this would be possible without ACID transactions, and the ACID transactions are made possible by the transaction log. The transaction log provides a central ledger where essentially all of the changes that you make to your Delta Lake tables over time are recorded in this one central place.
So as you can see, even though we’ve only been streaming some data into our table for just a few minutes here, we’ve already got several hundred versions of our table, because each new streaming update has been recorded in the transaction log. So the transaction log is what makes all of this tick. You should learn more about the transaction log and read more about it, but essentially, you can access that transaction log at any time by running the DESCRIBE HISTORY command. And this is what makes all of this really powerful. And at Databricks, we found that many of our customers are able to simplify and streamline their overall data architectures by using Delta Lake. They no longer need to have separate batch and streaming architectures; they can unify those together, and using a simple multi-hop data pipeline, you can reliably transform raw batch and streaming data into high-quality structured data that multiple downstream apps and users can all query at once.
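A quick sketch of inspecting the transaction log, assuming the loans_delta table from earlier (the DeltaTable import comes from the delta-spark package, which Databricks ships by default):

    # Show the commit history of the table: one row per version, with timestamps,
    # operations, and operation metrics.
    display(spark.sql("DESCRIBE HISTORY loans_delta"))

    # The same information is available programmatically.
    from delta.tables import DeltaTable
    dt = DeltaTable.forName(spark, "loans_delta")
    display(dt.history(10))   # last 10 commits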
So Delta Lake really is the foundation of the Lakehouse; using it will help you simplify your overall architecture, and it’s very powerful. Delta Lake does a lot more than just unify batch and streaming data, though. It also offers tools like schema enforcement and evolution to help protect the quality of your tables, and this is really important. So let’s take a look at an example here. We’re going to generate some new data here that has a brand new column, the credit score column, which is not currently a part of our Delta Lake table. If we try and write this data to our table with that mismatching schema, that brand new column that isn’t part of our Delta Lake table, Delta Lake stops us from doing that. And this is a good thing, because Delta Lake is protecting the quality and integrity of our table by insisting that the data that we’re writing to that table matches what we expect.
So that was schema enforcement. On the flip side, in the event that you actually do want to change your Delta Lake table’s schema to match the schema of an incoming dataset, you can add the mergeSchema option to your Spark command. That essentially tells Delta Lake to go ahead and merge those two schemas, and it allows that write to now occur successfully. So after adding that option and writing our new data to our table, when we query that table, you can see that the two values that we wanted to add are now present in our table.
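As a sketch, assuming a DataFrame new_loans that carries the extra credit_score column (the DataFrame name is made up for the example):

    # Without the mergeSchema option, this append would fail schema enforcement,
    # because credit_score is not in the table's schema. With it, the table
    # schema evolves to include the new column.
    (new_loans.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .saveAsTable("loans_delta"))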
Next, another feature I want to show you is Delta Lake time travel. And personally, I think this is one of the most exciting parts of Delta Lake, because there are a lot of creative things that you can do with it. First, let me show you how it works. We’re going to run the DESCRIBE HISTORY command again, and as I mentioned earlier, this shows us the transaction log that Delta Lake uses to determine the state of our table at any given point in time. As part of this transaction log, you’ll see there’s a version and a timestamp associated with each different version of our table. Using this information, we can then recreate the state of our table at any given point in time by using time travel and the VERSION AS OF syntax.
So for example, what we’re going to do here is use time travel to select and view the original version of our table, version zero. We’re already on version 162, but we’re going to query our table as though it were just the original 14,705 records. So when we run this command, you’ll see that, in fact, because we’ve used time travel to query the very first version of our table, none of the streaming data has yet entered this table; those records were all in future versions. We in fact got the correct 14,705 initial records that we expected.
And you’ll also notice that that credit score column we added using schema evolution in the last step is also not present, because in the very first version of our table, that column did not exist. So that’s just one way you can use time travel to do creative things. It’s great if you accidentally add or delete certain records that you didn’t mean to; it adds a little bit of error-proofing to your pipelines. And it also allows you to do things like run the RESTORE command. The RESTORE command allows us to restore a table back to exactly the same state that it was in at a previous version. So what we can do is run this command, the RESTORE command, to restore our table as of version zero.
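In SQL, the time-travel query and the restore look roughly like this; the version numbers follow the demo, and the timestamp form is shown with a placeholder date:

    # Query the table as it was at version 0 (time travel by version).
    display(spark.sql("SELECT COUNT(*) FROM loans_delta VERSION AS OF 0"))

    # Time travel by timestamp is also possible; the date here is just a placeholder.
    display(spark.sql("SELECT * FROM loans_delta TIMESTAMP AS OF '2021-05-26'"))

    # Roll the live table back to its initial state.
    spark.sql("RESTORE TABLE loans_delta TO VERSION AS OF 0")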
And when we do so, we’ve essentially replaced the table that we had with the initial version of our table. So as you can see, the 14,705 records are now the only records present in our table. This RESTORE command comes in really, really handy in cases where you have deleted records or made some mistake, and I’ve found myself using it more frequently than perhaps I’d like to admit. Next, I want to talk about DML commands. Delta Lake offers full support for DML commands like DELETE, UPDATE, and MERGE INTO. First, let’s show you how DELETE works on Delta Lake. We’re going to use loan ID 4420 for the next part of the example. So here’s the user whose data we’re looking at here, and if we want to delete this user’s data from our table, all we have to do with Delta Lake is use a single command.
That’s DELETE FROM our table, loans_delta, WHERE the loan ID equals 4420. So when we run that command and then query our table for that record, of course, we no longer see any results, because that user’s data has been successfully deleted, transactionally. And it’s important to note that these DML commands are not available on plain vanilla parquet; they’re only available on Delta Lake, and they make it really, really easy to do the most common operations that you’re likely to run across when you’re building data pipelines or creating features for machine learning and whatnot. Next, I want to show you how we can use time travel to actually put our user back into our table, if you want to, using the INSERT command. So when we run this command, and by the way, you can see metrics on the operation that has just occurred, we insert that user back into our table, and then when we query our table, in fact, we do see that our user is now back in the table.
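A minimal sketch of the delete and the time-travel re-insert, with loan_id used as an assumption about the table’s column name:

    # Transactionally delete one user's record.
    spark.sql("DELETE FROM loans_delta WHERE loan_id = 4420")

    # Bring it back by selecting it out of an earlier version of the table.
    spark.sql("""
      INSERT INTO loans_delta
      SELECT * FROM loans_delta VERSION AS OF 0
      WHERE loan_id = 4420
    """)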
And then finally, we can also update our table as well. If we want to set this user’s funded amount equal to 50,000, we use this simple command you see here, and then again, when we query our table for that user’s data, it shows that, in fact, the funded amount has successfully been updated. Finally, there’s the MERGE INTO command, and this is really, really useful for upserts. With plain vanilla parquet, it’s very, very difficult to do these upserts; it involves a lot of intermediate steps that we won’t cover here, but with Delta Lake, it’s really easy. So to demonstrate this, first I’m going to create this tiny little data frame here with just two records: the user whose data we’ve been working with so far, number 4420, as well as a brand new record.
So essentially, what we’re looking to do here is: since this user already exists in our table, we actually just want to update their data, but since this other user is not currently in our table, we want to insert their information into the table. So it’s a combination of an update and an insert, known as an upsert, as many of you are probably well aware. And when we run this command, we can specify what to do when there is a match with existing records in the table and what to do when there’s not a match. In this case, if there is a match, we just want to update that user’s information, and if there’s not a match, we want to insert their information. So after it runs, we see that both users are now present in our table: user 4420’s information has been updated, and the new user’s information has been successfully inserted.
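Sketched out, the update and the upsert might look like this; updates_df and the temp view name are hypothetical stand-ins for that tiny two-row DataFrame, and the column names are assumptions:

    # Update a single record in place.
    spark.sql("UPDATE loans_delta SET funded_amnt = 50000 WHERE loan_id = 4420")

    # Upsert: update matching rows, insert the rest.
    updates_df.createOrReplaceTempView("updates")   # updates_df is the tiny two-row DataFrame
    spark.sql("""
      MERGE INTO loans_delta AS target
      USING updates AS source
      ON target.loan_id = source.loan_id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)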
So essentially, Delta Lake makes it really easy to perform those kinds of data manipulations, and DML is such a common language and dialect that it absolutely simplifies things for data engineers and data scientists. Finally, I’m going to go through just a few quick performance optimizations that make Delta Lake really fast on Databricks. First, there’s the VACUUM command. The VACUUM command deletes, or marks for deletion, all the files that are no longer needed in the current version of our table. So that comes in really handy: if you’re doing a lot of transactions over time, you’ll accumulate a lot of different files within your Delta Lake table, some of which are no longer needed. The VACUUM command simply takes those files, marks them for deletion, and ultimately saves you money on cloud costs.
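For reference, a sketch of running it (the explicit retention window shown is the default seven days, expressed in hours):

    # Remove files no longer referenced by the current table version.
    spark.sql("VACUUM loans_delta")                    # uses the default retention period
    spark.sql("VACUUM loans_delta RETAIN 168 HOURS")   # or state the retention explicitly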
The next optimization, which is only available on Databricks Delta Lake at this time, is the CACHE command. In the event that you are running a lot of queries on the same dataset, or running the same query with just slight modifications, the CACHE command will simply cache the results in memory so you don’t have as many lookups, and your results are delivered much faster than they could be otherwise. So this comes in really handy when you’re running a lot of repeated queries. And then finally, there’s the Z-order OPTIMIZE command, which essentially co-locates related data in your dataset based upon the Z-ordering algorithm. And what that means for you is faster queries, faster reads and writes; it speeds up the entire process. So I hope you’ve enjoyed this demo of Delta Lake. Join the community, join the Slack channel and our public mailing list, and check out the actual code on GitHub. And with that, I’m going to turn it over to Barbara from Comcast to tell us a little bit more about how they’re using Databricks there. Thank you so much.
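For completeness, a sketch of those last two Databricks commands from the demo; the Z-order column is an assumption about which field gets filtered on most:

    # Cache the table's data for faster repeated reads (Databricks-specific).
    spark.sql("CACHE SELECT * FROM loans_delta")

    # Compact files and co-locate related data by a frequently filtered column.
    spark.sql("OPTIMIZE loans_delta ZORDER BY (loan_id)")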

Barbara Eckman: Hi, everybody, I’m really glad to be here. Hope you’re all doing well. I’m here to talk about hybrid cloud access control in a self-service compute environment here at Comcast. I want to just really briefly mention that Comcast takes very seriously its commitment to our customers to protect their data. I’m part of what we call the Comcast Data Experience big data group, and big data in this case means not only public cloud but also on-prem data. So we have a heterogeneous data set, which offers some challenges, and challenges are fun, right? Our vision is that data is treated as an enterprise asset. This is not a new idea, but it’s an important one. And our mission is to power the Comcast enterprise through self-service platforms, data discovery, lineage, stewardship, governance, engineering services, all those important things that enable people to really use the data in important ways. And we know, as many do, that the most powerful business insights come from models that integrate data that spans silos: insights for improving the customer experience as well as business value.
So what does this mean for the business? There are some examples. Basically, this is based on the tons of telemetry data that we capture from sensors in Comcast’s network. We capture things like latency, traffic, signal-to-noise ratio, downstream and upstream error rates, and other things that I don’t even know what they mean. But this enables us to do things that improve the customer experience, like planning the network topology: if there’s a region that has a ton of traffic, we might change the topology to support that. Another is minimizing truck rolls. Truck rolls are what we call it when the Comcast cable guy or gal comes to your house.
And in these COVID times, we really would like to minimize that even more. If we can analyze the data ahead of time, we can perhaps make adjustments, or suggest adjustments that the user can make, to minimize the need for people to come to their house. We can monitor, predict problems, and remedy them before the user even knows, because of this data. And this involves both the telemetry data and integrating it with other data across the enterprise, and then optimizing network performance for a region or for the whole household.
Now this is really important stuff, and it really helps the customers, and we’re working to make this even more prevalent. So, what makes your life hard? This is a professional statement; if you want to talk about what makes your life hard personally, we can do that later. But what makes your life hard as a data professional? People usually say, “I need to find the data.” So if I’m going to be integrating data across silos, I need to find it; I know where it is in my silo, but maybe not elsewhere. The way we do that is metadata search and discovery, which we do through Elasticsearch. Then once I find the data that might be of interest to me, I need to understand what it means. What someone calls an account ID might not be the same account ID that you are used to calling an account ID: billing IDs, or back-office account IDs. You need to know what it means in order to be able to join it so it makes sense, as opposed to Franken-Data, monster data that isn’t really appropriately joined.
We need to know who produced it. Did it come from a set-top box? Did it come from a third party? Who touched it while it was journeying through Comcast? If it came in through Kafka or Kinesis, someone may have aggregated it, and then maybe somebody else enriched it with other data, and then it landed in a Delta Lake. The user of the data in the Delta Lake wants to know where the data came from and who added what piece. And you could see this both ways: the publisher looks at the data in the Delta Lake and says, “This looks screwy, what’s wrong with this? Who messed up my data?” Or they could say, “Wow, this is enriched really great. I want to thank that person.”
And also, someone who’s just using the data wants to know who to ask questions: What did you enrich this with? Where did that data come from? That kind of thing. And all of that is really helpful when you’re doing this integration. That’s data governance and lineage, which we do in Apache Atlas; that’s our metadata and lineage repository. And once you’ve found data and understood it, you have to be able to access it, and we do that through Apache Ranger and its extensions provided by Privacera. Once you have access to it, you need to be able to integrate it and analyze it across the enterprise. So finally, now we get to the good stuff, being able to actually get our hands on the data, and we do that with self-service compute using Databricks, and Databricks is a really powerful tool for that.
And finally, we find that we do really need ACID compliance for important operations, and we do that with Delta Lake. I can talk about this in more detail as this talk goes on, or in the question session. I’m an architect, so I have to have graphs and line diagrams. So this is a high-level view of our hybrid cloud solution. In Comcast, in our data centers, we have a Hadoop Data Lake that involves Hadoop Ranger and Apache Atlas working together. We are, as many companies are, phasing that out, but not immediately; it takes a while. We have a Teradata Enterprise Data Warehouse. Similarly, we are thinking of moving that, not necessarily to the cloud entirely, but maybe to another on-prem source, like the Object Store; we use [inaudible].
And basically, that makes this Object Store look like S3, so the Spark jobs that we write to use on S3 can also run on our on-prem data store, and that’s a big plus, of course. And also, we have a Ranger data service that helps with access control there. Up in the cloud, we use AWS, though Azure also has a big footprint in Comcast. Databricks compute is the center here. We use it to access Kinesis. Redshift, we’re just starting with that. We use Delta Lake and the S3 Object Store, and we have a Ranger plugin that the Databricks folks worked carefully with Privacera to create, so that our self-service Databricks environment can have all the init scripts and the configurations that it needs to run the access control that Privacera provides.
We also use Presto for our federated query capability. It also has a Ranger plugin, and all the tags that are applied to metadata, on which policies are built, are housed in Apache Atlas, and Ranger and Atlas sync together. That’s how Ranger knows what policies to apply to what data. And in the question session, if you want to dig deeper into any of this, I’d be very happy to do it. So this is very exciting to me; we’re just rolling this out, and it’s so elegant, and I didn’t create it, so I can say that. So Ranger and Atlas together provide declarative, policy-based access control. And as I said, Privacera extends Ranger, which originally only worked in Hadoop, to AWS through plugins and proxies.
And one of the key ones that we use, of course, is Databricks on all three of these environments. And basically, what I like about this is we really have one Ranger to rule them all, and Atlas is its little buddy, because it provides the tags that really power access control. So here’s again a diagram. We have a portal that we built for our self-service applications, and the user tags the metadata with tags like “this is PII” or “this is the video domain,” that kind of stuff. That goes into Atlas. The tag-to-metadata associations are synced with Ranger, and the policies are based on that. So who gets to see PII, who gets to see video domain data, those are synced and cached in the Ranger plugins. And then when a user calls an application, whether it’s a cloud application in Databricks or even an on-prem application, the application asks Ranger, “Does this user have the access to do what they’re asking to do on this data?”
If the answer is yes, and it’s very fast because these are plugins, they get access. If no, then they get an error message. And we can also do masking: if someone has access to many columns, but not all columns, in, say, a Glue table, we can mask out the ones that they don’t have access to and so give them the data they are allowed to see. Recently, we’ve really needed ACID compliance. Traditionally, big Data Lakes are write once, read many. We have things streaming in from set-top boxes; in the cable world, that’s not transactional data, and that’s what we’re used to. But now, increasingly, we are finding that we need to delete specific records from our parquet files or whatever. We can do this in Spark, but it isn’t terribly performant.
It certainly can be done, but it turns out Delta Lake does it much better. The deletes are much more performant, and you get to view snapshots of past Data Lake states, which is really pretty awesome. So we’re really moving toward, I love this word, a Lakehouse: being able to do write once, read many, and ACID all in one place. And that is largely thanks to Databricks. So this is me; please reach out to me via email if you wish, and I’ll be happy to answer questions in the live session if you have any. So thank you very much for listening.

Himanshu Raja: Thank you so much, Barbara. That was amazing. And with that, we wrap up our session. Thanks, everyone, for being here, and we will now take questions via the chat.

Himanshu Raja

Himanshu Raja is a Sr. Manager of Product at Databricks helping customers build open, scalable, and performant analytics systems. Himanshu holds an MBA from NYU Stern School of Business and MS in Phot...

Brenner Heintz

Brenner Heintz is a cloud data engineer and Technical PMM at Databricks, where he’s helped to accelerate adoption of Databricks and Delta Lake through initiatives like the Databricks Demo Hub and th...

Barbara Eckman

Barbara Eckman is a Senior Principal Software Architect in Customer Experience Technologies at Comcast. She is the Lead Architect for Enterprise Data Discovery and Lineage, with a particular focus on ...