How are customers building enterprise data lakes on AWS with Databricks? Learn how Databricks complements the AWS data lake strategy and how HP has succeeded in transforming business with this approach.
Thank you so much for joining us today. My name is Brian Dirking, I am Senior Director of Partner Marketing at Databricks, and I’d like to introduce our speakers.
First, we have Denis Dubeau of Databricks. He is Manager of AWS Partner Solutions Architects. Second, we have Igor Alekseev who is with AWS and he is the Partner SA for Data and Analytics. And third, we have Sally Hoppe who is the Big Data System Architect at HP. Okay, so I will go ahead and hand this over to Igor. Go ahead, Igor, when you’re ready. – [Igor] Thank you Brian. And before talking about data lakes on AWS, I’d like to discuss first, what are the critical drivers for modern data architectures?
First of all, it’s data volume. Data is growing at an unprecedented rate, and it’s both human-generated and machine-generated data. At the same time, data is increasing in its variety. You have logs, you have semi-structured data, you have your tabular data. At the same time, the complexity of use cases is increasing.
Data is accessed in many different ways. At the same time, customers are expecting data to be available in one central location with no silos.
From the customer’s perspective, they want more data. Data is growing exponentially because data scientists want bigger data sets and analysts want more variety of data. Data now comes from many different sources. It’s increasingly diverse. And it’s used by many different people, many roles. You’ll hear data scientist, you’ll hear data engineer, data analyst. At the same time, customers want to analyze data using many different applications.
So what is a data lake and how can it help us? A data lake is an architecture that allows you to store massive amounts of data in a central location. It’s readily available for analysis and processing, and it can be consumed by diverse groups of people.
Now today, data lakes are providing a major data source for analytics and machine learning. Data can come from on-premises systems, and it can also arrive as real-time ingestion into the data lake.
So what can a data lake architecture enable you to do? With a data lake, you can store structured, semi-structured, and unstructured data, as I mentioned. You can run analytics on the same data without data movement; the data stays in the same place. That helps with scaling, because you don’t need to move the data, and you don’t need to clog your networks. You should be able to scale storage and compute independently. Brian talked about how you used to size your Hadoop cluster based on the amount of storage you needed. But when you separate storage and compute, you can handle use cases where you store a lot of data but run only small analytics workloads on it, or where you store small amounts of data and run very sophisticated, highly computational loads on it. Schema can be defined during analysis, meaning you don’t need to impose a schema on ingest. You do schema on read, which allows the schema to evolve. It allows a more flexible approach to the data.
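The schema-on-read idea can be sketched conceptually in plain Python (this is a toy illustration, not the actual Spark API; the record shape and field names are invented for the example):

```python
import json

# Raw events land in the lake as-is; no schema is imposed at ingest.
raw_records = [
    '{"device": "printer-01", "pages": 12, "ts": "2021-03-01"}',
    '{"device": "printer-02", "pages": 7, "ts": "2021-03-01", "toner_pct": 44}',
]

def read_with_schema(records, fields):
    """Apply a schema at read time: project only the fields the
    analysis needs, tolerating extra or missing keys."""
    return [{f: json.loads(r).get(f) for f in fields} for r in records]

# Today's analysis only needs device and pages...
v1 = read_with_schema(raw_records, ["device", "pages"])
# ...and the schema can evolve later without re-ingesting anything.
v2 = read_with_schema(raw_records, ["device", "pages", "toner_pct"])
print(v1[0])                # {'device': 'printer-01', 'pages': 12}
print(v2[1]["toner_pct"])   # 44
```

The point is that the same raw bytes serve both projections; no rewrite of the ingested data was required when the second field was added.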
Data in a data lake should have unmatched durability and availability at scale. In the next slide, I’m going to talk about one service from AWS that gives you that capability. And data should be secure and compliant, and it should have audit capabilities. That’s what you would expect in your data lake.
If you build a data lake on AWS, you should expect it to be open and comprehensive. You should expect to use open formats. You should expect your data to be secure, and users who are accessing the data should be properly authorized. It should be scalable and durable, and it should be cost effective.
And the one service I wanna highlight that allows you to achieve that is S3. It’s secure, highly scalable, and durable object storage with millisecond latency. You can store the data in any format you want. Here are some examples: you can store CSV or ORC, you can store Parquet, you can store images, you can store videos.
You can store sensor data coming from IoT. You can store web logs, which can be semi-structured documents. All these types of data are possible to be stored in S3.
S3 itself was built with 11 nines of durability. S3 also supports three different options for encryption, which addresses the security concern: SSE-S3, SSE-C, and SSE with KMS.
You will be able to run workloads on data lakes built on AWS on S3. You can run analytics on it. You can run machine learning workloads on the data lake without moving the data. You can classify, report, and visualize the data that you have on S3.
Another important aspect of having data on S3 is, how do you move the data? And AWS offers a wide range of services. For moving data from on-premises data centers, you can use AWS Direct Connect over a dedicated network. You can also use Snowball and Snowmobile. These are offline options: physical devices that you attach to your network, load with data, and then transport to AWS data centers. There’s the Database Migration Service, which helps you connect to your on-premises database and move it in an organized manner. There’s Storage Gateway. There are also real-time mechanisms to move the data, such as IoT Core, Kinesis Data Firehose, Kinesis Data Streams, Kinesis Video Streams, and managed Kafka.
Important things to consider when moving data from on-premises: you want to have dedicated network connections, and these appliances are secure.
The Snowball actually, it’s a rugged shipping container.
And the Gateway allows you to write data directly to the cloud with some local caching.
AWS offers the most comprehensive and open services for your data analytics. So if you think about the layers, at the base level you have migration, and we’ve talked about some of those services. Then you have security management and the data catalog. At a higher level, we have analytics services; for example, you have Redshift for data warehousing.
And then moving higher, you have dashboards such as QuickSight. And if you think of Databricks, Databricks fits into this picture as a crosscutting concern, crossing all of these layers: you can ingest your data using Databricks, you can process it using Databricks, and you can visualize it using Databricks.
– [Brian] Great, thank you so much, Igor. Okay, and then next, I’d like to hand it over to Denis. – [Denis] Hi, my name is Denis Dubeau and I’m a Partner SA Manager at Databricks. In this section, I’ll cover the role and the challenges of the data lake, but more importantly, how Databricks can help your organization solve many of the pitfalls and issues with large-scale data lakes.
So let’s first summarize what a data lake is. We’ve covered some of that ground already, but fundamentally, it’s a file system that supports a wide variety of data types, velocities, and data volumes. The data can vary from transactional in nature, which is usually referred to as structured, to a combination of images, videos, speech, web logs, and IoT data. Basically, any type of data can be ingested and stored easily. Data lakes also provide an open format, which means you can choose and take advantage of the most commonly adopted file formats like Parquet, JSON, CSV, and text, and access that data via multiple applications or services from a central location. You’re also able to separate your storage from compute, so that you can store all your data in a single location and only provision the necessary compute as needed to process your workload. What it really means is that you use just enough, and just-in-time, computing resources for the job at hand. And finally, you can scale your storage resources to meet the demands of your organization without upfront investment, with the data durability and the lower cost of cloud storage.
Now, of course, every organization has a desire and a need to operationalize their data. What we found is that this often requires sophisticated features that you would expect from a relational database management system. So they’ll want features like ACID transactions, where a transaction either succeeds completely or fails with automatic recovery capabilities. The ability to take point-in-time snapshots and create optimized indexes for fast query access. All while keeping the benefits of a data lake, like having the flexibility of schema on read, or enforcing your schema on write immediately as you’re creating your table. Also combining and simplifying the reliability of your streaming and your batch processing, while keeping an open format and no vendor lock-in.
Now data lakes are great, but they also have a number of challenges and complexities. First, it’s hard to append data, and modifying existing data is even more challenging. You also have to handle job failures; depending on the processing framework you use, it’s very complex to manage job failures, and even more so to restart them, which eventually leads to data quality issues in a lot of cases. There are many performance-related challenges with cloud storage; even with the expanded offering options like solid state drives or GPU instances, it’s still a challenge to have performant access to your data. And most fine-grained access control mechanisms are difficult to set up and manage as well.
So there are a number of challenges, and this is just the tip of the iceberg.
So at Databricks, we’ve developed a new standard for building data lakes: Delta Lake. By the way, Delta Lake is an open-source project hosted by the Linux Foundation, and it’s available at delta.io; you can find more information on the website directly. Delta Lake uses the Parquet open data storage format along with a transaction log to provide reliability and performance for your data lake, while being fully compatible with the Apache Spark APIs. So let’s quickly unpack what I just said here. Delta really comprises two pieces: a versioned set of Parquet files sitting in your S3 buckets, and, as you perform modifications to your data, we keep versions of these Parquet files as well as a transaction log that is also stored in your S3 environment.
So there are a couple of challenges, obviously, that we’ve already outlined in data lakes. But let’s cover some of the reliability features that we provide as part of our open-source Delta Lake. First, we have a transaction log that keeps track of every operation performed on a file: every insert, update, delete and (mumbles). So the writes are serialized and your reads are consistent, which means that Delta will not see or read uncommitted data until the transaction is successfully committed. That provides an ACID-compliant system, and it also guarantees unified batch and stream processing; those are the top features on the open-source feature list. Now, the next feature I’d like to touch on is that Delta Lake also provides schema enforcement out of the box. So as you create your Delta table, the schema is enforced on every write. There’s also a way to bypass the schema enforcement if you want, but by default, the schema you specify at table creation is enforced whenever new data is ingested. And the last feature is time travel. Because we have versioned Parquet files and we have a transaction log, you can actually query previous versions of the table. So for instance, you could re-run your report with yesterday’s data, or verify the accuracy of a model from a week ago, very easily. Now, there are significant benefits to using Databricks on top of your Delta table, and we offer additional performance benefits, which lead to simplification benefits for your data pipeline.
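The versioning mechanics Denis describes can be sketched conceptually in plain Python. This is an illustrative toy, not the actual Delta Lake implementation; the class and method names are invented. The key idea is that a write only becomes visible once its entry lands in the transaction log, and because every committed version is retained, older versions stay queryable ("time travel"):

```python
import copy

class ToyDeltaTable:
    """Minimal sketch of Delta-style versioning: each committed write
    appends an entry to a transaction log, readers only ever see fully
    committed versions, and any earlier version can still be read."""

    def __init__(self):
        self.log = []          # transaction log: one entry per commit
        self.versions = [[]]   # version 0 is the empty table

    def commit(self, op, rows):
        snapshot = copy.deepcopy(self.versions[-1])
        if op == "insert":
            snapshot.extend(rows)
        elif op == "delete":
            snapshot = [r for r in snapshot if r not in rows]
        # The write becomes visible only once the log entry is appended.
        self.log.append({"version": len(self.versions), "op": op})
        self.versions.append(snapshot)

    def read(self, version=None):
        v = len(self.versions) - 1 if version is None else version
        return self.versions[v]

t = ToyDeltaTable()
t.commit("insert", [{"id": 1}, {"id": 2}])
t.commit("delete", [{"id": 1}])
print(t.read())           # [{'id': 2}]  -- latest version
print(t.read(version=1))  # [{'id': 1}, {'id': 2}]  -- "yesterday's" table
```

Re-running a report against yesterday’s data is then just a read at an earlier version number, which is the essence of the time-travel feature.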
When Databricks writes to Delta, you have the option to turn on auto compaction and optimized writes. Auto compaction operates under the covers, and it will automatically compact your small files into larger files. So for instance, if you have a streaming job that generates a lot of small files, those small files will create adverse performance issues in your Delta Lake. Typically, you would have a secondary job, depending on the framework you’re using, that goes in and compacts those small files into larger files, which is what Spark actually prefers.
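The compaction idea itself is simple and can be sketched in plain Python (a conceptual toy, not the Databricks implementation; real auto compaction works on file sizes in bytes, not row counts):

```python
# Ten tiny "files" of one row each, as a streaming job might produce.
small_files = [["row-%d" % i] for i in range(10)]
TARGET_ROWS = 4  # stand-in for a target file size

def compact(files, target):
    """Merge many small files into fewer files of roughly target size."""
    out, current = [], []
    for f in files:
        current.extend(f)
        if len(current) >= target:
            out.append(current)
            current = []
    if current:
        out.append(current)
    return out

compacted = compact(small_files, TARGET_ROWS)
print(len(compacted))  # 3 files instead of 10
```

Fewer, larger files means fewer object-store requests and less per-file overhead per query, which is why Spark prefers this layout.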
Databricks can also automatically cache your query results if you use a specific instance type. What that means is that the first time you query a table, Databricks will read the underlying file system, pull the data into the cluster, and then serve up the query. The second time you run that query, or another query that uses the same data, Databricks will use the data that’s already in the cache. Databricks maintains the consistency of the cache between your Delta Lake and the local file storage, so it keeps track of what’s in the cache and what’s not. We also provide an indexing feature, a multidimensional clustering capability which we call Z-ordering. It basically provides the ability to organize a set of columns for alternative access other than the partitioning key. And finally, data skipping provides a significant performance benefit by only reading the necessary files or partitions, based on the query predicate you provide at query time. Databricks automatically keeps track of statistics on the columns in every file, so that when you provide a predicate like, let’s say, where customer ID equals x, we’ll identify the specific files that this customer ID is in, and then we’ll only read those files to satisfy the query. So it provides an order of magnitude of performance improvement over traditional Parquet-based data sets.
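The data-skipping mechanism can be sketched in plain Python (a conceptual toy, not the actual engine; real implementations keep min/max statistics per column in the transaction log):

```python
# Per-file column statistics, collected at write time. A query with a
# predicate consults the stats first and reads only the files whose
# [min, max] range could possibly contain matching rows.
files = [
    {"name": "part-0", "min_id": 1,   "max_id": 100},
    {"name": "part-1", "min_id": 101, "max_id": 200},
    {"name": "part-2", "min_id": 201, "max_id": 300},
]

def files_to_read(files, customer_id):
    """Return only the files whose statistics say they may hold the id."""
    return [f["name"] for f in files
            if f["min_id"] <= customer_id <= f["max_id"]]

# WHERE customer_id = 150 touches one file instead of all three.
print(files_to_read(files, 150))  # ['part-1']
```

Z-ordering complements this by physically clustering rows so that these min/max ranges stay narrow for several columns at once, not just the partitioning key.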
Now that you understand the reliability and the performance features of Delta Lake, it’s super simple and easy to implement this commonly adopted industry framework, where you ingest multiple data types from your data sources into what we call a bronze layer, or raw ingested layer. You then apply filtering and cleansing criteria to produce your silver layer. As we’ve found, many organizations actually grant their power users and ad hoc users access to the silver layer itself, and also feed their ML pipelines from that layer, before producing the final refinement of the data that will be sent downstream for serving your dashboards, your analytics, or your BI work-streams. So via this refinement process, you incrementally improve the quality of your data until it’s ready for consumption by the serving endpoints. It also allows you to evolve the schema and format requirements throughout the layered framework; as you learn your data and evolve it toward your silver and gold layers, you may want to refine or enforce some schema definitions.
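The bronze/silver/gold refinement can be sketched in plain Python (a conceptual toy; the record fields and the cleansing rules are invented for illustration):

```python
# Bronze: raw ingested records, kept exactly as they arrived.
bronze = [
    {"device": "p1", "pages": "12",  "country": "us"},
    {"device": "p2", "pages": "bad", "country": "US"},   # malformed
    {"device": "p1", "pages": "3",   "country": "US"},
]

def to_silver(rows):
    """Filter and cleanse: drop malformed rows, normalize types/case."""
    out = []
    for r in rows:
        try:
            out.append({"device": r["device"],
                        "pages": int(r["pages"]),
                        "country": r["country"].upper()})
        except ValueError:
            pass  # a real pipeline would quarantine bad records instead
    return out

def to_gold(rows):
    """Aggregate the refined data for serving dashboards."""
    totals = {}
    for r in rows:
        totals[r["device"]] = totals.get(r["device"], 0) + r["pages"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'p1': 15}
```

Power users and ML pipelines would typically consume `silver`, while `gold` feeds the dashboards and BI serving endpoints, matching the layering described above.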
So if we put all this together, now that we understand what the data lake is, some of the challenges, and how Delta actually allows us to manage those reliability and performance characteristics, it’s important to understand how Databricks fits into the AWS ecosystem. Now, there are a couple of key points to take away here. Databricks is deployed, and all the clusters that Databricks provisions are deployed, in the customer’s VPC. So they’re deployed within the boundaries of a VPC that you maintain. We also interact directly with the S3 layer. So the data remains where it is, and as the data lake gets supplemented, we write back to that same data lake layer. So you have full control of your data for its entire lifetime.
And as you can see, we also integrate with a number of first-party services from the data and AI landscape. We have a full set of optimized connectors with first-party integrations in the AWS ecosystem. For instance, if you’re reading this diagram from left to right, as you’re consuming and ingesting data, we have specific connectors to Kinesis. We have an optimized way of reading from and writing back to S3, as well as a number of connectors to connect directly with Glue as your enterprise data catalog. We can feed or serve data to your Redshift layer, and Athena has the ability, via Glue, to connect to your data lake. There’s also MLflow, which is another open-source project, with a very simple way to serve a model that you would build and test in Databricks and then (mumbles) to SageMaker (mumbles) for real-time serving options. So there are a number of different connectors and optimized ways to ingest and serve data throughout the ecosystem, while maintaining and residing within the customer’s VPC, and leveraging and complementing your Delta Lake as your data grows, without moving that data outside into a different file format.
There are a number of additional services that we also integrate with, services you would expect, such as Identity and Access Management, CloudFormation, and CloudTrail. These are all management and governance capabilities within the AWS ecosystem. There’s also full integration with SSO; Databricks is actually one of the application tiles now available in AWS SSO, which offers simple administration and deployment. And there are a number of other services that we integrate with as well, from secrets capabilities to Step Functions capabilities. So there are many services that we have great integration with, but this just gives you a sense and a taste of some of the services that we support.
– [Brian] Awesome, thank you Denis. Okay, and then next, I’d like to hand it over to Sally. – [Sally] Hi. I wanna thank Databricks for inviting me to speak. It’s been great working with Databricks and with AWS, so it’s pretty neat to be able to share the work we’ve been doing here at HP: how we have transitioned from an on-premises solution to a solution using AWS and Databricks, and then how we’re further evolving our solution to go to the next level. So go ahead and go to the next one.
My name’s Sally Hoppe, I’m a master architect in the Print Big Data organization for HP. We work with home, office, and industrial printing. Other areas I’ve worked on in the past are OpenStack, telepresence, seismic sensing, and specialty printing. And way back when I was doing oceanography research, I worked with Big Data on really big machines, and that was fun. Next slide.
So at HP, we’re getting telemetry data from our printers, and we want to find out more about our customers and use the data to try to build better products, understand when there are problems, and really do our analysis of that. The group that I’m a part of brings in the incoming data; we want to make sure that we can cleanse it and normalize it, take this raw data that is coming in and make it usable for our data scientists, and create dashboards that help drive our business, explain what’s going on, and provide the feedback that we need, not only to continue with the business that we have now, but to create new products. And so, a lot of the pathways that Denis and others were talking about, bringing in the raw data, cleansing it, parsing it, normalizing it, and providing that data for machine learning, analytics, and creating dashboards, that’s what we do. So I’d like to show you some of our architectures, tell you some of the problems that we’ve had, our solutions, and where we’re going next. Okay, next one.
So this is where we started. We had a legacy solution that was using Hadoop. It was on-premises, and it wasn’t scaling the way we needed it to. And so, we went to AWS and to Databricks, and we transitioned our solution from being on-premises to a cloud-based solution. We were using Hadoop, we went to Spark, and we were really able to transform a system that was taking months to process into weeks. And now we do it daily, and we’ll try to do more with streaming. We’ve just been able to accelerate when we can provide results to the business, and expand the number of platforms we’re supporting, the number of data sources, and the amount of data.
So this is an architecture slide that shows one of the architectures for the pipelines that go through our system. We have about 30 different pipelines that are based on this architecture. And one of the really nice things about it is we’ve been able to configure the pipelines so that they can address different data lakes, different repositories, as a starting point, and go through and create bronze, silver, and gold data lakes. So if we start from the left, you can see we’ve got printer telemetry coming in, and it’s a whole variety: ink jet, laser, home office, large format. They’ve got very different formats, and the formats change over time, so there are format issues and special cases that we need to handle, and all of that is coming in. And so, one of our first activities is to ingest that data. We bring it in, we use Kafka and Kinesis Firehose, and land it into a data lake. And that data lake is where our landing data starts out. This is where we have the raw data, and that’s really our bronze status: pretty raw, archival data. And then we have a first stage of processing it. One of the reasons we have this first stage of raw processing is because our data is coming in many different formats, and our goal is to normalize it in the end, so you can do machine learning and analytics on it. And so this is why we have a multi-step process: the first step is we transform everything into a similar structure, so that when we start parsing the data and pulling the pieces out of our semi-structured data, we have it in a consistent manner and can use Spark effectively.
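That first normalization stage can be sketched in plain Python. This is purely illustrative; the format tags and field names below are invented, not HP’s actual telemetry schemas:

```python
def normalize(record):
    """Map heterogeneous telemetry shapes into one common structure,
    so every downstream stage (parsing, ML, analytics) can treat all
    records uniformly."""
    if record.get("fmt") == "inkjet_v1":
        return {"device": record["dev"],
                "event": record["evt"],
                "ts": record["time"]}
    if record.get("fmt") == "laser_v2":
        return {"device": record["printer_id"],
                "event": record["event_name"],
                "ts": record["timestamp"]}
    return None  # unknown format: route to an error path for review

raw = [
    {"fmt": "inkjet_v1", "dev": "p1", "evt": "page_printed", "time": "t1"},
    {"fmt": "laser_v2", "printer_id": "p2", "event_name": "jam",
     "timestamp": "t2"},
]
normalized = [normalize(r) for r in raw]
print(normalized[0])  # {'device': 'p1', 'event': 'page_printed', 'ts': 't1'}
```

Once every product line’s records share one shape, a single Spark job can parse and aggregate them all, which is the consistency the multi-step process is after.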
So let me talk a little bit about some of the evolution. That was our batch processing, and now we’re moving more into streaming. You can see that we’ve got our data coming in, and we’re using Kafka. One of the things we’re doing, which is really interesting to me, is that as the data is coming in, we’re updating our dashboards in Databricks. We are seeing the data that is coming in as new products come online, and that helps us profile it: identifying what type of data we are getting, whether there are errors, and being able to respond more quickly to the information we’re getting, even before it starts going through a pipeline. Then, we’re also looking at different ways to work with our IoT data, so we’re looking at aggregation of events, creating snapshots in time, and providing history, and that’s really fleshing out the data products that we’re providing. We talked about data catalogs, and that’s a critical part of our portfolio: our data products go into a data catalog that we access. We do use Glue and have those Delta Lakes available with Athena, as well as the Parquet data lakes. We bring our data up into Redshift using Spectrum, as well as doing a direct load. And so, we really are looking at that full landscape, end-to-end, of being able to not just process the data and have all of our data quality checks in there, but also to have that data findable and re-usable by multiple groups. One of the key things that we discovered as we were making data products is that people were making the same data products over and over again, because they weren’t aware that an existing data product was already available. And so by having a data catalog, and providing the appropriate levels of permissions and security, we’ve been able to scale by not having people create the same data products, but instead being able to use the same data products in the data lakes, use similar dashboards, and be more self-service.
And so that’s been a really great transition for us, is to be able to democratize and make our data sets more published and available.
– [Brian] Awesome, okay. Thank you so much to all three of you for the session today. It looks like we are out of time. So if you have any questions by all means, please go visit the AWS booth.
Igor Alekseev is a Partner Solution Architect at AWS in the Data and Analytics domain. In his role, Igor works with strategic partners, helping them build complex, AWS-optimized architectures. Prior to joining AWS, as a Data/Solution Architect he implemented many projects in the Big Data domain, including several data lakes in the Hadoop ecosystem. As a Data Engineer, he was involved in applying AI/ML to fraud detection and office automation. Igor's projects spanned a variety of industries including communications, finance, public safety, manufacturing, and healthcare. Earlier, Igor worked as a full stack engineer/tech lead.
Sally Hoppe is a Big Data System Architect at HP. With a background in math and computer science, she is a versatile software engineering professional with experience developing enterprise software solutions and managing cross-functional teams. While working for a large corporation, she has sought out opportunities in new businesses to learn new technologies and work with passionate co-workers. Because she likes to make order out of chaos, she frequently finds herself in positions that require both deep technical knowledge and management skills.
Denis Dubeau is a Partner Solution Architect providing guidance and enablement on modernizing data lake strategies using Databricks on AWS. Denis is a seasoned professional with significant industry experience in Data Engineering and Data Warehousing with previous stops at Greenplum, Hortonworks, IBM and AtScale.