Phar Data Platform: From the Lakehouse Paradigm to the Reality

May 28, 2021 10:30 AM (PT)


Despite the increased availability of ready-to-use generic tools, more and more enterprises are deciding to build in-house data platforms. This practice, common for some time in research labs and digital-native companies, is now making waves across large enterprises that traditionally used proprietary solutions and outsourced most of their IT. The availability of large volumes of data, coupled with increasingly complex analytical use cases driven by innovations in data science, has rendered these traditional, on-premise architectures obsolete in favor of cloud architectures powered by open source technologies.

The idea of building an in-house platform at a larger enterprise comes with many challenges of its own: building an architecture that combines the best elements of data lakes and data warehouses to accommodate all kinds of use cases, from BI to ML; the need to interoperate with all the company’s data and technology, including legacy systems; and a cultural transformation, including a commitment to adopt agile processes and data-driven approaches.

This presentation describes a success story of building a Lakehouse in an enterprise such as LIDL, a successful chain of grocery stores operating in 32 countries worldwide. We will dive into the cloud-based architecture for batch and streaming workloads fed by many different source systems of the enterprise, and how we applied security to the architecture and the data. We will detail the creation of a curated Data Lake comprising several layers, from a raw ingestion layer up to a layer that presents cleansed and enriched data to the business units as a kind of Data Marketplace.

A lot of focus and effort went into building a semantic Data Lake as a sustainable and easy-to-use basis for the Lakehouse, as opposed to just dumping source data into it. The first use case applied to the Lakehouse is the Lidl Plus loyalty program. It is already deployed to production in 26 countries, with data from more than 30 million customers being analyzed on a daily basis. In parallel to productionizing the Lakehouse, a cultural and organizational change process was undertaken to get all involved units to buy into the new data-driven approach.

In this session watch:
Marc Planagumà, Head of Data Engineering, SCRM – LIDL International Hub

 

Transcript

Marc Planagumà: Hi, my name is Marc Planagumà, and I’m the head of data engineering at the LIDL SCRM International Hub. I’m here to explain the Phar data platform architecture as an implementation of a Lakehouse, and also of other architectural patterns. So let’s see what is inside this solution. Phar, at the end, is a success story of building a data platform that could be considered a Lakehouse inside a big enterprise, the Schwarz Group, which is better known as the owner of the Lidl supermarkets.
Well, this is important for the context, because LIDL and the Schwarz Group are the third retailer in the world, just behind Walmart and Amazon, and the first in Europe, so it’s a really big retailer. It’s not a digital-native company yet, but it’s true that it found a way to build this project in an environment as digital-native as possible.
For this reason, we can share with you the context in which this Phar data platform was built. It started as a spin-off called SCRM, completely owned by LIDL, but split from the organization and the structure of LIDL, providing a greenfield to build an organization and products in the most digital-native way. It started around 2016 in Barcelona as a split from LIDL, and in 2019 it became part of the organization of the Schwarz Group, becoming the LIDL International Hub based in Barcelona.
With this context, the goal of this company was to build the loyalty program of LIDL, called Lidl Plus. It is the loyalty product of LIDL and was fully designed from the beginning as a digital concept. At the end, like all loyalty programs, it aims at data exploitation, crossing the user data world with the product data world, trying to de-anonymize the purchases: not having anonymous purchases, but purchases related to a user, mixing both worlds, the user data world and the product data world. So this hub has the role of building this loyalty program, but also the role of exploring and extracting insights from this data.
Well, this is the context and this is the role, and we used this greenfield to build our own data organization and our own data team as the main enabler to build our own data platform. Then let’s see the challenge that we were trying to solve with Phar, looking at the challenges and the goals. First of all, we were facing the same challenges and problems as many companies that have a lot of data and also have a lot of teams that want to create products on top of that data.
These teams start getting data from the sources, from the transactional systems, in some cases even doing the analytics inside the operational systems, and this can create a lot of problems, as you know. We were trying to solve all these situations, because these vertical integrations, with each team getting the data on its own from the sources, meant a big number of integrations to maintain; data consistency was impossible to guarantee; data quality was something really hard to achieve and was repeated in each project. There was no monitoring of what was going on anywhere. Centralized data access and data management were not possible, even sharing data between teams was a nightmare, and the time to market for the projects was really slow.
We were trying to solve all these problems with this Phar concept, but not only by creating a platform, not only by creating technology. Our idea of Phar as a product is that it is a platform, but it is a data market at the same time, working and playing together. The final goal to achieve is this analytics democratization for the different teams. At the end that means we want a lot of different departments and internal teams in the Schwarz Group to be able to explore and create a lot of products autonomously from each other.
Our main way to achieve this democratization was to build a platform and a market that provide these ingestion capabilities, ingestion that scales for batch processing and stream processing; cross services like security, access, monitoring, logging, governance and lineage; and mainly these workspaces where all the teams are able to work independently, get this data and build their own data products on top of this platform.
Okay. We know exactly the challenge that we want to solve, the goals that we want to achieve and a little bit of the structure of our features, but let’s see how to bring this data market and data platform together. Let’s take the typical data journey of analytics: we have data at the input and we have analytics at the output, different kinds of analytics: BI, advanced analytics, machine learning, whatever.
To achieve that, we created the organization thinking that the first features the solution needed to provide were technology assets like different tools and frameworks, [inaudible], security and access solutions, monitoring and metrics solutions. We started working on these areas to provide solutions there, but at the same time we also focused on governance, providing features like contract modeling, data market modeling, and the data feeding itself. Because if we want to provide a platform, a technology that includes data inside, we need to do the feeding as well.
Finally, we achieved this loyalty customer data market, a 360 view of the user data and the product data in the same place, ready to be used in an analytical workspace or a BI environment, and also providing the data as an output. These were the goals and the vision of the platform, but we can also zoom in on the details of the data market. This data market, in our case, is divided into layers, and each layer is a specific market.
We have the first layer, which is called Raw; it will appear a lot of times during this presentation. This is a kind of raw market. This layer is not accessible, it is just for certification, and we are storing the data as it arrives, to keep a copy of everything and be able to reproduce everything that arrives on the platform. Then the first layer that is accessible by the analytical teams is L1. This is our primary market. It is event-based; we have a big collection of events there. The users are able to consume, as a market, all the data that is in the platform.
Then we have the last layer of the market, the Ln layer, which is the domain market and is based on use cases. It is split into different spaces for each team, and each team is able to get the original data from L1, create their own models and share them with others. So in terms of market, the people using Phar have access to all the data: all the primary market in L1, and a lot of domain markets generated by the different teams using the platform.
Well, now we know the concepts we are following to build a platform and a market at the same time, but this session is about architecture, and we will show the architecture behind Phar: the contributions of the different architectural patterns that exist in the market and that we are using and adopting in our platform.
Let’s start with the layered architecture. The layered architecture is a famous architecture used mainly on the operational side, in applications, but it has some application in our analytical platform. In the layered architecture we have the presentation, domain and data stages. We map these onto our data hub, data market and analytics workspace stages. Let’s see in more detail what these stages mean in our proposal.
Well, we have again the journey of the data from the source to the product, and we are moving this data through these different pieces inside the Phar product. First the data hub, a stage where the data arrives, is extracted from the source and made actionable. Then it is promoted to the data market, where the data is made available to the final users. And we also provide a place to work with this data, where teams can create their own models and their own products.
As a platform, across these stages we are providing cross features like architecture, frameworks and tooling, security and monitoring for all the stages. But there are also features that are concentrated in some of the stages. We are working on governance focused on the data hub and data market parts, because at the end the data hub is where the data comes in and, as we will define later, this data is defined by contracts.
The data market is offering contracts to the consumers as well. So we need data governance to make this data actionable in the data hub and the data market. Then we have another feature, which is shareability. On the data market and the analytics workspace, we need to work to make the job of the data analytics teams feasible and friendly. That means we need to make the whole market interoperable between them, and we need to provide a scalable platform where they can process the data in the analytics workspace.
Now we can zoom in on another approach. This approach comes from Databricks: it is the Delta Lake approach, famous for its layers, bronze, silver and gold. It is also focused on using Delta Lake. We are using the Delta format in some parts of the platform, not all of them, but it is true that we have an equivalence with these layers: bronze is our Raw, silver is our L1 and gold is our Ln. Let’s see the detail.
Well, for us the journey of the data starts from the source; it is ingested into Raw, which is our bronze; it is integrated into L1; it is processed in Ln; and it is consumed by the product. Then for us, Raw is providing this certification, L1 is providing the market where the data is available, but with its raw value, and Ln is providing this work area where people can prepare and put the data closer to the business, closer to the use case. But we can also see these layers the other way around, as layers in circles, because at the end each layer contains the other ones.
Let’s see the differences then. For us, we have this ingestion at the beginning, and we have this first layer, which is Raw. Raw is based on events, and we store all these events following a specific contract. We have a contract for each input, and we are storing these events in a raw format, as the contracts define how this data will arrive. Then we are promoting this data to L1. L1 is a public market and is classified by entities. It provides a concept introduced by Phar: the semantic lake. We are classifying the data that arrives in Raw into different entities depending on the semantic meaning of the data, making them actionable with each other but without touching the values. Nothing is aggregated; everything is always classified, ready to be consumed.
Then we have Ln, which is a working space available for the consumers to create their own models. It is a way to build specific models for specific products: BI models, Delta Lake models, whatever the teams want. Well, after seeing these layers, we can also go to another architectural approach, the event-driven architecture, from which we are taking some concepts, but only for a part of the platform, not for everything. It is a really core part of the platform, though, the one at the beginning: the data hub.
The idea of the event-driven architecture is that everything is based on events that arrive in an event store, and subscribers consume these events from this event store or broker to feed different kinds of platforms. That is also our idea for the data hub part. We have the sources, batch sources and real-time sources, that are sending messages and objects to our data hub, which contains a couple of landing zones: the batch landing zone and the real-time landing zone. But the input is controlled by contracts.
This means that each message or piece of data that arrives on the platform is defined by a contract with lifecycle management and version control, and we are able to check whether the message fulfils the schema or not. This works for both sides: for real time with single events, and for batch with files that contain a collection of events inside. Then, after this data hub, we are able to feed our data market, but we are also able to feed other platforms interested in consuming this data just with the raw value from the beginning.
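To make the contract idea concrete, here is a minimal sketch of schema validation for a single incoming event at the landing zone. The contract shape, field names and the jsonschema-based check are illustrative assumptions, not Phar’s actual implementation.

```python
# Minimal sketch of contract-based validation at a data hub landing zone.
# The contract identifier, fields and schema shape are hypothetical.
from jsonschema import Draft7Validator

# A versioned contract for one input data type (illustrative only).
USER_TRACKING_V1 = {
    "type": "object",
    "required": ["event_id", "event_time", "user_id", "action"],
    "properties": {
        "event_id": {"type": "string"},
        "event_time": {"type": "string", "format": "date-time"},
        "user_id": {"type": "string"},
        "action": {"type": "string"},
    },
    "additionalProperties": False,
}

validator = Draft7Validator(USER_TRACKING_V1)

def validate_event(event: dict) -> list:
    """Return the list of contract violations; an empty list means the
    event can be promoted to the Raw store, otherwise it is rejected."""
    return [error.message for error in validator.iter_errors(event)]

if __name__ == "__main__":
    ok = {"event_id": "e-1", "event_time": "2021-05-28T10:30:00Z",
          "user_id": "u-42", "action": "coupon_activated"}
    bad = {"event_id": "e-2", "user_id": "u-43", "action": "scan"}
    print(validate_event(ok))   # [] -> accepted
    print(validate_event(bad))  # ["'event_time' is a required property"]
```

The same check applies to both sides: per message on the real-time landing zone, and per record inside a file on the batch landing zone.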
Well, let’s put this concept into our layers as well; then one more layer appears, this landing zone, and we can do the journey again. We have this ingestion from the source that arrives in our data hub. This forces the data to arrive event-based, okay? Fulfilling specific contracts that we defined. Then it is validated: with this contract we are able to validate the data and store these events in Raw, promote them and do the integration to build the data market, and then prepare and make the data actionable for the different teams in the Ln and top layers.
Another architecture from which we are taking concepts is the Lambda architecture. The Lambda architecture is a famous architecture from [inaudible] that combines batch analytics with streaming analytics, with exactly the same capabilities from the beginning. We are building a batch framework and a streaming framework. This means that these layers, landing zone, Raw, L1 and Ln, exist in both worlds.
We have a streaming landing zone, Raw streaming, L1 streaming and Ln streaming, and the same for batch: a batch landing zone, Raw batch, L1 batch and Ln batch. The idea is to reproduce the same stages, but with technologies that are specific to each framework, and also to provide a way to combine them. And this is what happens in our workspace. This workspace is provided by Databricks, and at the end it is a really nice feature that brings a lot of value to us, because in this Databricks workspace we have connected the batch framework and the streaming framework as well.
For streaming, Raw and L1 are Kafka clusters with Kafka topics. And for batch, we have the distributed file system on Azure Data Lake for Raw and L1. This means that, having these connected in the same workspace, the teams are able to consume the Kafka topics and the files of the entities from Azure Data Lake in the same workspace, and are able to use Delta Lake, for instance, to cross this batch data market with this streaming data market.
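As a rough illustration of crossing the two markets in one workspace, the following PySpark sketch enriches a streaming L1 Kafka topic with a batch L1 entity stored on Azure Data Lake and lands the result as a Delta table. Topic names, paths, columns and broker addresses are all hypothetical.

```python
# Minimal sketch (PySpark on a Databricks-like workspace) of combining the
# streaming market (Kafka topics) with the batch market (files on Azure
# Data Lake). All names and locations below are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cross-batch-streaming").getOrCreate()

# Batch side: an L1 entity stored as ORC on the data lake (hypothetical path).
users = spark.read.orc("abfss://l1@examplelake.dfs.core.windows.net/entities/users")

# Streaming side: an L1 Kafka topic with raw tracking events (hypothetical).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "l1.user_tracking")
    .load()
    .select(
        F.col("key").cast("string").alias("user_id"),
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp"),
    )
)

# Enrich the stream with the batch entity (stream-static join) and land the
# result as a Delta table in an Ln (domain) area, getting ACID writes.
enriched = events.join(users, on="user_id", how="left")

query = (
    enriched.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/ln/loyalty/_checkpoints/enriched")
    .outputMode("append")
    .start("/mnt/ln/loyalty/enriched_events")
)
query.awaitTermination()
```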
So we have these layers, taking concepts from different places, and we have the capability to do batch and streaming at the same time, but we are also following another approach, this multi-cloud native approach. Why? Well, because our platform is fully built on the cloud. It is mainly built on Azure, because in the beginning that was the only cloud available to us, but then Google Cloud became available as well. And right now, to give an example here, we can see our architecture in another way. In the bottom line there is the data hub with the landing zones. In the middle there is the data market with the batch and streaming layers, done with Azure Data Lake and Kafka.
Databricks is in all the stages to transform data. We have the [inaudible] store, which is also in Azure, but we also have our scheduler, which is Google Cloud Composer, as you know, a managed Airflow service from Google. This means that our platform is built with cloud services, but in this case with services from different clouds. We have the scheduling coming from Google, and the storage and the computation, with Databricks, provided by Azure, all put together and really connected to provide only one solution.
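A minimal sketch of what this multi-cloud scheduling could look like: an Airflow DAG, as it would run on Google Cloud Composer, triggering a Databricks run on Azure via the Databricks provider. The connection id, cluster spec, DAG name and notebook path are hypothetical.

```python
# Sketch of a Composer (managed Airflow) DAG that triggers a job on an
# Azure Databricks workspace. Names and settings below are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksSubmitRunOperator,
)

with DAG(
    dag_id="phar_l1_promotion",            # hypothetical DAG name
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    promote_raw_to_l1 = DatabricksSubmitRunOperator(
        task_id="promote_raw_to_l1",
        databricks_conn_id="databricks_azure",   # points at the Azure workspace
        json={
            "new_cluster": {
                "spark_version": "8.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 4,
            },
            "notebook_task": {
                "notebook_path": "/Repos/phar/promote_raw_to_l1"
            },
        },
    )
```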
Let’s go then to the final part, because we take different concepts from different classical patterns, but we are also taking something from the Lakehouse. The question to ask is whether Phar is a Lakehouse, because Phar started to be built and designed before this pattern appeared, before it was made public by Databricks. But well, the ideas were really close, and it is true that after this release we also took some features from Databricks that helped us adapt and get closer to the Lakehouse.
So we have this idea of mixing data warehousing and the data lake, but let’s put the concepts and stages of Phar on top of this diagram. We have the data hub at the bottom, actually getting the data from the sources, a lot of data, but in our case defined by contracts. We have the data market preparing the data and making it actionable and curated for a lot of consumers. And we have these workspaces that are already used by different teams, like BI teams, data science teams and analytics teams, that are already in place working and creating their own products on top of this market.
Let’s see more detail, because at the end the Lakehouse fulfils some key features, and let’s see if Phar fulfils what this Lakehouse concept is doing. First, we have the support for different data types. We are supporting all the data types that can be defined in a contract, everything; structured and unstructured data can arrive at the data hub. We have the storage decoupled from compute, because we are working on the cloud and we have the different services, the storage and the computation, on Databricks or the data lake.
We have schema enforcement and governance. We have validation at the beginning, we have a contract at the output, and we have a full quality check in the middle, where we are putting a lot of effort and providing a lot of tools for this governance. We have openness because our market is not based on [inaudible] or Delta, as the Lakehouse approach defines for the silver layer; we are working with ORC on L1, but we are putting the data, interoperable, in one open format. So we are achieving this feature as well.
We have end-to-end streaming; this means that we are providing these streaming features. Transactional support is available using Delta. All the dashboards of the loyalty program are built on top of the Phar platform; this means that we are providing BI support, and we also have some other diverse workloads like recommender systems, forecasting tools and fraud detection on top of the platform.
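As an illustration of that transactional support, here is a minimal Delta Lake upsert (MERGE) into a hypothetical Ln table; the paths, table and join keys are made up for the example.

```python
# Minimal sketch of an ACID upsert into a Delta table, as could be used
# for an Ln/domain model. Paths and column names are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ln-upsert").getOrCreate()

# Daily batch of recomputed coupon recommendations (illustrative source).
updates = spark.read.orc("/mnt/ln/loyalty/staging/coupon_recommendations")

target = DeltaTable.forPath(spark, "/mnt/ln/loyalty/coupon_recommendations")

(
    target.alias("t")
    .merge(updates.alias("u"),
           "t.user_id = u.user_id AND t.coupon_id = u.coupon_id")
    .whenMatchedUpdateAll()     # refresh existing recommendations
    .whenNotMatchedInsertAll()  # add new ones atomically
    .execute()
)
```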
Well, having seen what we take from these patterns, it is the moment to explain what we are putting on the table as something new in these platforms. The first thing is this concept of the semantic lake. It is focused on L1. As you can see, in Raw we have a data type collection, a collection of events defined by contracts, that at the end are merged into these entities.
The entities are big ORC tables, with the schema in the metastore, that have a semantic classification. In this case, you can see an example where all the events related to user tracking arrive at an entity fully dedicated to user tracking. So this semantic lake is providing a market with tables containing all the events with their raw value, but classified by domains.
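A minimal sketch of how such a semantic-lake entity could be built: several contract-defined event types from Raw are unioned, without changing their values, into one user-tracking entity stored as ORC and registered in the metastore. Event-type names, paths and the database name are assumptions.

```python
# Sketch of semantically classifying Raw event types into one L1 entity.
# All locations and names below are illustrative, not Phar's real layout.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Raw stores one dataset per contract / data type (hypothetical paths).
RAW_EVENT_TYPES = {
    "app_screen_view": "/mnt/raw/app_screen_view",
    "coupon_click": "/mnt/raw/coupon_click",
    "store_checkin": "/mnt/raw/store_checkin",
}

# All of these share the same semantic meaning ("user tracking"), so they
# are classified into a single entity without touching the raw values.
frames = [
    spark.read.orc(path).withColumn("event_type", F.lit(name))
    for name, path in RAW_EVENT_TYPES.items()
]
user_tracking = frames[0]
for df in frames[1:]:
    user_tracking = user_tracking.unionByName(df, allowMissingColumns=True)

spark.sql("CREATE DATABASE IF NOT EXISTS l1")
(
    user_tracking.write.mode("overwrite")
    .format("orc")
    .saveAsTable("l1.user_tracking")  # entity exposed through the metastore
)
```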
Then the data contracts. We have data contracts at the input and we have data contracts at the output. The data contracts at the input are the data types, these events defined by contract and with version control. Then the entities, these tables, also have their own contract and version control, and the metastore is the contract that we also have in the data catalog and that we provide as a stable market to the users. This is another new concept in our platform.
At the end, another concept is these workspaces with the multi-tenant metastore. This is an extension of the metastore that we did internally in the team, in order to achieve that all of these workspaces are able to read the L1 market, but are also able to decide if they want to make public, private or shared the schemas and the models that they are building in the Ln layers.
By doing this extension in Hive, for the data and the schemas that appear in Databricks, each team is able to decide if they want to make them public, readable for everybody; private, only for the specific team; or shared, so that they will be readable for the specific teams they want to share them with.
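The multi-tenant metastore extension itself is internal to the team, so its exact mechanism is not public; as a rough approximation of the same public/private/shared behaviour, standard SQL object privileges (available on Databricks when table access control is enabled) could express the same split. Schema, table and group names below are hypothetical.

```python
# Approximation of public / private / shared visibility using SQL grants.
# This is NOT the internal Hive extension described in the talk, just a
# sketch of equivalent behaviour with standard object privileges.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Public: everyone can read the primary market.
spark.sql("GRANT SELECT ON SCHEMA l1 TO `all_analytics_teams`")

# Private: a team keeps its domain schema to itself by default.
spark.sql("GRANT ALL PRIVILEGES ON SCHEMA ln_loyalty TO `loyalty_team`")

# Shared: one model is exposed to a single other team on demand.
spark.sql("GRANT SELECT ON TABLE ln_loyalty.coupon_redemptions TO `fraud_team`")
```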
Finally, some numbers and some lessons learned. First, the numbers: Phar has the data from the specific countries where the loyalty program is working, the data from more than 30 million users who are using Lidl Plus in Europe right now and for whom it will be available in the States as well in the near future; it has close to 700 terabytes of event-based, actionable data on the market; and right now more than 20 teams are building analytical products on it, like recommender systems, fraud detection, machine learning to create automatic coupons, whatever.
Another point is how we achieved it. We achieved this not only with the architecture, but also with the teams building it, and these teams are mainly focused on three things: the data platform team, the data management team and the machine learning team. These three teams at the end are part of the big family that is building Phar, and we have these overlaps that help us a lot to build this platform.
Between the data platform team and the data management team, there is the governance that we are doing together. Between the data platform team and the machine learning team, we are working to make the platform more actionable and to make the platform and the tools it provides better and bigger. And between the machine learning and data management teams, there is the overlap of trying to bring the data closer to being an asset, to being a product.
So these three teams at the end are the core of building this Phar data platform, a platform that is also available to a lot of other teams. Each of them has their own goals, but all together they are creating the platform. And finally, there is the insight about how we became data driven, and here the data contract part has helped us a lot, because as I said before, Phar is a platform that requires a data contract at the beginning.
This means that the analytics effort starts at the source. When we need to do this contract, when we need to analyze the data that we want to extract, that is when this data way of thinking starts. And this was accelerated a lot when the producers and the consumers started to join efforts, because when a data user wants to get some insights from their own product, they are really interested in feeding the platform in the best way.
At the end, this feeding, starting by analyzing and defining the analytical objects from the beginning, is helping a lot on one side because the data in the platform is cleaner, but on the other side it is also helping a lot to achieve this data-driven thinking.
That’s all. Thank you so much. Here you have my contact details, and thank you for attending this presentation.

Marc Planagumà

Marc Planagumà is Head of Data Engineering at SCRM – LIDL Digital Hub. He is a senior data engineer and an expert in building and managing highly scalable platforms and teams for advanced analytics.