Architect’s Open-Source Guide for a Data Mesh Architecture

May 27, 2021 03:15 PM (PT)


Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?

In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with the implementation of Data Mesh systems and focus on the role open-source projects play in it. Projects like Apache Spark can play a key part in the standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.

The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.

This session is targeted at architects, decision-makers, data engineers, and system designers.

In this session watch:
Lena Hall, Director, Microsoft


Transcript

Lena Hall: Hello, and welcome to my session. I am delighted to share my experience and offer insights to other architects and system designers on what’s next in data architecture and where Data Mesh comes into play. Data Mesh is an intriguing organizational and architectural concept. It’s valuable not only for its new approach to rethinking how we structure teams and organizations and build a data architecture. I truly believe that it will ultimately lead to an important shift towards better standards and practices in our businesses, along with improved interoperability of modern data tools. Even if the Data Mesh concept isn’t the right path for your particular architecture, you will still learn why the idea is curious and good for the further development of practices, standards, and even products in the cloud and data architecture space. My name is Lena Hall. I’m a Principal Technologist at Microsoft, and I previously worked at Microsoft Research. I come from an engineering and architectural background. I’ve been in the industry for more than a decade, creating architectural solutions and leading teams in the areas of cloud, data and analytics, distributed systems, machine learning, and scalable computing.
So Data Mesh is a concept introduced by Zhamak Dehghani. There are two exceptional must-read articles by her on the Martin Fowler blog; I would definitely recommend you go through them. There is also a vendor-neutral community on Slack with many real-world questions, excellent for knowledge sharing.
So, Data Mesh advocates for the benefits enabled by decentralized data ownership and domain-focused data products, working on top of a self-serve shared data infrastructure and following global governance and standards. So let’s see what challenges it’s trying to solve, and why and when it makes sense to consider it. First of all, it’s not exactly accurate to directly compare Data Mesh to concepts like Data Warehouse, Data Lake, or Data Lakehouse that are currently widely used in the data architecture of many businesses. I think of Data Mesh as more of an organizational and architectural paradigm rather than a technical architecture paradigm. The ideas that Data Mesh carries lie in offering decentralization: how the organization is structured and where the data ownership lies. Whereas I see concepts like Data Warehouse, Data Lake, and Data Lakehouse as more technical architecture concepts, and each of them can still be applied and be useful as part of a centralized or decentralized data architecture.
And very importantly, Data Mesh is not for everyone. If your company already has a well-functioning data architecture based on a monolithic Data Lake, Delta Lake, or Data Warehouse, or even a combination of these, and you’re happy with everything, you probably don’t need a Data Mesh right now. Most small and medium-sized companies can get away with a monolithic, centralized data architecture. Of course, there are always exceptions. So let’s look at what challenges in an organization might be a good indicator that the decentralization concept and Data Mesh might make sense. Let’s say we’re in charge of a drone delivery service. Our organization consists of many different departments, such as drone device management, logistics, online services, customer support, innovation and research management, and likely many more. The company’s drone delivery business is expanding, with a large and quickly growing number of areas, data sources, and datasets. There is a central data lake.
That data lake is the main destination for data collected from drone devices, online services, and operational data. Most of the teams work with the data lake and also write to the data lake as a result of their own functionality and processing pipelines. There are also a couple of data warehouses used by the logistics, marketing, and management teams, created as a more structured view of some parts of the data from the data lake to make it more friendly to end consumers from these departments. Often, when a change is introduced, for example the data format is changed or fields are added, it can add friction or delays for the teams that work with the data, because each of them would have to incorporate the necessary changes in their respective data pipelines. And when there is an issue with the data in the main data lake, the organization is often waiting on the data engineering team to resolve it.
Often it’s hard to communicate across the org because data engineering teams are overloaded and siloed, with very little domain knowledge. So to summarize, there are several common indicators here. Data is everyone’s, but it’s no one’s at the same time. That creates uncertainty of data ownership, which leads to lower data quality, creating more work for the various teams that use the data. The pace of change from an idea to production is further slowed in a monolithic system and generally affects the many teams tied to certain data in a centralized data lake or storage. And when the business grows and needs faster change, the data engineering team becomes the team that everybody relies on and depends on, and that can result in major delays.
So when an organization runs into these types of issues and the monolithic data architecture fails to keep up with the need for new growth or speed of advancement, then looking into decentralized approaches like Data Mesh might be a good idea. Now that the drone delivery company has realized that Data Mesh can help with its challenges, let’s explore the core ideas that compose the Data Mesh concept in the context of this company. Just to recap, Data Mesh is a concept of decentralized data ownership and domain-focused data products, working on top of a self-serve shared data infrastructure and following global governance and standards. So let’s explore what this means.
Decentralized data ownership is different from centralized data ownership. In the centralized drone delivery company, there was one large system accessed by a variety of components, creating issues with data quality, data reliability, and lack of clarity. With decentralized data ownership, the company can bring data ownership to domain teams, combining the resources of data engineers and subject matter experts in a particular domain. This will allow the drone delivery company to bring responsibility to the data, and hence improve its quality and increase the value it brings. Different domain teams can still physically store data within one service on shared storage infrastructure if they want to, and they can have separate fine-grained access control mechanisms because the data is decoupled from other domain teams. The teams are still empowered to innovate and make changes at their own pace, without affecting the data usage of other consumers. They can even use different data processing and analytics services with the same data.
Now that decentralization of data ownership is clear, what are data products in the context of Data Mesh? A data product is an architectural quantum, or a Lego block, in a Data Mesh architecture. It aligns with the principles of Domain-Driven Design and product thinking, where each unit serves a certain purpose in the context of a business domain. You can think of this architectural unit as something that offers and serves valuable data as output to other consumers, in a format useful to them, as a product. A data product can accept information from external datasets, data sources, or other data products through input ports. It can offer data to consumers through one of many output ports. It can expose metadata and other metrics through ports as well, and within the scope of a data product, there can be one or more services or applications that fulfill its purpose. For example, in the drone delivery service, we can have data products like customers, routes, delivery, drones, demand prediction, and many more.
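To make the unit concrete, here is a minimal Python sketch of a data product descriptor with input ports, output ports, and a metadata port; the field names and example ports are illustrative assumptions, not a fixed Data Mesh schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Port:
    name: str        # e.g. "new-orders-stream"
    kind: str        # e.g. "kafka-topic", "sql-table", "json-file"
    schema_ref: str  # pointer to the published schema

@dataclass
class DataProduct:
    domain: str      # owning domain team, e.g. "orders"
    name: str
    input_ports: List[Port] = field(default_factory=list)
    output_ports: List[Port] = field(default_factory=list)
    metadata_port: Optional[Port] = None  # exposes metadata and metrics

# The "orders" data product from the drone delivery example: one input
# port and two output ports (a real-time stream and a weekly report).
orders = DataProduct(
    domain="orders",
    name="orders",
    input_ports=[Port("web-orders", "http-requests", "schemas/order.json")],
    output_ports=[
        Port("new-orders-stream", "kafka-topic", "schemas/order.json"),
        Port("weekly-orders-report", "sql-table", "schemas/weekly.json"),
    ],
)
```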
So to democratize the creation of data products and lower the barrier for domain teams to start creating them, Data Mesh defines the concept of a self-serve shared infrastructure platform. If we go back to the days when the internet just started to exist, and when the cloud just started to exist, we can trace the evolution of automating things and abstracting them away to make it easier for the end users. In the case of Data Mesh and self-serve infrastructure, the idea is very similar: provide a layer that abstracts and automates away the most common operational data engineering tasks and workloads. The self-serve layer would take into account the common, domain-agnostic parts of data products, what’s most common across them, and what can benefit from being automated away.
And this way the self-serve platform can not only provide a good data engineering infrastructure base, but also provide a baseline for things like security and interoperability, which brings us to another important concept: federated governance. Traditionally, data governance in an organization is a process for ensuring high data quality through the complete life cycle of data. In Data Mesh, with independent domain teams, this doesn’t mean implementing the same mechanisms for achieving data governance. Instead, it means defining global standards around data governance and offering individual teams the freedom to reach these agreed-upon standards in their own way. For example, the organization can globally agree on standards for security, portability, interoperability, data quality, reliability, and so on. This allows for flexibility in decision-making and being able to make changes autonomously, while still guaranteeing compliance and interoperability at the cross-data-product level.
Based on the Data Mesh concept, the drone delivery company decided to organize by domain teams and define data products. In reality, this type of company can have hundreds of data products, but for the purpose of this talk, we’ll focus on several. Let’s say the company now has data products with clear responsibilities, such as drones for working with drone devices, orders for processing online customer orders, and shipping for coordination between orders and drone scheduling. Here, as you can see, the orders data product offers two outputs. One of them is a real-time stream of new orders as they come in; you can use something like Kafka or a similar system for that. And another one is a weekly report of new orders, which could be, for example, a SQL table or a JSON file or anything like that.
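As an illustration, a consumer in another domain team could subscribe to the real-time orders output port with a few lines of Python. This is a minimal sketch assuming a hypothetical topic name and broker address:

```python
from kafka import KafkaConsumer  # pip install kafka-python
import json

# Subscribe to the assumed real-time output port of the orders
# data product; topic name and broker address are illustrative.
consumer = KafkaConsumer(
    "orders.new-orders.v1",
    bootstrap_servers=["broker1:9092"],
    group_id="shipping-team",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    order = message.value  # one new order event, already parsed from JSON
    print(order)
```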
You can also see the drones data product accepting data from many input ports of different types, such as a real-time stream coming directly from devices, a stream of real-time shipments from the shipping data product, or web requests from other services. And you can also see other data products, like inventory planning and routing analysis: the first analyzes and provides insights on what’s in demand and how the inventory should be updated, and the second analyzes and optimizes drone routes and maintenance. Here, their inputs are the outputs of other domain data products, and as an output they offer data to end consumers in the inventory department, the logistics department, and management.
When creating data products, it’s not enough to just define inputs, outputs, and actions according to the domain. In our example, for the drone delivery organization, the representatives of each department should also discuss and define the governance standards and go through the core properties of Data Mesh, where each data product should be discoverable, addressable, trustworthy, secure, self-describing, and interoperable.
For discoverability, each new data product in the drone delivery organization can be registered in the data catalog, sharing information on who owns the data, where the data is coming from, and other metadata. This can be done through an org-wide wiki system, or better yet, something like a combination of a wiki system and the data catalog, to unite the technical function of sharing data product metadata with a human-friendly way to discover and learn more about the data product.
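For example, the registration step could be a small API call made automatically when a data product is deployed. This sketch assumes a hypothetical catalog endpoint and payload shape; a real implementation might target Apache Atlas or a wiki-plus-catalog combination instead:

```python
import requests

# Register a new data product in a hypothetical org-wide data catalog.
entry = {
    "name": "orders",
    "owner": "orders-domain-team",
    "sources": ["online-order-service"],
    "output_ports": ["new-orders-stream", "weekly-orders-report"],
    "docs": "https://wiki.example.com/data-products/orders",
}

resp = requests.post(
    "https://catalog.example.com/api/v1/data-products",  # assumed endpoint
    json=entry,
    timeout=10,
)
resp.raise_for_status()
```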
To make sure each data product is self-describing, the representatives of each department can agree that each data product in the drone delivery org should at least offer sample datasets, share data schemas, and provide high-quality and up-to-date documentation of the functionality and the input and output ports. Another useful thing here is to place data schemas, security policies, and component creation rules close to where the data product source is located, which can definitely help make the product self-describing.
The organization representatives can also agree on a global convention that consumers can use to programmatically reach and access data products. This could, for example, mean relying on a convention for access paths that are logically structured as, let’s say, sub-directories of a shared cloud storage service like S3, Azure Data Lake, or Google Cloud Storage, or any other convention, basically. The organization can also agree on a global security approach to make sure that each data product has a defined service account or identity account in the global access control system, to provide a security boundary and effective policies with the fine-grained permissions that are required for working with resources and other data products. Data products should also have defined and monitored SLOs, Service Level Objectives, for things like availability, throughput, refresh frequency, and response time, and also provide an SLO indicating how accurate the data is compared to the events that actually happened in reality.
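As a sketch of the addressability convention mentioned above, the path layout below is an assumption for illustration, not a standard:

```python
# Build the globally addressable path for a data product output port,
# following an assumed <base>/<domain>/<product>/<port>/<version> layout.
def output_port_path(base: str, domain: str, product: str,
                     port: str, version: str = "v1") -> str:
    return f"{base}/{domain}/{product}/{port}/{version}"

# Works the same with S3, Azure Data Lake, or Google Cloud Storage bases:
path = output_port_path("s3://company-data-mesh", "orders",
                        "orders", "weekly-orders-report")
# -> "s3://company-data-mesh/orders/orders/weekly-orders-report/v1"
```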
And this can be measured as an error rate, a number of missing entries, a percentage of non-parseable data files, or anything else that can accurately tell us whether the data is trustworthy for the end users of the data product. Providing data lineage as metadata also really helps with tracing back from errors to their root cause.
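As a sketch, one such trustworthiness check could compute the share of non-parseable records in a delivered file; the file name and the 0.1% threshold are illustrative assumptions:

```python
import json

# Fraction of lines in a JSON-lines file that fail to parse.
def parse_error_rate(lines) -> float:
    total = failed = 0
    for line in lines:
        total += 1
        try:
            json.loads(line)
        except json.JSONDecodeError:
            failed += 1
    return failed / total if total else 0.0

with open("weekly-orders-report.jsonl") as f:  # assumed output file
    rate = parse_error_rate(f)

# Assumed SLO: at most 0.1% of records may be non-parseable.
assert rate <= 0.001, f"trustworthiness SLO violated: {rate:.4%}"
```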
Another important aspect to define is which approaches can be used within the organization to make sure data products are freely composable and globally interoperable, so that the output of one data product can be passed as an input to another. This can be enabled by using open standards and by providing multiple mechanisms to serve the data to consumers according to what they need. It can mean agreeing on communication standards, such as request and response formats, file format structure, and other dataset conventions, or using commonly accepted formats and supporting them across all data products. Here is a cheat sheet that you can use to guide you through the questions you need to ask yourself when thinking about data products and the core principles. So now that we have a high-level view of domains and data products, with their inputs, outputs, and relationships, we can try to understand which domain-agnostic operational functionality is common across data products and can be abstracted away into the shared infrastructure platform. To help us with this, we can zoom into which types of workloads different data products perform.
Many of them interact with streaming data or incoming web requests, then execute some type of stream processing or batch processing, and then they need to output data to some type of storage, like writing objects to a data lake, writing to SQL tables or into a data warehouse, or writing into a streaming sink. In the drone delivery organization, this logical view can translate into a technological view where each component is mapped to a particular technology, or to another technological view where all the logical components are the same but the tech choices are different. The point here is that what we want is to identify which operational components are required for data products in our organization, and understand whether we can build a self-serve shared infrastructure that is universal enough that many data products can rely on it. A self-serve data platform can consist of things like data storage, compute products, a data catalog, continuous delivery, data access across the system, and other shared products that data products can rely on. The self-serve platform can also be a combination of these things, depending on what exactly the data products need as an abstraction.
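For instance, one common, domain-agnostic workload the shared platform could standardize is "read from the lake, aggregate, write to an output port." Here is a minimal PySpark sketch; the paths and column names are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-weekly-report").getOrCreate()

# Read raw orders from an assumed data lake location.
orders = spark.read.parquet("s3://company-data-mesh/orders/orders/raw/v1")

# Aggregate new orders into weekly counts per region.
weekly = (orders
          .groupBy(F.window("order_time", "7 days"), "region")
          .agg(F.count("*").alias("order_count")))

# Write the result to the weekly-report output port of the data product.
weekly.write.mode("overwrite").parquet(
    "s3://company-data-mesh/orders/orders/weekly-orders-report/v1")
```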
With the data products defined, the data governance standards, and the self-serve shared platform, our Data Mesh architecture could look something like this at a high level. And if you looked at the top title, you’re now maybe thinking: so what are the open-source tools that we can use to implement Data Mesh? Here is another important revelation: there isn’t a set list of products, projects, services, or tools that you should use to implement your Data Mesh. Each organization is different, with distinctive goals and requirements, and you’ll see different views of which technologies they choose for each part of the Data Mesh architecture. However, we will go through modern technologies and projects used in many data architectures, as well as emerging data products and projects, and analyze which types of them are required, and which properties of these types of projects make them a better candidate in a Data Mesh architecture, or a more Data Mesh-friendly option.
So when we look at the implementation, things are not as easy as we want them to be at this point in time. Decentralization of data ownership introduces the question of the cost of data operations: whether the data needs to be moved, and how expensive it might be to pay for ingress and egress of data traveling between data products. The self-serve shared platform also introduces the question of automation and cross-cloud API operations in a multi-cloud scenario. There is also a lack of end-to-end examples of Data Mesh implementations. Many of the techniques and tools used within a centralized architecture might not be optimal for a new decentralized architecture with a shared platform, so the shift might not be easy. And there is also a risk of not taking into account all of the important Data Mesh principles and ending up with a not very well-functioning architecture. There are a lot of moving parts to take into account, both organizationally and technically.
So each organization can have quite different-looking implementations, but there are a couple of things we can keep in mind when planning and thinking about technologies. One of the most obvious considerations is to evaluate how the products, services, and projects for your data workloads support workload sharing and multi-tenancy of workloads. This is important for the ability to use these tools as part of the shared data infrastructure. Another one is: does the technology you’re looking into support no-copy or zero-copy data sharing? Does the technology support bringing compute to data located elsewhere, especially in a multi-cloud setting? This can help in being able to attach different types of compute and query engines to data, and it can enable different teams to interact with the data in their preferred ways without copying it or moving it. And this is a huge question for cost considerations. Another important one is: what is the granularity at which you can set up and manage access and permissions within your technical choice?
Do you need row-level, column-level, or cell-level access control? Can you manage it from other accounts belonging to different organizations, or from accounts across cloud providers? Another thing to consider is the capabilities the technology provides for automation and extension, which especially matters if you are planning to use it as part of a shared infrastructure platform. Do common infrastructure-as-code tools support working with this technology? And the next question to ask is: does the technology support the workload flexibility you need, especially if you’re going to rely on it as a part of the self-serve platform? For example, does it support scaling down to zero? How fast can it scale to the maximum capacity you’ll need?
Then there are also the potential limitations we run into most commonly with cloud providers or other platform providers, depending on which quotas and resource limits they have, which may affect our ability to use their technology within our self-serve platform. And the most important one: how does the technology you’re looking into support open standards, open protocols, and open-source integrations? This will ultimately dictate how interoperable your system might be. So let’s look at examples of Data Mesh-friendly technologies.
An important note here: Data Mesh is not about specific technologies. The purpose of me sharing this is to show some of the useful projects that could be used in a data architecture, grouped into certain categories based on the types of operations and workloads they enable. And it is crucial to understand the criteria and considerations for your own organization’s data products and self-serve shared infrastructure, and to use these requirements to evaluate any products or projects. For example, when you’re looking at available data governance systems, you may consider whether or not they support the functionality you’re looking for related to metadata, data lineage, scanning, schemas, and so on. For example, Apache Atlas may be one of the open-source options that could be a good candidate for the drone delivery company. [inaudible] with open formats. Some of them are simple, some of them are more advanced, and depending on the use case, here are some of the criteria you should use to evaluate your choices. For example, Apache Iceberg has been gaining popularity in modern data architectures.
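As a small illustration of the open-format point, here is a sketch of creating an Iceberg table from Spark. It assumes the Iceberg runtime is on the Spark classpath and uses an illustrative catalog name and warehouse path:

```python
from pyspark.sql import SparkSession

# Configure an Iceberg catalog named "mesh" (names and paths are assumptions).
spark = (SparkSession.builder
         .appName("iceberg-example")
         .config("spark.sql.catalog.mesh",
                 "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.mesh.type", "hadoop")
         .config("spark.sql.catalog.mesh.warehouse",
                 "s3://company-data-mesh/warehouse")
         .getOrCreate())

# The table is stored in an open format that any Iceberg-aware engine
# (Spark, Presto, Dremio, ...) can read without copying the data.
spark.sql("""
    CREATE TABLE IF NOT EXISTS mesh.orders.weekly_report (
        region STRING, order_count BIGINT, week DATE)
    USING iceberg
""")
```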
So this is a big one: data analysis and processing platforms. They have a lot of differences and very specific features that different organizations might be interested in. For example, Apache Spark is the biggest example of the interoperability and ecosystem that other open-source projects can strive for. Presto or Dremio are excellent examples of separation of compute and storage, and things like that. There are very popular multi-cloud infrastructure management tools that everybody’s familiar with, like Terraform or Pulumi, for declarative configuration, but there are also tools like Crossplane that can help with working across cloud providers and automating vendor infrastructure creation, which are also pretty interesting and could be useful. Another important area is multi-cloud workload portability. This is a new area, and emerging tools like Azure Arc or Google Anthos can really help us bring managed cloud workloads and compute to data anywhere, so this might be useful for Data Mesh as well.
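To give a flavor of the declarative-configuration option, here is a minimal Pulumi sketch in Python that a self-serve platform could use to stamp out per-product storage; the product list, bucket names, and tags are assumptions:

```python
import pulumi
import pulumi_aws as aws

# Provision one storage bucket per data product (illustrative names).
for product in ["orders", "drones", "shipping"]:
    bucket = aws.s3.Bucket(
        f"{product}-output",
        tags={"domain": product, "managed-by": "self-serve-platform"},
    )
    pulumi.export(f"{product}_bucket", bucket.id)
```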
Here, I intentionally included a note with an example of some projects that are open source and run on top of Kubernetes, which is in itself an open-source, extensible, and interoperable platform, and it might be a good candidate for some of the Data Mesh components. The Open Application Model (OAM) can help define a good base for data product creation; it provides some abstractions that can simplify the process. Open Policy Agent is an open-source project that provides a unified API for access management across various platforms. Service Catalog can be used as a unified API for provisioning managed services from Kubernetes clusters. And there are more things like that, but I think these are good examples.
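For example, a data product’s serving layer could ask Open Policy Agent for an access decision over its REST API. This is a minimal sketch; the policy path and input fields are assumptions defined by your own Rego policies:

```python
import requests

# Ask a locally running OPA whether a consumer may read an output port.
decision = requests.post(
    "http://localhost:8181/v1/data/datamesh/allow",  # assumed policy path
    json={"input": {
        "consumer": "logistics-team",
        "product": "orders",
        "port": "weekly-orders-report",
        "action": "read",
    }},
    timeout=5,
).json()

if decision.get("result") is True:
    print("access granted")
```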
So, to summarize the benefits the Data Mesh paradigm can bring to an organization: structuring teams to strive for better quality of data and services powering the business, and allocating resources and people according to prioritized domain areas. It can improve organizational cohesion by coming together on central data standards, data governance, and practices, while preserving the flexibility of each individual team working with data products. It can abstract away complexity by offering a universal centralized infrastructure that others can build on. It can really lower the barrier for domain teams to create new data products, also making it easier to roll out changes and to replace, delete, and evolve data products. Another point is that data product outcomes and metadata can provide measurable insights for the business to understand where the value lies and help form the vision for areas of future innovation and demand. Here are some important focus areas for technology providers, because they will be affecting the decision-making criteria for the technologists who are going to make decisions for their organizations.
While I don’t think everybody should switch to a decentralized architecture like Data Mesh, I do believe that the fact that the Data Mesh concept exists will positively impact the data industry. Since Data Mesh can be helpful, especially for larger and quickly growing organizations and teams, they will strive to choose and build tools that follow Data Mesh core principles and standards. And in turn, this will certainly be noticed by cloud providers and product providers and will drive interoperability, open standards, and data quality in the industry and in the ecosystem of data products and tools.
Thank you so much for watching the talk. Please share the challenges you’ve run into in your data architectures. You can find me online on Twitter, and I will be happy to answer any questions.

Lena Hall

Lena Hall is a Director of Engineering at Microsoft working on Azure, where she focuses on large-scale distributed systems and modern architectures. She is leading an advocacy team and technical strategy...