How Delta Sharing Helped Rearc Simplify Data Sharing and Maximize the Business Value of Its Data
This is a guest-authored post by Daniel Barrundia, Head of Data Engineering, and Dara Kharabi, Product Lead, at Rearc.
Rearc is a cloud and data company that provides best-in-class data sourcing, licensing, and transformation, as well as powerful data solutions and consulting services. With more than 450 open, curated data products available across different sectors, Rearc's cross-industry catalog of datasets is one of the largest available today.
Today, we are extremely excited to announce that the Rearc Data Library is now available via Delta Sharing and that Rearc's suite of data products will be available in the upcoming Databricks Marketplace. Customers using Databricks and Delta Sharing will now be able to access Rearc data more seamlessly than ever before, unlocking faster insights.
Databricks Delta Sharing provides an open solution to securely share live data from your lakehouse to any computing platform. This technology allows our customers to securely access Rearc's curated data tables natively just like they would query their own data lakehouse. We believe Delta Sharing provides a wide range of benefits for both data providers (like Rearc) and data consumers (like our customers). In this blog, we will outline the benefits we are seeing from leveraging Delta Sharing and how we got started with the technology.
Faster customer onboarding and time to value with Databricks Delta Sharing
At Rearc, we manage, update, and deliver an extensive library of data products to our subscribers. Our customers have a wide range of requirements, and we have adopted several different data delivery channels to support their needs. When we evaluate new data-sharing technologies, we ask a few key questions:
- How does this data-sharing technology make our customer experience more effortless?
- What new use cases does this technology unlock?
- Which customers would benefit?
Delta Sharing is especially interesting to us because it addresses several new customer use cases while providing a better experience to customers already using Databricks or other data sharing technology. The ability to seamlessly query Rearc delta shares from within Databricks is a huge lift to the customer onboarding experience since customers can use the tools they're already familiar with to access our data instantly. Because Delta Sharing queries always reflect the latest data, we see it as a way to share real-time data that would otherwise necessitate the creation of an API. Finally, Delta Sharing has unique benefits for our multi-cloud customers since the same interface may be used to query data across clouds.
Rearc's Experience with Delta Sharing
Rearc has been using Delta Sharing for several months now. After a very quick and easy set-up process (more details below), we enabled Delta Sharing for many of our most popular data products and brought them to customers and prospects. So far, Delta Sharing has let us realize many benefits for ourselves and our customers, and the technology has helped us alleviate many of our existing challenges.
Benefits for data consumers
- Faster onboarding for customers: Delta Sharing allowed us to make the new customer experience effortless. Under the traditional data ETL model, a Rearc customer would have to set up an ingestion pipeline to load new versions of the data into their data warehouse or analytics environment before being able to use the data. With Delta Sharing, customers can access the data through their preferred tools with just a set of credentials. If the data consumer is a Databricks customer, the share is set up automatically and ready to go with a few clicks. If the consumer is not a Databricks customer, they are sent a link to a file download containing their unique credentials, allowing them to consume the data through any Delta Sharing-enabled client (Power BI, pandas, etc.). Removing the ingestion pipeline step has let us lower onboarding time by almost 70% in many cases.
- Access to fresh, ready-to-query data: Delta Sharing avoids data staleness issues that our customers experienced in the past when their version of the data didn't reflect the latest version we had published. When our customers query their Delta Shares, they always receive the latest version of the data directly from our data lake.
- Open and cross-cloud data access: Customers with multi-cloud technology stacks are especially happy with Delta Sharing. They can subscribe to a data product once and query it from endpoints on any cloud (and on-prem). This is a huge benefit to organizations that are distributed across multiple clouds. Last and certainly not least, Delta Sharing has already developed a robust list of native connectors that give data recipients the flexibility to use their existing tools. Some of our customers use Delta Sharing directly in their applications through integrations such as Power BI.
Benefits for Rearc
- No data replication: Delta Sharing does not require replication of our source data. Our data sets are already stored in the Parquet format on Amazon S3, allowing us to convert them to Delta Lake and enable sharing without incurring additional expensive and effort-intensive data load processes.
- Cross-cloud data sharing: We can share data more easily across clouds with Delta Sharing. While many of our data sets are stored in Amazon S3, our customers can query this data natively from their Azure environment without any special modifications. Data egress fees are still a consideration, but we no longer worry about maintaining ETL pipelines to replicate our S3 data sets in Azure Blob Storage or Google Cloud Storage.
- Centralized auditing of the shared data: Understanding how our customers use our data allows us to build a better experience for them in the future and refine our strategy for future data offerings. Databricks offers very good auditing features that let us understand how our customers interact with our data and gather valuable insights, whether it's understanding which tables are queried most or determining whom to inform in the event of a data anomaly. This auditing feature is useful in security, compliance, and risk mitigation scenarios.
Getting started with Delta Sharing
In this section, we walk you through how Rearc got started with Delta Sharing. This also serves as a tutorial for any data team looking to get started with Delta Sharing.
The data provider journey
Getting started with Delta Sharing was easy. All we had to do was convert our data from Parquet to the Delta Lake format, create the Delta Shares, and assign access permissions using Unity Catalog. We already have a Databricks environment set up on AWS, and our data sits in Amazon S3, so getting our first Delta Share set up took less than 10 minutes from start to finish. Here's how we did it:
We leveraged the Databricks SQL Data Explorer to manage our tables. Using Databricks, you can:
- Create an External Table from your storage (e.g., Amazon S3 or Azure Blob Storage)
- Manage your Delta Shares (e.g. adding tables or recipients to the shares)
For this blog, we will use the Databricks web UI. All of this can be done through SQL or the Data Sharing API as well.
1. Add tables to Delta shares.
Shares can be thought of as containers or packages of commonly bundled tables that can be shared out to multiple recipients.
2. Add recipients to Delta Shares.
To add a recipient:
- If they're an existing Databricks user, we use the sharing identifier provided by the consumer to automatically establish the share.
- If they're not an existing Databricks customer, we leave the sharing identifier blank and upon creation we are provided with an activation link that we then send to the recipient.
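For teams that prefer SQL over the web UI, the provider steps above can be sketched as a short sequence of Databricks SQL statements. The Python helper below only builds the statements as strings. The S3 path and the table, share, and recipient names are hypothetical placeholders, not Rearc's actual configuration, and the exact statements you need may differ depending on how your Unity Catalog storage is set up.

```python
def provider_statements(parquet_path, table, share, recipient, sharing_id=None):
    """Return Databricks SQL statements that share one table (illustrative only)."""
    create_recipient = f"CREATE RECIPIENT IF NOT EXISTS {recipient}"
    if sharing_id:
        # Databricks recipient: the consumer's sharing identifier links the share.
        create_recipient += f" USING ID '{sharing_id}'"
    # With no identifier, Databricks returns an activation link to send instead.
    return [
        # Convert the existing Parquet files in place, without copying the data.
        f"CONVERT TO DELTA parquet.`{parquet_path}`",
        # Register the converted files as an external table.
        f"CREATE TABLE IF NOT EXISTS {table} USING DELTA LOCATION '{parquet_path}'",
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD TABLE {table}",
        create_recipient,
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


# Hypothetical names, for illustration only.
statements = provider_statements(
    parquet_path="s3://example-bucket/prices",
    table="main.demo.prices",
    share="demo_share",
    recipient="example_recipient",
)
for statement in statements:
    print(statement + ";")
```

In a Databricks notebook, each statement could be passed to `spark.sql` in order. Passing a `sharing_id` corresponds to the Databricks-to-Databricks case; leaving it out corresponds to the activation-link case.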
The data consumer journey
Our data consumers can access the data we publish within minutes using either an access token or directly with Databricks. If you would like to try this out, follow the steps below:
Access using a token
- Get access to the data with a credentials file. You can request access to free Rearc data here. Alternatively, you can try the sample data provided by Databricks. Either option gives you a profile file with your credentials.
- Save your profile file somewhere you can access it. If you want to access the data locally, you can place the profile file on the local file system. If you would like to access the data from the cloud, upload the file to cloud storage (Amazon S3, Azure Blob Storage, DBFS, etc.).
- Create a URL to access a shared table. A table path is the path to the profile file, followed by # and the fully qualified name of the table: `<profile-file-path>#<share-name>.<schema-name>.<table-name>`
- Use the connector of your choice to query the data (pandas, Spark, Databricks SQL, etc.).
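As a minimal sketch of the last two steps, the Python snippet below builds a table URL and shows how it could be read with the open source `delta-sharing` Python connector. The profile path and the share, schema, and table names are hypothetical; substitute the values from your own profile file and share.

```python
def table_url(profile_path, share, schema, table):
    """Join the profile path and the fully qualified table name with '#'."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(url):
    """Read a shared table into a pandas DataFrame."""
    # Imported here so table_url works without the connector installed
    # (the connector is published on PyPI as delta-sharing).
    import delta_sharing

    return delta_sharing.load_as_pandas(url)


# Hypothetical profile path and table name, for illustration only.
url = table_url("/tmp/config.share", "example_share", "example_schema", "example_table")
print(url)
```

Calling `load_shared_table(url)` requires a real profile file at that path and network access to the sharing server, so it is not invoked here.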
Access Delta Sharing within Databricks (easiest method for Databricks users)
- Provide workspace ID to the data provider. If you would like to try this method, you can fill out the form here with your workspace ID and Rearc will give you access to a delta share with free sample datasets.
- The data provider will share the datasets with you.
- The shared data will appear in the 'Shared with me' section in the data explorer.
- Click the 'create a catalog' button, which makes the new data appear within Data Explorer like any other schema.
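The 'create a catalog' step also has a SQL equivalent in Databricks. The sketch below builds the statements a recipient might run; the provider, share, catalog, and table names are hypothetical, and the real provider and share names are the ones that appear under 'Shared with me'.

```python
def consumer_statements(provider, share, catalog, table):
    """Return Databricks SQL statements for mounting and querying a share."""
    return [
        # List the shares this provider has made available to you.
        f"SHOW SHARES IN PROVIDER {provider}",
        # Mount the share as a catalog so it behaves like any other catalog.
        f"CREATE CATALOG IF NOT EXISTS {catalog} USING SHARE {provider}.{share}",
        # Query a shared table; `table` is given as schema.table.
        f"SELECT * FROM {catalog}.{table} LIMIT 10",
    ]


# Hypothetical names, for illustration only.
for statement in consumer_statements(
    "example_provider", "example_share", "shared_data", "example_schema.example_table"
):
    print(statement + ";")
```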
Rearc empowers Data Providers
Rearc's unique capabilities and deep experience working across multiple data platforms position us well to help accelerate onboarding and adoption of Databricks and Delta Sharing for data providers. Rearc enables data providers in all steps of the journey:
- Data Cloud Migration: Rearc can help you move your on-premises data services to the cloud, creating operational efficiencies, lowering costs, and making it easier for your customers to access data. We can help you move traditional data warehouses and RDBMSs to Delta Sharing and evolve your Data Lakehouse.
- Data Preparation and ETL: Preparing and sharing data requires a lot of ETL, data cleaning, and data transformation. Rearc's Data Orchestration experts can help you architect and implement your dataflows. For organizations looking for a turnkey solution, Rearc's Data Platform offers a fully managed option for providing data via Delta Sharing.
- Provider Insights: As a data provider, we know that product analytics is key to prioritization, product design, and long-term success. We can help you with analytics solutions to track and understand delta share usage, surfacing which products customers are most engaged with, and what to prioritize.
What's Next?
Rearc is excited to be a data solutions partner on the upcoming Databricks Marketplace. We will provide our data products to customers on the marketplace and share our expertise with other data providers through our Provider Enablement Services. Delta Sharing powers the Databricks Marketplace, so it comes with all the benefits discussed above.
A marketplace feature we are especially excited about is the ability to share assets such as notebooks, models, and dashboards alongside our data. As a data provider, we strive to provide the best experience for analysts and data scientists using our data for analytics and machine learning pipelines. Because of this, many of Rearc's data solutions go above and beyond just datasets: we include pre-built interactive dashboards and notebooks illustrating common use cases. The Databricks Marketplace provides an easy way to share these assets alongside our data, allowing our customers to combine data and analytics more efficiently.
We are excited to be on the cutting edge of data sharing with Delta Sharing and Databricks Marketplace. With the help of these technologies, we continue to further our mission of eliminating the heavy lifting involved with sourcing, transforming, and curating data so our customers can focus on the valuable data-driven insights that matter to their business.
Resources
Access Rearc data with Delta Sharing.
Watch the recent Rearc webinar on Delta Sharing.
Watch the demo to see Delta Sharing in action.