How Databricks Unity Catalog Helped Amgen Enable Data Governance at Enterprise Scale

Published: July 5, 2023

This blog post was authored by Jaison Dominic, Senior Manager, Information Systems at Amgen, and Lakhan Prajapati, Director of Architecture and Engineering at ZS Associates.

Amgen, the world's largest independent biotech company, has long been synonymous with innovation. For 40 years, we've pioneered new drug-making processes and developed life-saving medicines, positively impacting the lives of millions around the world.

Data and AI are pivotal to our business strategy. Recognizing the abundance of data within our enterprise, our vision was to establish a data-driven organization where data analytics is made accessible through self-service governance capabilities. In our pursuit of modernization, we carefully selected the Databricks Lakehouse Platform as the bedrock of our digital transformation journey. This strategic decision has enabled us to unlock the true potential of our data and AI across various departments, resulting in streamlined operational efficiency and accelerated drug discovery. As we continuously enrich our data lake with diverse domains, including restricted and sensitive data, our impact expands even further.

Furthermore, we recognized the need for enhanced data governance to complement our efforts. Our previous data governance solution proved complex, challenging to manage, and lacked fine-grained access control. To address these obstacles and facilitate widespread adoption of our governance capability within the enterprise, we have recently integrated Databricks Unity Catalog into our governance processes. This integration represents a significant milestone in our journey, bolstering data governance with a robust solution that is user-friendly and simple to manage while offering granular access control.

Today, we are sharing our progress and success so far in the hopes that others can learn from our journey and apply it to their own business strategies.

Using IAM roles for governance was difficult to manage and lacked fine-grained access controls

Amgen operates within a highly regulated industry where compliance is the cornerstone of our operations. We recognize the critical importance of proper governance and auditability for any restricted or sensitive data. Data democratization was the original objective of our Enterprise data lake initiative, ensuring that all Amgen users have access to the available data. However, the inclusion of sensitive data in the data lake highlighted the need for more robust data access governance.

Previously, we relied on AWS Glue as an enterprise data catalog and AWS Identity and Access Management (IAM) for role-based access controls. This involved creating separate IAM roles and associating them with specific clusters to cater to unique use cases. However, managing numerous groups and their associated cluster resources independently posed significant challenges. Moreover, IAM roles only governed access to storage, leaving metadata accessible to all. The absence of fine-grained access controls also complicated auditing, hindering our ability to review data access and executed queries effectively.

To address these challenges, we recognized the need to transition to user-level access and user attribute-based access controls. For example, users would be assigned attributes such as cost centers, and data within Finance would be controlled based on the assigned cost center. However, implementing user-attribute-based access control through IAM roles would have required the creation of a vast number of roles, posing a significant management burden.
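
To make the cost-center example concrete, here is a minimal Python sketch of attribute-based filtering. It is an illustration only, not Amgen's implementation: the `User` and `FinanceRow` types, the `cost_centers` attribute, and all sample values are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical types for illustration; not part of any Databricks or AWS API.
@dataclass(frozen=True)
class User:
    name: str
    cost_centers: frozenset  # attribute values assigned to the user


@dataclass(frozen=True)
class FinanceRow:
    cost_center: str
    amount: float


def visible_rows(user, rows):
    """Return only the rows whose cost center is among the user's attributes."""
    return [row for row in rows if row.cost_center in user.cost_centers]


# Made-up sample data.
rows = [
    FinanceRow("CC100", 1200.0),
    FinanceRow("CC200", 850.0),
    FinanceRow("CC300", 430.0),
]
analyst = User("analyst_a", frozenset({"CC100", "CC300"}))

for row in visible_rows(analyst, rows):
    print(row.cost_center, row.amount)
```

The point of the sketch is that a single rule evaluated against user attributes can stand in for a separate role per cost center, or per combination of cost centers.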

We evaluated several off-the-shelf governance tools. While some of the tools met immediate requirements, such as managing tables at the database level, they proved inadequate for highly restricted data domains like EDW (Finance) and Workday (HR). Moreover, we were concerned that these tools could be bypassed on a Databricks cluster, creating potential vulnerabilities, and that ensuring comprehensive coverage across all clusters and scaling the solution would be difficult. Additionally, maintaining plugins on selective clusters posed challenges in terms of script consistency and ongoing maintenance.

Migrating to Unity Catalog simplified access management and eliminated noncompliance and security incidents

Currently, 90 percent of our use cases are on Databricks. Given that, we felt we needed a Databricks native governance solution for the long term. To begin moving in that direction, we turned to Unity Catalog.

Adopting Unity Catalog resulted in several immediate benefits.

First, we no longer have to create or manage 120+ IAM roles. We can control access through Unity Catalog and the APIs it provides. Everything is managed through access control lists (ACLs) or dynamic views. As a result, we went from hundreds of IAM roles to just one or two principal IAM roles.
The second benefit we realized is easy auditability. Auditing Unity Catalog ACLs is much easier than parsing IAM policies and then identifying who has what access. This reduces the audit effort for the function by 50%. The query history also lets us see who accessed what data at what point in time.
Third, Unity Catalog is easy to manage. It has allowed us to move away from dedicated cluster-based access to a shared cluster pool with user- and role-based access controls, reducing Databricks cost by 10-20%.
Finally, it unifies everything in a central place and enables seamless cross-functional data analytics, and its tight integration with the Databricks ecosystem provides true differentiation.
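
As a rough illustration of managing access through ACLs and dynamic views, the Python sketch below builds the kind of SQL statements involved: a `GRANT` for table-level access and a dynamic view that filters rows with the `is_account_group_member()` function. This is a sketch under stated assumptions, not our production code. The catalog, schema, table, and group names are hypothetical, and the helper functions `grant_select` and `dynamic_view` are invented for this example.

```python
def grant_select(table, group):
    """Build a GRANT statement giving a group read access to a table."""
    return f"GRANT SELECT ON TABLE {table} TO `{group}`"


def dynamic_view(view, source, filter_column, group_by_value):
    """Build a dynamic view that shows a row only to members of the group
    mapped to that row's value in filter_column."""
    cases = "\n".join(
        f"    WHEN {filter_column} = '{value}' THEN is_account_group_member('{group}')"
        for value, group in group_by_value.items()
    )
    return (
        f"CREATE OR REPLACE VIEW {view} AS\n"
        f"SELECT * FROM {source}\n"
        f"WHERE CASE\n{cases}\n    ELSE FALSE\n  END"
    )


# Hypothetical object and group names. Plain string formatting is used for
# brevity only; do not build SQL this way from untrusted input.
print(grant_select("finance.edw.ledger", "finance_readers"))
print(dynamic_view(
    view="finance.edw.ledger_by_cost_center",
    source="finance.edw.ledger",
    filter_column="cost_center",
    group_by_value={"CC100": "cc100_readers", "CC200": "cc200_readers"},
))
```

In this pattern, users query the view rather than the underlying table, so the row filter is applied at query time according to the caller's group membership, with no additional IAM roles.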

Currently, we have around 500 objects mapped in Unity Catalog (and growing) and governed through its ACLs. Since moving to Unity Catalog, we have much higher confidence in our data governance and adherence to compliance. As we onboard more functions, we expect these benefits to multiply.
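
With objects governed through ACLs, the audit question of who has what access comes down to grouping grant records by principal. The short Python sketch below shows the idea. The record layout, principals, and object names are hypothetical and do not reflect an actual export format.

```python
from collections import defaultdict

# Hypothetical grant records; the field names are assumptions for illustration.
grants = [
    {"principal": "finance_readers", "privilege": "SELECT", "object": "finance.edw.ledger"},
    {"principal": "hr_readers", "privilege": "SELECT", "object": "hr.workday.employees"},
    {"principal": "finance_readers", "privilege": "SELECT", "object": "finance.edw.budget"},
]


def access_by_principal(records):
    """Group (privilege, object) pairs under each principal, sorted for stable output."""
    report = defaultdict(set)
    for record in records:
        report[record["principal"]].add((record["privilege"], record["object"]))
    return {principal: sorted(pairs) for principal, pairs in sorted(report.items())}


for principal, pairs in access_by_principal(grants).items():
    for privilege, obj in pairs:
        print(principal, privilege, obj)
```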

Building further on our Databricks Unity Catalog success

This is only the initial stage of our journey. We have a bigger vision ahead and are diligently crafting a strategy that will propel us toward our goal of migrating the majority of our data assets from AWS Glue to Unity Catalog. As our enterprise data landscape encompasses numerous data domains, thousands of databases, and millions of objects, Unity Catalog is poised to become our default catalog. This strategic shift will streamline and unify our data ecosystem, enabling seamless management and exploration of our extensive data resources.

We'll use Unity Catalog's data lineage features to enhance observability, build confidence in our data creation, and track sensitive data usage across our data estate. Additionally, we're enthusiastic about utilizing Delta Sharing in Unity Catalog for external data sharing. While we currently share data internally, we're actively exploring the collection and sharing of external data with multiple vendors through Delta Sharing.

In conclusion, the integration of Unity Catalog has enhanced our ability to implement precise and intricate governance policies for Amgen's restricted data sets, including Finance and Workday. This achievement has sparked immense enthusiasm within our data engineering department, leading to increased investment in our data platform, with Unity Catalog serving as the central metastore and access management service. Looking ahead to the next year, we anticipate that Unity Catalog will facilitate over 80% of application data consumption at Amgen, benefiting our user base of over 10,000 active users. With this shift, we are poised to achieve efficiency improvements of 60-80% in auditing and access management, firmly positioning our company for success as we continue to expand our analytics offerings.

Watch our presentation at Data and AI Summit 2023 to learn more.
