Tao Feng is an engineer at Databricks. Tao is the co-creator of Amundsen, an open source data discovery and metadata platform, and a committer and PMC member of Apache Airflow. Previously, Tao worked at Lyft, LinkedIn, and Oracle on data infrastructure, tooling, and performance.
May 27, 2021 03:15 PM PT
Databricks previously relied on a static, manually maintained wiki page for internal data exploration. We will discuss how we use Amundsen, an open source data discovery tool from Linux Foundation AI & Data, to improve productivity and trust by programmatically surfacing the most relevant datasets and SQL analytics dashboards, along with their key information, internally at Databricks.
We will also talk about how we integrate Amundsen with Databricks' world-class infrastructure to surface metadata including:
Last but not least, we will discuss how we incorporate internal user feedback and provide the same discovery productivity improvements for Databricks customers in the future.
November 18, 2020 04:00 PM PT
Amundsen is a data discovery and metadata platform that originated at Lyft and was recently donated to Linux Foundation AI. Since it was open-sourced, Amundsen has been used and extended by many companies in our community.
In short, Amundsen is built on 3 key pillars:
1. Augmented Data Graph: Amundsen uses a graph database (Neo4j by default) under the hood to store relationships between various data assets (tables, dashboards, protobuf events, etc.). What's unique to Amundsen is that we bring all related metadata (usage, last updated, watermark, stats, etc.) into this graph. For example, we also treat people as a first-class data asset: there is a graph node for each person in the organization, connected to other nodes such as tables and dashboards. This helps with problems such as onboarding, by answering questions like “Which tables do my team members use most frequently?”
2. Intuitive User Experience: Amundsen strives to deliver data discovery relevant to the user by running PageRank using data from access logs to power search ranking, similar to how Google ranks web pages on the internet.
3. Centralized Metadata from different sources: Amundsen gathers metadata from various sources (Hive, Presto, Airflow, etc.) and exposes it in one central place. The right place to store all this metadata is a work in progress. It also provides data lineage across different sources, allowing users to understand how data is connected.
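The first two pillars can be illustrated with a small sketch. This is not Amundsen's implementation: it replaces the graph database with an in-memory dictionary, and the access-log records, user names, table names, and function names are all hypothetical. It only shows the idea of people as graph nodes linked to tables, and of a PageRank-style score computed from access logs.

```python
from collections import defaultdict

# Hypothetical access-log records: (user, table, number_of_queries).
ACCESS_LOG = [
    ("alice", "rides", 40),
    ("alice", "drivers", 5),
    ("bob", "rides", 10),
    ("bob", "payments", 30),
    ("carol", "payments", 2),
]


def build_usage_graph(log):
    """Return {user: {table: weight}}: edges from person nodes to table nodes."""
    graph = defaultdict(dict)
    for user, table, count in log:
        graph[user][table] = graph[user].get(table, 0) + count
    return dict(graph)


def frequent_tables(graph, user, top_n=3):
    """Tables a given person queries most often, most-used first."""
    edges = graph.get(user, {})
    return sorted(edges, key=lambda t: (-edges[t], t))[:top_n]


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank over the person -> table graph (power iteration)."""
    nodes = set(graph)
    for edges in graph.values():
        nodes.update(edges)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            edges = graph.get(node, {})
            total = sum(edges.values())
            if total == 0:
                # Dangling node (here, a table): spread its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
            else:
                for target, weight in edges.items():
                    new_rank[target] += damping * rank[node] * weight / total
        rank = new_rank
    return rank


graph = build_usage_graph(ACCESS_LOG)
scores = pagerank(graph)
tables_by_score = sorted(
    (node for node in scores if node not in graph),
    key=lambda t: -scores[t],
)
```

Here `frequent_tables(graph, "alice")` answers the team-member question for one person, and `tables_by_score` orders tables by their score, which a search service could use as one ranking signal. Note that standard PageRank normalizes each person's outgoing edges, so a table's score reflects what share of each user's queries it receives rather than raw query counts.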
In this talk, we will discuss what a data discovery experience would look like in an ideal world and what Lyft has done to make that possible. Then we will take a deep dive into Amundsen's architecture and discuss how it achieves these three design pillars. More importantly, we will discuss how Amundsen can be customized and extended to fit other companies’ data ecosystems. Lastly, we will close with the future roadmap of the project, what problems remain unsolved, and how we can work together to solve them.
Speaker: Tao Feng