Tao Feng

Engineer, Databricks

Tao Feng is an engineer at Databricks. Tao is the co-creator of Amundsen, an open source data discovery and metadata platform, and a committer and PMC member of Apache Airflow. Previously, Tao worked at Lyft, LinkedIn, and Oracle on data infrastructure, tooling, and performance.

Past sessions

Summit 2021 Data Discovery at Databricks with Amundsen

May 27, 2021 03:15 PM PT

Databricks used to rely on a static, manually maintained wiki page for internal data exploration. We will discuss how we leverage Amundsen, an open source data discovery tool from Linux Foundation AI & Data, to improve productivity and trust inside Databricks by programmatically surfacing the most relevant datasets and SQL analytics dashboards along with their key information.

We will also talk about how we integrate Amundsen with Databricks' world-class infrastructure to:

  • Surface the most popular tables used within Databricks
  • Support fuzzy search and facet search for datasets
  • Surface rich metadata on datasets:
    • Lineage information (downstream table, upstream table, downstream jobs, downstream users)
    • Dataset owner
    • Dataset frequent users
    • Delta extended metadata (e.g., change history)
    • ETL job that generates the dataset
    • Column stats on numeric type columns
    • Dashboards that use the given dataset
  • Use the Databricks data tab to show sample data
  • Surface metadata on dashboards, including create time, last update time, tables used, etc.

Last but not least, we will discuss how we incorporate internal user feedback and provide the same discovery productivity improvements for Databricks customers in the future.

In this session watch:
Tao Feng, Engineer, Databricks
Tianru Zhou, Software Engineer, Databricks


Amundsen is a data discovery and metadata platform that originated at Lyft and was recently donated to Linux Foundation AI. Since it was open-sourced, Amundsen has been used and extended by many companies in the community.

In short, Amundsen is built on 3 key pillars:

1. Augmented Data Graph: Amundsen uses a graph database (Neo4j by default) under the hood to store relationships between various data assets (tables, dashboards, protobuf events, etc.). What's unique to Amundsen is that we bring all related metadata (usage, last updated, watermark, stats, etc.) into this graph. For example, we also treat people as a first-class data asset; in other words, there's a graph node for each person in the organization that connects to other nodes (such as tables and dashboards). This helps with problems such as onboarding, by answering questions like “which tables do my team members use most frequently?”

2. Intuitive User Experience: Amundsen strives to deliver data discovery relevant to the user by running PageRank using data from access logs to power search ranking, similar to how Google ranks web pages on the internet.

3. Centralized Metadata from different sources: Amundsen gathers metadata from various sources (Hive, Presto, Airflow, etc.) and exposes it in one central place. Deciding the right place to store all this metadata is still a work in progress. Amundsen also provides data lineage across different sources, so users can understand how their data is connected.
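To make the first two pillars concrete, here is a minimal sketch in plain Python. It is not Amundsen's implementation: the access-log records, table names, and function names are all hypothetical, and an in-memory dictionary stands in for the graph database. It treats people and tables as nodes in one usage graph, answers the "which tables does my teammate use most" question, and ranks tables with a simple weighted PageRank computed by power iteration.

```python
from collections import Counter, defaultdict

# Hypothetical access-log records: one (user, table) pair per query.
ACCESS_LOG = [
    ("alice", "core.rides"),
    ("alice", "core.rides"),
    ("alice", "core.users"),
    ("bob", "core.rides"),
    ("bob", "staging.tmp_events"),
    ("carol", "core.rides"),
    ("carol", "core.users"),
]


def build_usage_graph(log):
    """Return {node: Counter(neighbor -> weight)}; people and tables are both nodes."""
    graph = defaultdict(Counter)
    for user, table in log:
        person, tbl = ("person", user), ("table", table)
        graph[person][tbl] += 1  # person reads table
        graph[tbl][person] += 1  # table is read by person
    return graph


def frequent_tables(graph, user, top_n=3):
    """Answer: which tables does this teammate use most often?"""
    edges = graph.get(("person", user), Counter())
    return [name for (_, name), _ in edges.most_common(top_n)]


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over the usage graph."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, edges in graph.items():
            total = sum(edges.values())
            for neighbor, weight in edges.items():
                # Each node passes its rank to neighbors in proportion to usage.
                new_rank[neighbor] += damping * rank[node] * weight / total
        rank = new_rank
    return rank


def rank_tables(graph):
    """Order tables from highest to lowest PageRank score."""
    scores = pagerank(graph)
    tables = [(name, s) for (kind, name), s in scores.items() if kind == "table"]
    return [name for name, _ in sorted(tables, key=lambda t: t[1], reverse=True)]


graph = build_usage_graph(ACCESS_LOG)
print(frequent_tables(graph, "alice"))
print(rank_tables(graph))
```

In this toy data the most widely queried table ranks first and the rarely used staging table ranks last, which is the behavior a usage-driven search ranking is meant to produce. A real deployment would read these relationships from the graph store rather than rebuild them from a list.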

In this talk, we will discuss what a data discovery experience would look like in an ideal world and what Lyft has done to make that possible. Then we will dive deep into Amundsen's architecture and discuss how it achieves the three design pillars above. More importantly, we will discuss how Amundsen can be customized and extended to fit other companies’ data ecosystems. Lastly, we will close with the future roadmap of the project, what problems remain unsolved, and how we can work together to solve them.

Speaker: Tao Feng