Tianru Zhou

Software Engineer, Databricks

Tianru Zhou currently works at Databricks on data discovery projects, including integrating Amundsen with existing infrastructure. Previously, he worked at AWS Elasticsearch on storage-layer development for UltraWarm.

Past sessions

Summit 2021 Data Discovery at Databricks with Amundsen

May 27, 2021 03:15 PM PT

Databricks used to rely on a static, manually maintained wiki page for internal data exploration. We will discuss how we use Amundsen, an open source data discovery tool from Linux Foundation AI & Data, to improve productivity with trust by programmatically surfacing the most relevant datasets and SQL analytics dashboards, along with their key information, internally at Databricks.

We will also talk about how we integrate Amundsen with Databricks' world-class infrastructure to:

  • Surface the most popular tables used within Databricks
  • Support fuzzy search and facet search for datasets
  • Surface rich metadata on datasets:
    • Lineage information (downstream table, upstream table, downstream jobs, downstream users)
    • Dataset owner
    • Dataset frequent users
    • Delta extended metadata (e.g., change history)
    • ETL job that generates the dataset
    • Column stats on numeric type columns
    • Dashboards that use the given dataset
  • Use the Databricks data tab to show sample data
  • Surface metadata on dashboards, including create time, last update time, tables used, etc.

Last but not least, we will discuss how we incorporate internal user feedback and provide the same discovery productivity improvements for Databricks customers in the future.

In this session watch:
Tao Feng, Engineer, Databricks
Tianru Zhou, Software Engineer, Databricks