CUSTOMER STORY

Fueling materials science innovation with unified data

Corning achieves end-to-end data orchestration with Databricks Lakeflow Jobs

~5 petabytes: Data unified by Databricks

2,500: Jobs running

900: Active Databricks users

Product descriptions:

Lakeflow Jobs, Unity Catalog

Corning is a Fortune 500 manufacturing company founded in 1851. The organization focuses on glass science, optical physics and ceramic science and provides innovative products to the life sciences, mobile consumer electronics, optical communications, display and automotive markets. As an innovator in materials science, Corning relies on data to fuel inventions and patents. But with 50,000 employees and numerous manufacturing plants spread across the globe, siloed data became one of the company’s biggest challenges, resulting in more than 400 data repositories and one billion Excel files. Corning sought to combine all their data and create a centralized data platform where analysts, data engineers and machine learning scientists could conduct self-service data exploration and glean insights faster. To accomplish that, Corning turned to the Databricks Data Intelligence Platform and Databricks Lakeflow Jobs.

Legacy data warehousing and isolated on-premises data systems create siloed data challenges

Corning collects data from hundreds of sources globally, and new sources emerge almost daily. The company’s legacy data warehousing solution could no longer keep pace, and analytics activities were cumbersome. Corning data scientists landed data in a legacy data warehouse, performed ETL within on-premises cubes and then connected those cubes to a Power BI model. This expensive, time-consuming process resulted in more siloed data. The company also struggled with compute power, scalability and performance.

Jibreal Hamenoo, Principal System Engineer, Data Engineering at Corning, wanted to create a central data source to enable data engineers, machine learning engineers and data scientists to self-serve. Ultimately, this would help Corning achieve several key business outcomes, including optimizing finance, better predicting demand from its manufacturing clients, forecasting plant maintenance and using tools like image analysis to conduct defect analysis. “To accomplish all that over 400 different datasets with different teams was a big challenge,” said Hamenoo.

Building a unified platform on the Databricks Data Intelligence Platform

Corning turned to the Databricks Data Intelligence Platform to centralize their data sources. “Databricks allows us to bring all our data into the platform in all different formats,” shared Hamenoo. “Users can see data from different silos in one place, access it, refine it, curate it and add value to it.”

The Data Intelligence Platform also provides Corning the compute power they need to derive insights faster. “Instead of a data scientist loading data locally into a client, they can go into the Databricks workspace, stand up an instance of the right size and plow through billions of data points and make inferences quickly,” added Hamenoo. “That’s a big advantage to a company that relies on innovation.”

Corning uses Databricks Lakeflow Jobs to execute multiple data pipeline tasks, enabling full medallion pipeline orchestration. Raw data ingested from various sources is brought into a Bronze layer. Data engineers can then act on that data, refine it and build Silver or Gold tables. After a developer, data engineer or data scientist has worked with the data and tested it, Lakeflow Jobs enables Corning to establish an automated, repeatable process for data pipelines. “Databricks Lakeflow Jobs plays a critical role in allowing us to repeatedly, on our own schedules, run a whole pipeline orchestration, with end-to-end data flow through the Data Intelligence Platform,” explained Hamenoo.
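The Bronze, Silver and Gold flow described above can be illustrated with a minimal sketch. It is plain Python rather than Corning's actual pipeline code, and the records, field names and the functions to_silver and to_gold are hypothetical, chosen only to show how each layer refines the one before it.

```python
from collections import defaultdict

# Hypothetical raw records standing in for data landed in the Bronze layer.
bronze = [
    {"plant": "A", "sensor": "temp", "value": "71.5"},
    {"plant": "A", "sensor": "temp", "value": "bad"},  # malformed reading
    {"plant": "B", "sensor": "temp", "value": "68.0"},
    {"plant": "A", "sensor": "temp", "value": "72.5"},
]


def to_silver(rows):
    """Silver: keep rows whose value parses as a float, with typed values."""
    silver_rows = []
    for row in rows:
        try:
            silver_rows.append({**row, "value": float(row["value"])})
        except ValueError:
            continue  # drop malformed readings
    return silver_rows


def to_gold(rows):
    """Gold: aggregate cleaned rows to an average value per plant."""
    values_by_plant = defaultdict(list)
    for row in rows:
        values_by_plant[row["plant"]].append(row["value"])
    return {plant: sum(vals) / len(vals) for plant, vals in values_by_plant.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In an orchestrated job, each step would typically run as its own task, with the Silver task depending on Bronze ingestion and the Gold task depending on Silver, so the whole chain can be scheduled and repeated as a unit.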

Corning engineers previously used Apache Airflow as the data orchestration tool but have now fully migrated to Lakeflow Jobs. Corning significantly transformed their data and AI capabilities by leveraging several Databricks products, including Databricks Lakeflow Jobs, Spark Declarative Pipelines, Unity Catalog and Databricks Assistant.

Transforming raw data into insights

Using the Databricks Data Intelligence Platform, Corning now runs approximately 2,500 different jobs across about 5 petabytes of data. Their self-serve platform is currently used by about 900 active users spread across the business globally. Databricks Lakeflow Jobs helped Corning simplify data orchestration.

Before adopting Databricks Lakeflow Jobs, Corning had to rely on a cumbersome and manual process to gather information, painstakingly piecing together data from CSV files pushed from their accounts into an Amazon S3 bucket. The process was not only time-consuming but also prone to errors. With Lakeflow Jobs, the entire process became seamless. The automation and integration capabilities of Lakeflow Jobs eliminated the need for manual data compilation, allowing Corning to focus on deriving actionable insights. As Hamenoo put it, “Lakeflow Jobs makes my life easier.”

Data scientists now orchestrate once and can quickly reuse that orchestration across multiple workspaces or domains. They can also leverage custom library capabilities. “If you’re working in a notebook environment where data processing features require custom libraries, Lakeflow Jobs enables us to do that,” said Hamenoo. “We can install or assign custom libraries to our workflows, and that capability is available to every object within the pipeline. That’s a great benefit to us.”

Lakeflow Jobs is also helping Corning quantify resource usage and costs and quickly identify optimization opportunities. “That enables us to reduce costs and enhance workload performance,” added Hamenoo. “Lakeflow Jobs also allows us to observe the full lineage between our Bronze staging environment through our Silver and Gold medallion architecture.”

Additionally, Lakeflow Jobs improved observability, giving Corning data scientists an opportunity to be proactive and catch problems before they become larger issues. The ability to repair and rerun subsets of directed acyclic graphs (DAGs) allows them to find the cause of a failure without having to restart an entire workflow. Finally, using Databricks enhances Corning’s compute performance, enabling them to scale production data volumes quickly.
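The idea behind rerunning only a subset of a DAG can be sketched as a small graph traversal. This is an illustration under stated assumptions, not the Lakeflow Jobs implementation: the task names and the tasks_to_rerun helper are hypothetical, and the sketch assumes a repair reruns each failed task plus every task downstream of it while leaving successful upstream tasks untouched.

```python
from collections import deque

# Hypothetical task graph: each task maps to the tasks that depend on it.
downstream = {
    "ingest_bronze": ["build_silver"],
    "build_silver": ["build_gold", "quality_checks"],
    "build_gold": ["refresh_dashboard"],
    "quality_checks": [],
    "refresh_dashboard": [],
}


def tasks_to_rerun(graph, failed):
    """Return the failed tasks plus all tasks downstream of them (breadth-first)."""
    to_rerun = set(failed)
    queue = deque(failed)
    while queue:
        task = queue.popleft()
        for child in graph.get(task, []):
            if child not in to_rerun:
                to_rerun.add(child)
                queue.append(child)
    return to_rerun


rerun = tasks_to_rerun(downstream, ["build_gold"])
print(sorted(rerun))
```

In this hypothetical graph, a failure in build_gold selects only build_gold and refresh_dashboard, so ingestion, the Silver build and the quality checks are not repeated.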

Simplifying governance with Unity Catalog

Corning decided to migrate all their workspaces to Unity Catalog due to its ease of use and robust governance capabilities. They previously relied on Privacera for access management and role-based management but found Unity Catalog more advantageous for several reasons. The ability of Unity Catalog to govern various types of assets, including ingested, raw, unstructured and structured data as well as SQL code, played a crucial role in their decision. Additionally, the seamless integration of Databricks Assistant with Unity Catalog was a significant benefit.

One of the primary challenges Corning faced prior to adopting Unity Catalog was the cumbersome process of daily metadata migration or cloning from upper environments to provide raw data for development purposes. This process was time-consuming and complex, often resulting in data duplication across different workspaces. With Unity Catalog, Corning can now bind workspaces and create catalogs accessible from non-production environments for read and write operations, eliminating the need for data duplication. Additionally, harnessing the power of System Tables, Corning built an advanced user metric dashboard that offered a clear view of critical metrics, including the total number of active users in the Databricks system, total daily active users and the cost consumption of Databricks Units (DBUs) by user and division. This level of visibility was a game changer for Corning.
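The kinds of metrics named above (total active users, daily active users and DBU consumption by division) can be sketched as simple aggregations. This is a hypothetical illustration in plain Python: the rows, field names and values are invented for the example and do not reflect the actual System Tables schemas or Corning's dashboard.

```python
from collections import defaultdict

# Hypothetical usage rows; real system tables have their own schemas.
usage = [
    {"date": "2024-01-01", "user": "ana", "division": "optical", "dbus": 12.0},
    {"date": "2024-01-01", "user": "raj", "division": "display", "dbus": 8.0},
    {"date": "2024-01-02", "user": "ana", "division": "optical", "dbus": 5.0},
]

# Total active users: distinct users seen anywhere in the usage rows.
active_users = {row["user"] for row in usage}

# Daily active users and DBU consumption by division, built in one pass.
daily_active = defaultdict(set)
dbus_by_division = defaultdict(float)
for row in usage:
    daily_active[row["date"]].add(row["user"])
    dbus_by_division[row["division"]] += row["dbus"]

daily_counts = {day: len(users) for day, users in daily_active.items()}

print(len(active_users), daily_counts, dict(dbus_by_division))
```

A production dashboard would more likely express the same grouping as SQL queries over the usage data, but the aggregation logic is the same.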

Looking ahead with Databricks Assistant, Lakehouse Federation and serverless compute

“The introduction of Databricks Assistant has truly impressed me. I no longer have to write code. What used to take me one hour to write I did in five minutes. From the advanced users to the basic users at Corning, everyone is amazed by the immediate impact,” stated Hamenoo. At Corning, Databricks Assistant not only saved significant time but also empowered basic users who are not formally trained as programmers and may not consider themselves coders to effectively engage in data engineering tasks.

Corning is previewing Lakehouse Federation, which allows them to access data in Google BigQuery without the need for ingestion, further simplifying their data management processes.

Data teams have also started experimenting with serverless compute for Lakeflow Jobs. Hamenoo summarized, “What I’m most looking forward to for serverless Lakeflow Jobs for myself and my team is no longer having to spend time overseeing clusters, policies and upgrades. This shift to serverless will free up valuable bandwidth that was previously dedicated to platform administration.”