Fueling materials science innovation with unified data
Corning achieves end-to-end data orchestration with Databricks Workflows
About 5 petabytes of data unified by Databricks
About 2,500 jobs running
About 900 active Databricks users
Corning is a Fortune 500 manufacturing company founded in 1851. The organization focuses on glass science, optical physics and ceramic science and provides innovative products to the life sciences, mobile consumer electronics, optical communications, display and automotive markets. As an innovator in materials science, Corning relies on data to fuel inventions and patents. But with 50,000 employees and numerous manufacturing plants spread across the globe, siloed data became one of the company’s biggest challenges, resulting in more than 400 data repositories and one billion Excel files. Corning sought to combine all their data and create a centralized data platform where analysts, data engineers and machine learning scientists could conduct self-service data exploration and glean insights faster. To accomplish that, Corning turned to the Databricks Data Intelligence Platform and Databricks Workflows.
Legacy data warehousing and isolated on-premises data systems create siloed data challenges
Corning collects data from hundreds of sources globally, and new sources emerge almost daily. The company’s legacy data warehousing solution could no longer keep pace, and analytics activities were cumbersome. Using on-premises cubes connected to a Power BI model, Corning data scientists landed data in a legacy data warehouse, performed ETL within a cube and connected the result to Power BI. This expensive, time-consuming process resulted in more siloed data. The company also struggled with compute power, scalability and performance.
Jibreal Hamenoo, Principal System Engineer, Data Engineering at Corning, wanted to create a central data source to enable data engineers, machine learning engineers and data scientists to self-serve. Ultimately, this would help Corning achieve several key business outcomes, including optimizing finances, better predicting demand from its manufacturing clients, forecasting plant maintenance and using tools like image analysis to conduct defect analysis. “To accomplish all that over 400 different datasets with different teams was a big challenge,” said Hamenoo.
Building a unified platform on the Databricks Data Intelligence Platform
Corning turned to the Databricks Data Intelligence Platform to centralize their data sources. “Databricks allows us to bring all our data into the platform in all different formats,” shared Hamenoo. “Users can see data from different silos in one place, access it, refine it, curate it and add value to it.”
The Data Intelligence Platform also provides Corning the compute power they need to derive insights faster. “Instead of a data scientist loading data locally into a client, they can go into the Databricks workspace, stand up an instance of the right size and plow through billions of data points and make inferences quickly,” added Hamenoo. “That’s a big advantage to a company that relies on innovation.”
Corning uses Databricks Workflows to execute multiple data pipeline tasks, enabling full medallion pipeline orchestration. Raw data ingested from various sources is brought into a Bronze layer. Data engineers can then act on that data, refine it and build Silver or Gold tables. After a developer, data engineer or data scientist has worked with the data and tested it, Workflows enables Corning to establish an automated, repeatable process for data pipelines. “Databricks Workflows plays a critical role in allowing us to repeatedly, on our own schedules, run a whole pipeline orchestration, with end-to-end data flow through the Data Intelligence Platform,” explained Hamenoo.
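The Bronze, Silver and Gold flow described above can be pictured as a small graph of dependent tasks. The sketch below is a plain Python illustration only, not Corning's pipeline and not the Databricks Workflows API. The sample records, task names and the `run_pipeline` helper are all hypothetical, and the standard library's `graphlib` stands in for the orchestrator.

```python
from graphlib import TopologicalSorter

# Hypothetical raw records standing in for data landed in the Bronze layer.
BRONZE_SOURCE = [
    {"plant": "A", "defects": "3"},
    {"plant": "A", "defects": "bad"},  # malformed row, filtered out in Silver
    {"plant": "B", "defects": "5"},
    {"plant": "A", "defects": "2"},
]


def ingest_bronze(state):
    # Bronze: keep the raw records exactly as they arrived.
    state["bronze"] = list(BRONZE_SOURCE)


def build_silver(state):
    # Silver: keep only rows whose defect count parses as an integer.
    state["silver"] = [
        {"plant": row["plant"], "defects": int(row["defects"])}
        for row in state["bronze"]
        if row["defects"].isdigit()
    ]


def build_gold(state):
    # Gold: aggregate defects per plant for reporting.
    totals = {}
    for row in state["silver"]:
        totals[row["plant"]] = totals.get(row["plant"], 0) + row["defects"]
    state["gold"] = totals


# Hypothetical task registry and dependency map (task -> tasks it depends on).
TASKS = {"bronze": ingest_bronze, "silver": build_silver, "gold": build_gold}
DEPENDS_ON = {"bronze": set(), "silver": {"bronze"}, "gold": {"silver"}}


def run_pipeline():
    # Run every task once, in an order that respects the dependencies.
    state = {}
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        TASKS[name](state)
    return order, state


if __name__ == "__main__":
    order, state = run_pipeline()
    print(order)
    print(state["gold"])
```

In a real deployment the orchestrator would own scheduling, retries and compute. The point here is only that each layer runs after the layer it depends on, which is what makes the process repeatable.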
Corning engineers previously used Apache Airflow as the data orchestration tool but have now fully migrated to Workflows. Corning significantly transformed their data and AI capabilities by leveraging several Databricks products, including Databricks Workflows, Delta Live Tables (DLT), Unity Catalog and Databricks Assistant.
Transforming raw data into insights
Using the Databricks Data Intelligence Platform, Corning now runs approximately 2,500 different jobs spanning about 5 petabytes of data. Their self-serve platform is currently used by about 900 active users spread across the business globally, and Databricks Workflows has helped Corning simplify data orchestration for them.
Before adopting Databricks Workflows, Corning had to rely on a cumbersome and manual process to gather information, painstakingly piecing together data from CSV files pushed from their accounts into an Amazon S3 bucket. The process was not only time-consuming but also prone to errors. With Workflows, the entire process became seamless. The automation and integration capabilities of Workflows eliminated the need for manual data compilation, allowing Corning to focus on deriving actionable insights. As Hamenoo put it, “Workflows makes my life easier.”
Data scientists now orchestrate once and can quickly reuse that orchestration across multiple workspaces or domains. They can also leverage custom library capabilities. “If you’re working in a notebook environment where data processing features require custom libraries, Workflows enables us to do that,” said Hamenoo. “We can install or assign custom libraries to our workflows, and that capability is available to every object within the pipeline. That’s a great benefit to us.”
Workflows is also helping Corning quantify resource usage and costs and quickly identify optimization opportunities. “That enables us to reduce costs and enhance workload performance,” added Hamenoo. “Workflows also allows us to observe the full lineage between our Bronze staging environment through our Silver and Gold medallion architecture.”
Additionally, Workflows improved observability, giving Corning data scientists the opportunity to be proactive and catch problems before they become larger issues. The ability to repair and rerun subsets of directed acyclic graphs (DAGs) also allows them to find the cause of a failure without having to restart an entire workflow. Finally, using Databricks enhances Corning’s compute performance, enabling them to scale production data volumes quickly.
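To make the idea of rerunning only a subset of a DAG concrete, the sketch below works out which tasks need to run again after a failure: the failed tasks plus everything downstream of them. It is a minimal plain Python illustration under stated assumptions. The task names and the `tasks_to_rerun` helper are hypothetical, and it does not call any Databricks API.

```python
from collections import deque

# Hypothetical task graph: each task maps to the tasks it depends on.
DEPENDS_ON = {
    "ingest_bronze": [],
    "clean_silver": ["ingest_bronze"],
    "enrich_silver": ["ingest_bronze"],
    "defects_gold": ["clean_silver"],
    "demand_gold": ["clean_silver", "enrich_silver"],
}


def tasks_to_rerun(depends_on, failed):
    """Return the failed tasks plus every task downstream of them."""
    # Invert the edges so each task lists the tasks that depend on it.
    downstream = {task: [] for task in depends_on}
    for task, parents in depends_on.items():
        for parent in parents:
            downstream[parent].append(task)

    # Breadth-first walk from the failed tasks through their dependents.
    rerun = set(failed)
    queue = deque(failed)
    while queue:
        current = queue.popleft()
        for child in downstream[current]:
            if child not in rerun:
                rerun.add(child)
                queue.append(child)
    return rerun


if __name__ == "__main__":
    print(sorted(tasks_to_rerun(DEPENDS_ON, ["enrich_silver"])))
```

In this example a failure in `enrich_silver` leaves `ingest_bronze`, `clean_silver` and `defects_gold` untouched, so their earlier results can be kept rather than recomputed.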
Simplifying governance with Unity Catalog
Corning decided to migrate all their workspaces to Unity Catalog due to its ease of use and robust governance capabilities. They previously relied on Privacera for access management and role-based management but found Unity Catalog more advantageous for several reasons. The ability of Unity Catalog to govern many kinds of assets, including ingested, raw, unstructured and structured data as well as SQL code, played a crucial role in their decision. Additionally, the seamless integration of Databricks Assistant with Unity Catalog was a significant benefit.
One of the primary challenges Corning faced prior to adopting Unity Catalog was the cumbersome process of daily metadata migration or cloning from upper environments to provide raw data for development purposes. This process was time-consuming and complex, often resulting in data duplication across different workspaces. With Unity Catalog, Corning can now bind workspaces and create catalogs accessible from non-production environments for read and write operations, eliminating the need for data duplication. Additionally, harnessing the power of System Tables, Corning built an advanced user metric dashboard that offered a clear view of critical metrics, including the total number of active users in the Databricks system, total daily active users and the cost consumption of Databricks Units (DBUs) by user and division. This level of visibility was a game changer for Corning.
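The metrics named above (total active users, daily active users and DBU consumption by user and division) amount to a few simple aggregations over usage records. The sketch below shows that arithmetic in plain Python. It is an assumption-laden illustration: the records are made-up sample data, the field names (`date`, `user`, `division`, `dbus`) are hypothetical and do not reflect the actual System Tables schema, and Corning's dashboard is not described in enough detail to reproduce.

```python
from collections import defaultdict

# Made-up sample usage records; field names are hypothetical.
USAGE = [
    {"date": "2024-05-01", "user": "ana", "division": "optical", "dbus": 12.5},
    {"date": "2024-05-01", "user": "raj", "division": "display", "dbus": 4.0},
    {"date": "2024-05-02", "user": "ana", "division": "optical", "dbus": 7.5},
    {"date": "2024-05-02", "user": "lee", "division": "display", "dbus": 6.0},
]


def summarize(records):
    """Aggregate active-user counts and DBU consumption from usage records."""
    daily_users = defaultdict(set)
    dbus_by_user = defaultdict(float)
    dbus_by_division = defaultdict(float)

    for record in records:
        daily_users[record["date"]].add(record["user"])
        dbus_by_user[record["user"]] += record["dbus"]
        dbus_by_division[record["division"]] += record["dbus"]

    # A user counts as active if they appear on at least one day.
    all_users = set()
    for users in daily_users.values():
        all_users |= users

    return {
        "total_active_users": len(all_users),
        "daily_active_users": {d: len(u) for d, u in sorted(daily_users.items())},
        "dbus_by_user": dict(dbus_by_user),
        "dbus_by_division": dict(dbus_by_division),
    }


if __name__ == "__main__":
    for metric, value in summarize(USAGE).items():
        print(metric, value)
```

On a real platform the same grouping would more likely be expressed as a query over the usage tables, with the division derived from whatever tagging or directory data the organization maintains.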
Looking ahead with Databricks Assistant, Lakehouse Federation and serverless compute
“The introduction of Databricks Assistant has truly impressed me. I no longer have to write code. What used to take me one hour to write I did in five minutes. From the advanced users to the basic users at Corning, everyone is amazed by the immediate impact,” stated Hamenoo. At Corning, Databricks Assistant not only saved significant time but also empowered basic users who are not formally trained as programmers and may not consider themselves coders to effectively engage in data engineering tasks.
Corning is previewing Lakehouse Federation, which allows them to access data in Google BigQuery without the need for ingestion, further simplifying their data management processes.
Data teams have also started experimenting with serverless compute for Workflows. Hamenoo summarized, “What I’m most looking forward to for serverless Workflows for myself and my team is no longer having to spend time overseeing clusters, policies and upgrades. This shift to serverless will free up valuable bandwidth that was previously dedicated to platform administration.”