A large Delta Lake frequently includes a mix of structured and unstructured data. Data teams use Apache Spark™ to analyze structured data, but often struggle to apply the same analysis to unstructured, unlabeled data (e.g. images, video). Teams are forced to use expensive, manual processes to transform unstructured data into something more useful: they either pay a third party to label their data, buy a labeled dataset, or narrow the scope of their project to leverage public datasets. If data teams had faster and more cost-effective ways to convert unstructured data into structured data, they could support more advanced use cases built around their companies’ unique, unstructured datasets.
In this talk, we demonstrate how teams can easily prepare unstructured data for AI and analytics in Databricks. We leverage the LabelSpark library (a connector between Databricks and Labelbox) to connect an unstructured dataset to Labelbox, programmatically set up an ontology for labeling, and return the labeled dataset in a Spark DataFrame. Labeling can be done by humans, AI models in Databricks, or a combination of both. We will also show a model-assisted labeling workflow that allows humans to easily inspect and correct a model’s predicted labels. This process can reduce the amount of unstructured data you need to achieve strong model performance.
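As a rough illustration of the model-assisted labeling idea, the sketch below treats model predictions as pre-labels, queues low-confidence ones for human review, and lets human corrections override the model. It is plain Python and does not use the LabelSpark or Labelbox APIs; the `Prediction` class, the `needs_review` and `merge_labels` helpers, the 0.5 threshold, and the file names are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Prediction:
    """A model's predicted label for one unstructured asset (hypothetical shape)."""
    asset_url: str
    label: str
    confidence: float


def needs_review(predictions: List[Prediction], threshold: float = 0.5) -> List[Prediction]:
    """Return predictions whose confidence is below the threshold (assumed cutoff)."""
    return [p for p in predictions if p.confidence < threshold]


def merge_labels(predictions: List[Prediction], corrections: Dict[str, str]) -> List[dict]:
    """Build one structured row per asset; a human correction overrides the model."""
    rows = []
    for p in predictions:
        corrected = corrections.get(p.asset_url)
        rows.append({
            "asset_url": p.asset_url,
            "label": corrected if corrected is not None else p.label,
            "source": "human" if corrected is not None else "model",
        })
    return rows


# Illustrative data only: two model predictions and one human correction.
predictions = [
    Prediction("a.jpg", "cat", 0.9),
    Prediction("b.jpg", "dog", 0.4),
]
review_queue = needs_review(predictions)
rows = merge_labels(predictions, corrections={"b.jpg": "fox"})
```

In a Databricks notebook, a list of rows like this could then be loaded into a Spark DataFrame with `spark.createDataFrame(rows)` for downstream analysis.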
Labelbox is a training data platform that allows companies to quickly produce structured data from unstructured data. Combining Databricks and Labelbox gives you an end-to-end environment for unstructured data workflows: a query engine built around Delta Lake, fast annotation tools, and a powerful machine learning compute environment.
To learn more, visit www.labelbox.com/databricks-partner
Nick Lee is a Senior Customer Success Manager at Labelbox where he helps AI teams solve challenging problems in computer vision and natural language processing. Nick also leads the LabelSpark project,...
Christopher Amata is a Solutions Engineer at Labelbox where he designs and deploys technical solutions for AI teams. He is also a lead developer for the LabelSpark project, a Labelbox initiative to ac...