Building a Multimodal Data Lakehouse with the Daft Distributed Python Dataframe
OVERVIEW
| EXPERIENCE | In Person |
| --- | --- |
| TYPE | Lightning Talk |
| TRACK | Data Engineering and Streaming |
| TECHNOLOGIES | Developer Experience, ETL, Orchestration |
| SKILL LEVEL | Intermediate |
| DURATION | 20 min |
Modern data workloads come in all shapes and sizes: numbers, strings, JSON documents, images, whole PDF textbooks, and more. To process this data we still rely on utilities such as ffmpeg for video, jq for JSON, and PyTorch for tensors.
However, these tools were not built for large-scale ETL, so we often end up building bespoke data pipelines that orchestrate data movement and custom tooling. If only downloading images, resizing them, and running vision models were as simple as extracting a substring in SparkSQL!
Daft (https://www.getdaft.io) is a next-generation distributed query engine built on Python and Rust. It provides a familiar dataframe interface for easy and performant processing of multimodal data at scale. Join us as we demonstrate how to build a multimodal data lakehouse using Daft on your existing infrastructure (S3, Delta Lake, Databricks, and Spark).
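As a rough illustration of that dataframe interface, the sketch below uses Daft's expression API to download, decode, and resize images, then applies a stubbed-out classifier as a UDF. The S3 path, column names, and classifier are hypothetical placeholders, not material from the session, and exact method names may vary across Daft versions.

```python
import daft
from daft import col

# Hypothetical dataset of image URLs; the bucket and column names are placeholders.
df = daft.read_parquet("s3://my-bucket/image_metadata.parquet")

df = (
    df
    # Download each image, decode the bytes into an image column, and resize it,
    # all expressed as ordinary dataframe operations.
    .with_column("image", col("image_url").url.download().image.decode())
    .with_column("thumbnail", col("image").image.resize(224, 224))
)


# Stand-in for a real vision model (e.g. a PyTorch classifier) run as a Daft UDF.
@daft.udf(return_dtype=daft.DataType.string())
def classify(images):
    # A real implementation would batch these images through a model;
    # here we just emit a dummy label per row.
    return ["unknown" for _ in images.to_pylist()]


df = df.with_column("label", classify(col("thumbnail")))
df.show()
```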
SESSION SPEAKERS
Jay Chia
Co-Founder
Eventual Computing