Creating a Custom PySpark Stream Reader with PySpark 4.0
Overview
| Experience | In Person |
| --- | --- |
| Type | Lightning Talk |
| Track | Data Engineering and Streaming |
| Industry | Enterprise Technology, Retail and CPG - Food, Travel and Hospitality |
| Technologies | Apache Spark, Delta Lake, Databricks SQL |
| Skill Level | Advanced |
| Duration | 20 min |
PySpark supports many data sources out of the box, such as Apache Kafka, JDBC, Delta Lake and more. However, older systems, such as those that use the JMS protocol, are not supported by default and require considerable extra work to read from. One such example is ActiveMQ for streaming. Traditionally, ActiveMQ users have had to go through an intermediary to read the stream with Spark, for example writing messages to a MySQL database with Java code and then reading that table through Spark JDBC. With PySpark 4.0's custom data sources (supported in DBR 15.3+), we can cut out that middle layer and consume the queues directly from PySpark, in either batch or Structured Streaming mode. This saves developers considerable time and complexity in getting source data into Delta Lake, governed by Unity Catalog and orchestrated with Databricks Workflows.
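As a rough sketch of what this looks like, the outline below uses the PySpark 4.0 Python Data Source API (`pyspark.sql.datasource`) to define a streaming reader and register it under a custom format name. The source name `activemq`, the schema, and the counter-based offset bookkeeping are illustrative assumptions, not the speaker's implementation; a real reader would replace the synthetic messages in `read()` with a JMS/STOMP client consuming from the broker, and `spark` is assumed to be an active SparkSession on DBR 15.3+ or Spark 4.0.

```python
from pyspark.sql.datasource import DataSource, DataSourceStreamReader, InputPartition


class ActiveMQDataSource(DataSource):
    """Hypothetical custom source; a real version would carry broker URL,
    queue name, and credentials in self.options."""

    @classmethod
    def name(cls):
        return "activemq"

    def schema(self):
        return "message STRING, offset INT"

    def streamReader(self, schema):
        return ActiveMQStreamReader(schema, self.options)


class ActiveMQStreamReader(DataSourceStreamReader):
    def __init__(self, schema, options):
        self.options = options
        self.current = 0

    def initialOffset(self):
        # Starting position of the stream; offsets are plain dicts.
        return {"offset": 0}

    def latestOffset(self):
        # A real reader would poll the broker for newly arrived messages;
        # here a counter keeps the sketch self-contained and runnable.
        self.current += 10
        return {"offset": self.current}

    def partitions(self, start, end):
        # One partition covering the offset range for this microbatch.
        return [InputPartition((start["offset"], end["offset"]))]

    def read(self, partition):
        # This is where a JMS/STOMP client would consume queue messages.
        first, last = partition.value
        for i in range(first, last):
            yield (f"message-{i}", i)

    def commit(self, end):
        # Acknowledge messages up to `end` on the broker, if required.
        pass


# Register the source and consume it like any other streaming format.
spark.dataSource.register(ActiveMQDataSource)
(spark.readStream.format("activemq").load()
     .writeStream.format("console")
     .option("checkpointLocation", "/tmp/activemq_demo")
     .start())
```

From here the stream can be written to a Delta table instead of the console, which is how the queue data would land in Delta Lake under Unity Catalog governance.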
Session Speakers
Skyler Myers
Head of Data Engineering
Entrada