Creating a Custom PySpark Stream Reader with PySpark 4.0

Overview

Experience: In Person
Type: Lightning Talk
Track: Data Engineering and Streaming
Industry: Enterprise Technology, Retail and CPG - Food, Travel and Hospitality
Technologies: Apache Spark, Delta Lake, Databricks SQL
Skill Level: Advanced
Duration: 20 min

PySpark supports many data sources out of the box, such as Apache Kafka, JDBC, ODBC, and Delta Lake. However, some older systems, such as those that use the JMS protocol, are not supported by default and require considerable extra work for developers to read from. One such example is ActiveMQ for streaming. Traditionally, ActiveMQ users have had to route messages through an intermediary to read the stream with Spark, for example writing to a MySQL database with Java code and then reading that table via Spark JDBC. With PySpark 4.0's custom data sources (supported in DBR 15.3+), we can cut out that middleman and consume the queues directly from PySpark, in either batch or streaming mode. This saves developers considerable time and complexity in getting source data into Delta Lake, governed by Unity Catalog, and orchestrated with Databricks Workflows.
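As a preview of the pattern, the sketch below shows a minimal streaming source built on the PySpark 4.0 Python Data Source API (pyspark.sql.datasource). The class names, the "activemq" format name, and the simulated message feed are illustrative assumptions, not the speaker's implementation; a real reader would pull messages from the broker (e.g., via a STOMP or JMS client) instead of generating them.

from pyspark.sql.datasource import DataSource, DataSourceStreamReader, InputPartition


class QueuePartition(InputPartition):
    # One slice of messages between two stream offsets.
    def __init__(self, start, end):
        self.start = start
        self.end = end


class ActiveMQStreamReader(DataSourceStreamReader):
    def __init__(self, options):
        self.options = options
        self.position = 0

    def initialOffset(self):
        # Offsets are JSON-serializable dicts checkpointed by Spark.
        return {"position": 0}

    def latestOffset(self):
        # Simulate five new messages per microbatch; a real reader
        # would ask the broker what is actually available.
        self.position += 5
        return {"position": self.position}

    def partitions(self, start, end):
        # A single partition covering this microbatch's offset range.
        return [QueuePartition(start["position"], end["position"])]

    def read(self, partition):
        # Yield one tuple per message, matching the declared schema.
        for i in range(partition.start, partition.end):
            yield (i, f"message-{i}")

    def commit(self, end):
        # Invoked after a microbatch is durably processed; a real
        # reader could acknowledge messages back to the broker here.
        pass


class ActiveMQDataSource(DataSource):
    @classmethod
    def name(cls):
        return "activemq"

    def schema(self):
        return "offset INT, body STRING"

    def streamReader(self, schema):
        return ActiveMQStreamReader(self.options)


# Register the source, then stream it straight into a Delta table.
spark.dataSource.register(ActiveMQDataSource)
(spark.readStream.format("activemq").load()
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/activemq")
    .toTable("bronze.activemq_messages"))

Once registered, the source behaves like any built-in format: Spark handles offset tracking, checkpointing, and exactly-once delivery into the Delta sink, which is what removes the intermediary system from the pipeline.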

Session Speakers

Skyler Myers

Head of Data Engineering
Entrada