Creating a Custom PySpark Stream Reader with PySpark 4.0
Overview
| Experience | In Person |
| --- | --- |
| Type | Lightning Talk |
| Track | Data Engineering and Streaming |
| Industry | Enterprise Technology, Retail and CPG - Food, Travel and Hospitality |
| Technologies | Apache Spark, Delta Lake, Databricks SQL |
| Skill Level | Advanced |
| Duration | 20 min |
PySpark supports many data sources out of the box, such as Apache Kafka, JDBC, Delta Lake and more. However, older systems, such as those that use the JMS protocol, are not supported by default and require considerable extra work to read from. One such example is ActiveMQ for streaming. Traditionally, ActiveMQ users have had to go through an intermediary to read the stream with Spark, for example writing messages to a MySQL database with Java code and then reading that table through Spark JDBC. With PySpark 4.0's custom data sources (supported in DBR 15.3+), we can cut out that middle layer and consume the queues directly from PySpark, in either batch or Structured Streaming mode. This saves developers considerable time and complexity in getting source data into Delta Lake, governed by Unity Catalog and orchestrated with Databricks Workflows.
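As a rough sketch of what this looks like, the outline below uses the PySpark 4.0 Python Data Source API (`pyspark.sql.datasource`) to define a streaming reader and register it under a custom format name. The source name `activemq`, the schema, and the counter-based offset bookkeeping are illustrative assumptions, not the speaker's implementation; a real reader would replace the synthetic messages in `read()` with a JMS/STOMP client consuming from the broker, and `spark` is assumed to be an active SparkSession on DBR 15.3+ or Spark 4.0.

```python
from pyspark.sql.datasource import DataSource, DataSourceStreamReader, InputPartition


class ActiveMQDataSource(DataSource):
    """Hypothetical custom source; a real version would carry broker URL,
    queue name, and credentials in self.options."""

    @classmethod
    def name(cls):
        return "activemq"

    def schema(self):
        return "message STRING, offset INT"

    def streamReader(self, schema):
        return ActiveMQStreamReader(schema, self.options)


class ActiveMQStreamReader(DataSourceStreamReader):
    def __init__(self, schema, options):
        self.options = options
        self.current = 0

    def initialOffset(self):
        # Starting position of the stream; offsets are plain dicts.
        return {"offset": 0}

    def latestOffset(self):
        # A real reader would poll the broker for newly arrived messages;
        # here a counter keeps the sketch self-contained and runnable.
        self.current += 10
        return {"offset": self.current}

    def partitions(self, start, end):
        # One partition covering the offset range for this microbatch.
        return [InputPartition((start["offset"], end["offset"]))]

    def read(self, partition):
        # This is where a JMS/STOMP client would consume queue messages.
        first, last = partition.value
        for i in range(first, last):
            yield (f"message-{i}", i)

    def commit(self, end):
        # Acknowledge messages up to `end` on the broker, if required.
        pass


# Register the source and consume it like any other streaming format.
spark.dataSource.register(ActiveMQDataSource)
(spark.readStream.format("activemq").load()
     .writeStream.format("console")
     .option("checkpointLocation", "/tmp/activemq_demo")
     .start())
```

From here the stream can be written to a Delta table instead of the console, which is how the queue data would land in Delta Lake under Unity Catalog governance.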
Session Speakers
Skyler Myers
Head of Data Engineering
Entrada