Building Real-Time BI Systems with Kafka, Spark, and Kudu - Databricks



One of the key challenges in working with real-time and streaming data is that the format used to capture data is not necessarily the optimal format for ad hoc analytic queries. For example, Avro is a convenient and popular serialization format that is well suited to initially bringing data into HDFS. Avro has native integration with Flume and other tools, which makes it a good choice for landing data in Hadoop. But columnar file formats, such as Parquet and ORC, are much better optimized for ad hoc queries that aggregate over a large number of similar rows.
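To see why columnar formats suit aggregate queries, here is a minimal pure-Python sketch of the two layouts. It is illustrative only (it does not use the actual Avro or Parquet APIs): row-oriented storage keeps each record together, so summing one field means touching every full record, while columnar storage keeps each field contiguous, so an aggregate reads only the column it needs.

```python
# Row-oriented layout (Avro-like): each record is stored together.
rows = [
    {"user": "a", "clicks": 3, "region": "us"},
    {"user": "b", "clicks": 5, "region": "eu"},
    {"user": "c", "clicks": 2, "region": "us"},
]

# Columnar layout (Parquet/ORC-like): each field is stored contiguously.
columns = {
    "user":   ["a", "b", "c"],
    "clicks": [3, 5, 2],
    "region": ["us", "eu", "us"],
}

# Aggregating one field over the row layout scans every full record...
total_from_rows = sum(r["clicks"] for r in rows)

# ...while the columnar layout reads only the single column it needs,
# which is the access pattern ad hoc analytic queries rely on.
total_from_columns = sum(columns["clicks"])

print(total_from_rows, total_from_columns)  # 10 10
```

In a real pipeline the same idea shows up as landing raw events in Avro and rewriting them into Parquet for query engines; the data is identical, only the physical layout (and therefore the I/O per query) changes.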

Learn more:

  • Apache Kudu and Spark SQL for Fast Analytics on Fast Data
  • Real-Time End-to-End Integration with Apache Kafka in Apache Spark’s Structured Streaming
  • About Ruhollah Farchtchi

    Ruhollah Farchtchi is Chief Technologist and Vice President of Zoomdata Labs at Zoomdata. He has over 15 years of experience in enterprise data management, architecture, and systems integration. Prior to Zoomdata, Ruhollah held management positions at BearingPoint, Booz-Allen and Unisys. He holds an M.S. in Information Technology from George Mason University.