Running Streaming Jobs Once a Day For 10x Cost Savings
This is the sixth post in a multi-part series about how you can perform complex streaming analytics using Apache Spark. Traditionally, when people think about streaming, terms such as “real-time,” “24/7,” or “always on” come to mind. However, you may have cases where data arrives only at fixed intervals. That is, data appears every hour or...
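The pattern behind that post can be sketched with Structured Streaming's run-once trigger (exposed in PySpark as `trigger(once=True)`); the app name, schema, and paths below are hypothetical placeholders, not anything from the post itself:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("daily-run-once").getOrCreate()

# Hypothetical schema for the incoming JSON records.
schema = StructType([
    StructField("event", StringType()),
    StructField("ts", TimestampType()),
])

events = (spark.readStream
          .schema(schema)
          .json("/data/incoming"))  # hypothetical input directory

# trigger(once=True) processes everything that arrived since the last
# run, records progress in the checkpoint, and then stops -- so the job
# can be scheduled once a day instead of running 24/7.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/events_parquet")            # hypothetical sink
         .option("checkpointLocation", "/data/checkpoints")  # hypothetical
         .trigger(once=True)
         .start())

query.awaitTermination()
```

Because the checkpoint tracks what has already been processed, rerunning the job the next day picks up only the new data; the cost savings come from paying for a cluster only while this short run executes.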
Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2
This is the third post in a multi-part series about how you can perform complex streaming analytics using Apache Spark. In this blog, we will show how Spark SQL's APIs can be leveraged to consume and transform complex data streams from Apache Kafka. Using these simple APIs, you can express complex transformations like exactly-once event-time...
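A minimal sketch of that Kafka-to-aggregation pattern, assuming Spark 2.2+ with the `spark-sql-kafka-0-10` package on the classpath; the broker address, topic name, and payload schema are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# Hypothetical schema of the JSON payload in each Kafka message value.
schema = StructType([
    StructField("device", StringType()),
    StructField("ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
       .option("subscribe", "events")                     # hypothetical topic
       .load())

# Kafka values arrive as bytes; cast to string and parse the JSON payload.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("data"))
          .select("data.*"))

# Event-time windowed counts with a watermark to bound late data;
# combined with a checkpointed sink, this yields exactly-once results.
counts = (parsed
          .withWatermark("ts", "10 minutes")
          .groupBy(window(col("ts"), "5 minutes"), col("device"))
          .count())
```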
Working with Complex Data Formats with Structured Streaming in Apache Spark 2.1
In part 1 of this series on Structured Streaming blog posts, we demonstrated how easy it is to write an end-to-end streaming ETL pipeline using Structured Streaming that converts JSON CloudTrail logs into a Parquet table. The blog highlighted that one of the major challenges in building such pipelines is to read and transform data...
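The JSON-to-Parquet pipeline that post walks through can be sketched as follows; CloudTrail log files hold a top-level `Records` array, and the field subset and paths here are hypothetical stand-ins:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.appName("cloudtrail-etl").getOrCreate()

# Hypothetical subset of the CloudTrail record schema.
record = StructType([
    StructField("eventName", StringType()),
    StructField("eventTime", StringType()),
    StructField("awsRegion", StringType()),
])
schema = StructType([StructField("Records", ArrayType(record))])

logs = (spark.readStream
        .schema(schema)
        .json("/cloudtrail/raw"))  # hypothetical input path

# Each file carries an array of records; explode it into one row per event.
events = (logs.select(explode(col("Records")).alias("r"))
          .select("r.*"))

(events.writeStream
 .format("parquet")
 .option("path", "/cloudtrail/parquet")            # hypothetical sink
 .option("checkpointLocation", "/cloudtrail/chk")  # hypothetical
 .start())
```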
Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1
We are well into the Big Data era, with organizations collecting massive amounts of data on a continual basis. Yet, the value of this data deluge hinges on the ability to extract actionable insights in a timely fashion. Hence, there is an increasing need for continuous applications that can derive real-time actionable insights from massive...