At Databricks, we are often asked how to go beyond the basic Apache Spark tutorials and start building real applications with Spark. In response, we are developing reference applications on GitHub to demonstrate exactly that. We believe this is a great way to learn Spark, and we plan to incorporate more Spark features into the applications over time. We also hope to highlight technologies that work well with Spark and to include best practices.
Our first reference application is log analysis with Spark. Logs are a large and common data set that contain a rich set of information. Log data can be used for monitoring web servers, improving business and customer intelligence, building recommendation systems, preventing fraud, and much more. Spark is a wonderful tool for working with logs: it can process logs faster than Hadoop MapReduce, its APIs make it easy to compute many statistics with little code, and many of Spark's libraries apply directly to log data. In addition, Spark SQL can query logs using familiar SQL syntax, and Spark Streaming can provide real-time log analysis.
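To give a flavor of this kind of analysis, here is a minimal sketch of computing one statistic over Apache access logs with standalone Spark. The log path, the simplified regex, and the surrounding names are illustrative assumptions, not code from the reference application itself:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogStats {
  // Matches the Apache common log format; the final capture group is the
  // content size of the response in bytes.
  val logPattern =
    """^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)""".r

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Log Analyzer"))

    // "access.log" is a placeholder path; point this at your own log file.
    val contentSizes = sc.textFile("access.log")
      .flatMap(line => logPattern.findFirstMatchIn(line))
      .map(m => m.group(9).toLong)
      .cache()

    val count = contentSizes.count()
    if (count > 0) {
      println(s"Average content size: ${contentSizes.reduce(_ + _) / count}")
    }
    sc.stop()
  }
}
```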
The log analysis reference application is broken into sections with detailed explanations and small code examples that demonstrate various features of Spark. We start with a gentle example of processing historical log data using standalone Spark, then cover how to do the same analysis with Spark SQL, before moving on to Spark Streaming. Along the way, we highlight caveats and recommended best practices for using Spark, such as how to refactor your code for reuse between the batch and streaming libraries (sketched below). The examples are self-contained and emphasize one aspect of Spark at a time, so you can experiment with and modify them to deepen your understanding of each topic.
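The batch/streaming reuse idea can be illustrated with a short sketch: write the analysis once against RDDs, then call it from both code paths. The names here (`Analytics`, `responseCodeCount`, `batchLogs`, `logStream`) are hypothetical, chosen only to show the pattern:

```scala
import org.apache.spark.rdd.RDD

// Write the analysis once against RDDs of (responseCode, contentSize) pairs...
object Analytics {
  def responseCodeCount(logs: RDD[(Int, Long)]): RDD[(Int, Long)] =
    logs.mapValues(_ => 1L).reduceByKey(_ + _)
}

// ...then call the same function from both code paths:
//   Batch:     val counts = Analytics.responseCodeCount(batchLogs)
//   Streaming: val counts = logStream.transform(Analytics.responseCodeCount _)
```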
Code from the examples is combined to form a sample end-to-end application, as shown in the diagram above. For now, the final application is an MVP log analysis streaming application. It monitors a directory for new log input files and processes each file as it arrives. Spark Streaming is used to compute statistics over the last N units of time as well as over all of time, refreshing them continuously. At each time interval, an HTML file is written that reflects the latest log data in the folder.
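Below is a minimal sketch of what such a streaming application might look like. The batch interval, window length, directory paths, and the bare-bones HTML output are all assumptions for illustration; the actual reference application computes richer statistics and renders a fuller report:

```scala
import java.io.PrintWriter
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LogStreamingApp {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("Log Analyzer Streaming"), Seconds(10))
    // Windowed counts require checkpointing; the path is a placeholder.
    ssc.checkpoint("/tmp/log-analyzer-checkpoint")

    // Picks up each new file dropped into the monitored directory.
    val logLines = ssc.textFileStream("/tmp/logs")

    // Statistics over the last N units of time: here, a 60-second window
    // recomputed every 10 seconds.
    val windowedCounts = logLines.countByWindow(Seconds(60), Seconds(10))

    windowedCounts.foreachRDD { rdd =>
      val count = rdd.take(1).headOption.getOrElse(0L)
      // The real application would render a full HTML report here.
      new PrintWriter("/tmp/log_stats.html") {
        write(s"<html><body>Lines in the last minute: $count</body></html>")
        close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```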
The second reference application is a popular demo that the Databricks team has given, which was recorded on video here:
This application demonstrates the following:
Now, we are releasing the code so you can run the demo yourself and walk through how it works.
These applications are just a start, though; we will add more and improve our reference applications over time. Please follow these steps to get started: