Kiran Chitturi is a software developer at Lucidworks. He works on Lucidworks' enterprise product, Fusion, and currently leads development of the spark-solr project (https://github.com/LucidWorks/spark-solr).
Solr has been adopted by all major Hadoop platform vendors as the de facto standard for big data search because of its ability to scale to meet even the most demanding workloads. As more organizations seek to leverage Spark for big data analytics and machine learning, the need for seamless integration between Spark and Solr has emerged. In this presentation, Kiran Chitturi introduces an open source project that exposes Solr as a SparkSQL data source. Attendees will come away with a solid understanding of common use cases, access to open source code, and performance metrics to help them develop their own large-scale search and discovery solutions with Spark and Solr.

Specifically, Kiran covers the following topics:

+ Using deep-paging cursors, streaming result sets, and intra-shard splitting to maximize read performance when constructing RDDs from Solr queries
+ High-volume reads into Spark using DocValues and Solr's streaming API
+ Data-locality optimizations when Solr and Spark executors are co-located on the same host
+ Writing DataFrames to Solr
+ Writing to Solr from Spark Streaming jobs
+ Using Solr/Lucene analyzers to perform text analysis in Spark ML pipelines

When discussing big data, especially search on big data, it's important to establish performance metrics. For instance, how many documents per second can be indexed from Spark to Solr using this framework? How many rows per second can be read from Solr into Spark? Kiran concludes his presentation by showing read/write performance metrics achieved using a 10-node Spark/SolrCloud cluster running on YARN in EC2.
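To make the data-source idea concrete, here is a minimal sketch of reading from and writing to Solr through spark-solr's DataFrame API. The `format("solr")` identifier and the `zkhost`, `collection`, and `query` option names follow the spark-solr README; the ZooKeeper address and collection names below are placeholders, and older Spark versions would use `sqlContext` in place of `SparkSession`. This requires a running SolrCloud cluster and is not meant as a definitive implementation.

```scala
import org.apache.spark.sql.SparkSession

object SparkSolrSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-solr-demo")
      .getOrCreate()

    // Read: the connector turns a Solr query into a DataFrame,
    // splitting work across shards (and intra-shard splits) for parallelism.
    val df = spark.read.format("solr")
      .option("zkhost", "zkhost1:2181/solr") // ZooKeeper ensemble for SolrCloud (placeholder)
      .option("collection", "products")      // source collection (placeholder)
      .option("query", "category:books")     // query pushed down to Solr
      .load()

    df.printSchema()

    // Write: index the rows of a DataFrame into another collection.
    df.write.format("solr")
      .option("zkhost", "zkhost1:2181/solr")
      .option("collection", "products_copy") // target collection (placeholder)
      .save()

    spark.stop()
  }
}
```

Because reads and writes go through the standard DataFrame reader/writer interfaces, the same Solr-backed DataFrames can be joined, filtered, and fed into Spark ML pipelines like any other SparkSQL source.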