Building a Dataset Search Engine with Spark and Elasticsearch - Databricks

Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is cumbersome at best to have Elasticsearch running on a separate machine or instance. With Spark Cluster with Elasticsearch Inside, it is possible to run an embedded instance of Elasticsearch on the driver node of a Spark cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
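
As a rough illustration of the ES-Hadoop integration mentioned above, the following sketch shows how a Spark DataFrame might be written to Elasticsearch from PySpark. It assumes the elasticsearch-spark connector jar is available on the Spark classpath. The helper names (`es_write_options`, `index_dataframe`), the host, the port, and any index name passed in are illustrative and are not taken from the talk.

```python
def es_write_options(nodes="localhost", port=9200, id_field=None):
    """Build ES-Hadoop connector settings as a plain dict of strings.

    The defaults (localhost:9200) are assumptions for a local setup.
    """
    options = {
        "es.nodes": nodes,     # Elasticsearch host(s) to connect to
        "es.port": str(port),  # REST port; 9200 is the Elasticsearch default
    }
    if id_field is not None:
        # Use this DataFrame column as the Elasticsearch document id
        options["es.mapping.id"] = id_field
    return options


def index_dataframe(df, index, options):
    """Append the rows of a Spark DataFrame to an Elasticsearch index.

    `df` is expected to be a pyspark.sql.DataFrame; the function only
    touches its DataFrameWriter, so no pyspark import is needed here.
    """
    (df.write
       .format("org.elasticsearch.spark.sql")  # ES-Hadoop Spark SQL data source
       .options(**options)
       .mode("append")
       .save(index))
```

A call might look like `index_dataframe(df, "datasets", es_write_options(id_field="dataset_id"))`. In the embedded setup described above, `nodes` would presumably be an address of the driver that the executors can reach.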

Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. The motivation is that once Elasticsearch is running on Spark, it becomes both possible and interesting to have the in-memory Elasticsearch instance join an (existing) Elasticsearch cluster. This in turn enables indexing of the Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should interest Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
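
To make the idea of indexing Datasets more concrete, here is a minimal sketch of what a dataset-level search document could look like and how a batch of such documents could be serialized for the Elasticsearch bulk API. The talk does not specify which fields are indexed, so the field names (`name`, `path`, `columns`, `row_count`) and both helper functions are hypothetical.

```python
import json


def dataset_document(name, path, columns, row_count):
    """Describe one dataset as a searchable document (hypothetical fields)."""
    return {
        "name": name,
        "path": path,
        # columns is a list of (column_name, column_type) pairs
        "columns": [{"name": c, "type": t} for c, t in columns],
        "row_count": row_count,
    }


def bulk_payload(index, docs, id_field="name"):
    """Serialize documents as newline-delimited JSON for the bulk API.

    Each document becomes two lines: an "index" action line and the
    document source. Every line, including the last, ends with a newline.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_id": doc[id_field]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "".join(line + "\n" for line in lines)
```

A pipeline step could build one such document per dataset it writes, for example from a DataFrame's schema and row count, and send the payload to the cluster's `_bulk` endpoint.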

Learn more:

  • Elasticsearch
  • Application Spotlight: Elasticsearch
  • Using Spark and Elasticsearch for real-time data analysis
  • About Oscar Castañeda-Villagrán

    Oscar studied Computer Science at Delft University of Technology. He is now a Data Scientist at Xoom, a PayPal service, and a researcher at Universidad del Valle de Guatemala. Oscar is interested in Dataset Search, Learning to Rank, and Apache Spark, and is a proponent of Model-Driven Data Product Design & Development.