Sparkler—Crawler on Apache Spark - Databricks

A web crawler is a bot that fetches resources from the web to build applications such as search engines and knowledge bases. In this talk, Karanjeet Singh and Thamme Gowda describe Sparkler (a contraction of Spark-Crawler), a new crawler that draws on recent advances in distributed computing and information retrieval by combining several Apache projects: Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, high-performance web crawler that is an evolution of Apache Nutch and runs on an Apache Spark cluster. Source code: https://github.com/USCDataScience/sparkler

Learn more:

  • Streaming – Getting Started with Apache Spark on Databricks
  • Diving into Apache Spark Streaming’s Execution Model

    About Karanjeet Singh

    Karanjeet Singh is a computer science graduate student at the University of Southern California (USC) who always finds himself drawn to data challenges. He is also a research affiliate at NASA Jet Propulsion Laboratory, working on data science projects funded by DARPA. In his free time, he loves contributing to the open source community. Before graduate school, he worked at Computer Sciences Corporation (CSC) as a web developer for a U.S.-based financial firm.

    About Thamme Gowda Narayanaswamy

    Thamme Gowda is a graduate student (MS CS) at the University of Southern California, Los Angeles, and an intern at NASA Jet Propulsion Laboratory, Pasadena. His interests are machine learning, information retrieval, and distributed computing. As a member of the Apache Software Foundation, he supports and promotes open source technologies.