A web crawler is a bot that fetches resources from the web to build applications such as search engines and knowledge bases. In this talk, Karanjeet Singh and Thamme Gowda describe a new crawler called Sparkler (a contraction of Spark-Crawler) that takes advantage of recent advances in distributed computing and information retrieval by combining several Apache projects, such as Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, high-performance web crawler that is an evolution of Apache Nutch and runs on an Apache Spark cluster. https://github.com/USCDataScience/sparkler
Karanjeet Singh is a computer science graduate student at the University of Southern California (USC) who always finds himself drawn to data challenges. He is also a research affiliate at NASA Jet Propulsion Laboratory, working on data science projects funded by DARPA. In his free time, he loves contributing to the open source community. Prior to attending graduate school, he worked at Computer Sciences Corporation (CSC) as a web developer for a U.S.-based financial firm.
Thamme Gowda is a graduate student (MS CS) at the University of Southern California, Los Angeles, and an intern at NASA Jet Propulsion Laboratory, Pasadena. His interests are in machine learning, information retrieval, and distributed computing. As a member of the Apache Software Foundation, he supports and promotes open source technologies.