Thamme Gowda is a graduate student (MS CS) at the University of Southern California, Los Angeles, and an intern at NASA Jet Propulsion Laboratory, Pasadena. His interests are in machine learning, information retrieval, and distributed computing. As a member of the Apache Software Foundation, he supports and promotes open source technologies.
A web crawler is a bot that fetches resources from the web to build applications such as search engines and knowledge bases. In this talk, Karanjeet Singh and Thamme Gowda will describe a new crawler called Sparkler (a contraction of Spark-Crawler) that draws on recent advances in distributed computing and information retrieval by combining several Apache projects, including Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, high-performance web crawler that evolved from Apache Nutch and runs on an Apache Spark cluster.
Learn more: https://github.com/USCDataScience/sparkler