Spark SQL and Mllib are optimized for running feature extraction and machine learning algorithms on large columnar datasets through full scan. However when dealing with document datasets where the features are represented as variable number of columns in each document and use-cases demand searching over columns to retrieve documents and generate machine learning models on the documents retrieved in near realtime, a close integration within Spark and Lucene was needed. We introduced LuceneDAO (data access object) that supports building distributed lucene shards from dataframe and save the shards to HDFS for query processors like SolrCloud. Lucene shards maintain the document-term view for search and vector space representation for machine learning pipelines. We used Spark as our distributed query processing engine where each query is represented as boolean combination over terms. LuceneDAO is used to load the shards to Spark executors and power sub-second distributed document retrieval for the queries. We developed Spark Mllib based estimators for classification and recommendation pipelines and use the vector space representation to train, cross-validate and score models to power synchronous and asynchronous APIs. Our synchronous API uses Spark-as-a-Service while our asynchronous API uses kafka, spark streaming and HBase for maintaining job status and results. In this talk we will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms. We will show algorithmic details of our model building pipelines that are employed for performance optimization and demonstrate latency of the APIs on a suite of queries generated over 1M+ terms. Key takeaways from the talk will be a thorough understanding of how to make Lucene powered search a first class citizen to build interactive machine learning pipelines using Spark Mllib.
Debasish Das joined Verizon's Big Data Analytics Group in 2013. Prior to joining Verizon Debasish worked at Intel, Synopsys, Magma and Mentor Graphics. He did his PhD in EECS from Northwestern and BTech in CS from IIT Kharagpur. His current interests include scaling distributed convex optimization, developing machine learning algorithms for batch/streaming workflows and serving the prediction output at scale. His current focus is on developing recommendation engines for mobile advertising and anomaly detection for network security. He has contributed to open source projects like Apache Spark, ScalaNLP Breeze and Embotech ECOS.
Pramod Lakshmi Narasimha is currently Principal Engineer at Verizon's Big Data Analytics Group. He has developed scalable machine learning solutions for Verizon's marketing analytics and ad-targeting products. His interests include data analysis and feature engineering on large datasets.