Kapil is a data architect with experience in building scalable, reliable, distributed compute and data platforms. He has worked at Adobe, Microsoft, Snapdeal and is currently working at Amazon.
Kapil Malik and Arvind Heda will discuss a solution for interactive querying of large scale structured data, stored in a distributed file system (HDFS / S3), in a scalable and reliable manner using a unique combination of Spark SQL, Apache Zeppelin and Spark Job-server (SJS) on Yarn. The solution is production tested and can cater to thousands of queries processing terabytes of data every day. It contains following components – 1. Zeppelin server : A custom interpreter is deployed, which de-couples spark context from the user notebooks. It connects to the remote spark context on Spark Job-server. A rich set of APIs are exposed for the users. The user input is parsed, validated and executed remotely on SJS. 2. Spark job-server : A custom application is deployed, which implements the set of APIs exposed on Zeppelin custom interpreter, as one or more spark jobs. 3. Context router : It routes different user queries from custom interpreter to one of many Spark Job-servers / contexts. The solution has following characteristics – * Multi-tenancy There are hundreds of users, each having one or more Zeppelin notebooks. All these notebooks connect to same set of Spark contexts for running a job. * Fault tolerance The notebooks do not use Spark interpreter, but a custom interpreter, connecting to a remote context. If one spark context fails, the context router sends user queries to another context. * Load balancing Context router identifies which contexts are under heavy load / responding slowly, and selects the most optimal context for serving a user query. * Efficiency We use Alluxio for caching common datasets. * Elastic resource usage We use spark dynamic allocation for the contexts. This ensures that cluster resources are blocked by this application only when it’s doing some actual work. Session hashtag: #EUeco9