Weiqing has been working in the Apache Hadoop ecosystem since 2015 and is a Spark/HBase/Ambari/Hadoop contributor. She is currently a software engineer in Spark team at Hortonworks. Before that, she obtained a Master Degree in Computational Data Science from Carnegie Mellon University. In 2011-2013, she was a software engineer at Schlumberger. At that time, she was working on a real-time acquisition system designed for field engineers to acquire and process various types of underground data.
Both Spark and HBase are widely used, but how to use them together with high performance and simplicity is a very challenging topic. Spark HBase Connector(SHC) provides feature rich and efficient access to HBase through Spark SQL. It bridges the gap between the simple HBase key value store and complex relational SQL queries and enables users to perform complex data analytics on top of HBase using Spark. SHC implements the standard Spark data source APIs, and leverages the Spark catalyst engine for query optimization. To achieve high performance, SHC constructs the RDD from scratch instead of using the standard HadoopRDD. With the customized RDD, all critical techniques can be applied and fully implemented, such as partition pruning, column pruning, predicate pushdown and data locality. The design makes the maintenance easy, while achieving a good tradeoff between performance and simplicity. In addition to fully supporting all the Avro schemas natively, SHC has also integrated natively with Phoenix data types. With SHC, Spark can execute batch jobs to read/write data from/into Phoenix tables. Phoenix can also read/write data from/into HBase tables created by SHC. For example, users can run a complex SQL query on top of an HBase table created by Phoenix inside Spark, perform a table join against an Dataframe which reads the data from a Hive table, or integrate with Spark Streaming to implement a more complicated system. In this talk, apart from explaining why SHC is of great use, we will also demo how SHC works, how to use SHC in secure/non-secure clusters, how SHC works with multiple secure HBase clusters, etc. This talk will also benefit people who use Spark and other data sources (besides HBase) as it inspires them with ideas of how to support high performance data source access at the Spark DataFrame level. Session hashtag: #EUeco7Learn more: