The Yahoo! Taiwan data team builds two tracks of products: business intelligence (BI) reporting and machine learning. We also work with two kinds of data: transaction data, which is stored in an RDBMS data warehouse, and traffic data, which is kept in the Hadoop farm. Because of the platform gap between the two, combining these data sources is a vexing problem. For the BI team, producing a hybrid-dimension report (one that includes both traffic and transaction data) requires working across both Hive and the RDBMS. And our machine learning team can build models only on traffic data.
Over the last year, we were introduced to Spark and Shark. Together they allow us to keep all of our data in a single store, HDFS, while Shark provides an RDBMS-style interface on top of it. The BI team can now build reports on transaction and traffic data together, since both are available through Shark. The machine learning team, in turn, can use both transaction and traffic data to build more accurate models.
It is exciting that we can bridge these two infrastructures using Spark and Shark. Both the performance and the results have far exceeded our expectations.
Wisely Chen is a Sr. Engineer on the Yahoo! Taiwan data team and a lifelong learner of open source technology.