Grace Huang is currently an engineering manager in Intel SSG (Software and Services Group), responsible for enhancing and optimizing advanced Big Data technologies, including Hadoop and Spark. Before that, she worked in the big data area at Intel for over 5 years, gaining extensive experience in Hadoop and HBase performance tuning and optimization.
As “the most active cluster data processing engine after Hadoop MapReduce”, Spark has already gathered a large community of users and is gradually entering the datacenter for next-generation big data applications. Over the past year, we have put considerable effort into building real-world applications with Spark for several large web sites (e.g., Alibaba, iQiyi and Youku). That work demonstrated real needs and concrete uses of Spark in graph analysis, interactive and batch OLAP/BI, and real-time analytics. It also taught us a good deal about using Spark, for example in memory management and analytic query execution. In this talk, we will present our experience and several lessons learned while building real-world Spark applications in production environments.
New big data analysis paradigms beyond MapReduce have inevitably emerged. In particular, there is increasing demand to mine and explore data in a real-time, streaming manner. As the next-generation big data analytics stack, Spark can already serve this new Real-Time Analytical Processing (RTAP) paradigm well, given some further development effort. For Spark to be a complete data stream management system (DSMS), SQL-like stream manipulation is essential to a good user experience in the RTAP paradigm. In this talk, we will present our proof-of-concept implementation of StreamSQL, built on the Spark Streaming and Catalyst modules, which lets SQL users pick up stream processing quickly and easily. It currently supports simple stream queries and operations that combine streams with structured data, as well as typical Catalyst usages (e.g., LINQ expressions and mixing SQL with DStream operators). We will also cover our ongoing and future work, such as sliding-window support, DDL through Hive, and a uniform input/output format for streams.