New big data analysis paradigms beyond MapReduce have inevitably emerged. In particular, there is increasing demand to mine and explore data in a real-time, streaming manner. As the next-generation big data analytics stack, Spark already serves this new Real-Time Analytical Processing (RTAP) paradigm well, with further development efforts under way. To be a complete data stream management system (DSMS), however, it needs SQL-like stream manipulation, which is essential for a good user experience in the RTAP paradigm.
In this talk, we present our proof-of-concept (POC) implementation of StreamSQL, built on the Spark Streaming and Catalyst modules, which lets SQL users pick up stream processing quickly. It currently supports simple stream queries, operations that combine streams with structured data, and the typical Catalyst usages (e.g., LINQ expressions and mixing SQL with DStream operators). We also cover our ongoing and future work, such as sliding-window support, DDL via Hive, and a uniform input/output format for streams.
Grace Huang is currently an engineering manager in Intel SSG (Software and Services Group), responsible for advanced big data technology enhancement and optimization, including Hadoop, Spark, and others. Prior to that, she worked in the big data area at Intel for over 5 years, with extensive experience in Hadoop and HBase performance tuning and optimization.
Jerry Shao is a software engineer on Intel's big data team, focused on the Spark ecosystem, and an active Spark contributor. His interests include distributed storage and computing systems.