Xiao Li is an engineering manager, Apache Spark Committer and PMC member at Databricks. His main interests are on Spark SQL, data replication and data integration. Previously, he was an IBM master inventor and an expert on asynchronous database replication and consistency verification. He received his Ph.D. from University of Florida in 2011.
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3000 resolved JIRAs. We will talk about the exciting new developments in the Spark 3.0 as well as some other major initiatives that are coming in the future. In this talk, we want to share with the community many of the more important changes with the examples and demos.
The following features are covered: accelerator-aware scheduling, adaptive query execution, dynamic partition pruning, join hints, new query explain, better ANSI compliance, observable metrics, new UI for structured streaming, new UDAF and built-in functions, new unified interface for Pandas UDF, and various enhancements in the built-in data sources [e.g., parquet, ORC and JDBC].
The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance. Session hashtag: #Exp3SAIS