Wenchen Fan

Software Engineer, Databricks

Wenchen Fan is a software engineer at Databricks, working on Spark Core and Spark SQL. He mainly focuses on the Apache Spark open source community, leading the discussion and reviews of many features/fixes in Spark. He is a Spark committer and a Spark PMC member.

UPCOMING SESSIONS

PAST SESSIONS

Deep Dive into the New Features of Apache Spark 3.0 (Summit 2020)

Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3,000 resolved JIRAs. We will talk about the exciting new developments in Spark 3.0 as well as some other major initiatives that are coming in the future. In this talk, we want to share with the community many of the more important changes, with examples and demos.

The following features are covered: accelerator-aware scheduling, adaptive query execution, dynamic partition pruning, join hints, the new query explain, better ANSI compliance, observable metrics, the new UI for Structured Streaming, new UDAF and built-in functions, the new unified interface for Pandas UDFs, and various enhancements in the built-in data sources (e.g., Parquet, ORC, and JDBC).

Talk Transcript

About us and our Open Source contributions

Hello, everyone. Today, Wenchen and I are glad to share with you the latest updates about the upcoming release, Spark 3.0.

So I'm Xiao Li. Both Wenchen and I are working for Databricks. We focus on the open-source developments. Both of us are Spark committers and PMC members.

About Databricks

Databricks provides a unified data analytics platform to accelerate your data-driven innovation. We are a global company with more than 5,000 customers across various industries, and we have more than 450 partners worldwide. Most of you might have heard of Databricks as the original creator of Spark, Delta Lake, MLflow, and Koalas. These are open-source projects that are leading innovation in the fields of data and machine learning. We continue to contribute to and nurture these open-source communities.

Spark 3.0 Highlights

In Spark 3.0, the whole community resolved more than 3,400 JIRAs. Spark SQL is the new core module, and all the other components are built on top of Spark SQL and Spark Core. Today, the pull requests for Spark SQL and the core constitute more than 60% of Spark 3.0, and over the last few releases this percentage has kept going up. Today, we will focus on the key features in both Spark SQL and the core.

This release delivers many new capabilities, performance gains, and extended compatibility for the Spark ecosystem. It is the result of tremendous contributions from the open-source community. It is impossible to discuss all the new features within 16 minutes; we resolved more than 3,400 JIRAs. Even on this slide, I did my best, but I could only fit 24 new Spark 3.0 features. Today, we would like to present some of them. First, let us talk about the performance-related features.

Spark 3.0 Enhanced Performance

High performance is one of the major advantages when people select Spark as their computation engine. This release keeps enhancing the performance for interactive, batch, streaming, and [inaudible] workloads. Here, I will first cover four of the performance features in the SQL query compiler. Later, Wenchen will talk about the performance enhancements for the built-in data sources.

The four major features in the query compiler are a new framework for adaptive query execution, new runtime filtering for dynamic partition pruning, a greatly reduced query compilation overhead (cut by more than half, especially the optimizer overhead and SQL cache synchronization), and support for a complete set of join hints, another useful feature many people have been waiting for.

Adaptive query execution was available in previous releases. However, the previous framework had a few major drawbacks, and very few companies used it in their production systems. In this release, Databricks and [inaudible] worked together, redesigned the framework, and resolved all the known issues. Let us talk about what we did in this release.

Spark Catalyst Optimizer

Michael Armbrust is the creator of Spark SQL and the Catalyst optimizer. In the initial release of Spark SQL, all the optimizer rules were heuristic-based. To generate good query plans, the query optimizer needs to understand the data characteristics. Then in Spark 2.x, we introduced a cost-based optimizer. However, in most cases data statistics are commonly absent, especially when statistics collection is even more expensive than the data processing itself. Even if the statistics are available, they are likely out of date. With the separation of storage and compute in Spark, the characteristics of the data are unpredictable. The costs are often misestimated due to different deployment environments and black-box user-defined functions; we are unable to estimate the cost of a UDF. Basically, in many cases the Spark optimizer is unable to generate the best plan due to these limitations.

Adaptive Query Execution

For all these reasons, runtime adaptivity becomes more critical for Spark than for traditional systems. So this release introduced a new adaptive query execution framework, called AQE. The basic idea of adaptive planning is simple: we re-optimize the execution plan using the existing rules of the optimizer and the planner after we collect more accurate statistics from the finished plan stages.
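
As a quick sketch, the whole framework is controlled by a single flag, which is disabled by default in Spark 3.0 (the session name below is just for illustration):

    from pyspark.sql import SparkSession

    # Enable the new adaptive query execution framework (off by default in Spark 3.0).
    spark = (SparkSession.builder
             .appName("aqe-demo")
             .config("spark.sql.adaptive.enabled", "true")
             .getOrCreate())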

The red line shows the new logic we added in this release. Instead of directly executing the optimized plan, we send the unfinished plan segments back to the existing optimizer and planner, re-optimize them, and build a new execution plan. This release includes three adaptive features: we can convert a sort merge join to a broadcast hash join based on the runtime statistics, we can shrink the number of reducers after over-partitioning, and we can handle skew joins at runtime. If you want to know more details, please read the blog post linked here. Today, I will briefly explain them one by one.

Maybe most of you have already learned many performance tuning tips. For example, to make your join faster, you might guide the optimizer to choose a broadcast hash join instead of a sort merge join: you can increase spark.sql.autoBroadcastJoinThreshold or use a broadcast join hint. However, this is hard to tune. You might hit out-of-memory exceptions or even get worse performance, and even if it works now, it is hard to maintain over time because it is sensitive to your data and workloads.
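
For reference, here is a minimal sketch of both approaches, assuming an active SparkSession named spark; the table names are hypothetical and the threshold value is only illustrative:

    from pyspark.sql.functions import broadcast

    # Option 1: raise the auto-broadcast threshold (in bytes); risky if the "small" side is not really small.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

    # Option 2: hint the optimizer explicitly on the dimension table.
    orders = spark.table("orders")
    customers = spark.table("dim_customers")
    joined = orders.join(broadcast(customers), "customer_id")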

You might be wondering why Spark is unable to make the wise choice by itself. I can easily list multiple reasons:
  • The statistics might be missing or out of date.
  • The files are compressed.
  • The file format is columnar, so the file size does not represent the actual data volume.
  • The filters could be complex.
  • The filters might also contain black-box UDFs.
  • The whole query fragment might be large and complex, making it hard for Spark to estimate the actual data volume and make the best choice.

Convert Sort Merge Join to Broadcast Hash Join

So this is an example showing how AQE converts a sort merge join to a broadcast hash join at runtime. First, execute the leaf stages and query the statistics from the shuffle operators, which materialize the query fragments. You can see the actual size of stage two is much smaller than the estimated size; it dropped from 30 megabytes to 8 megabytes, so we can re-optimize the remaining plan and change the join algorithm from sort merge join to broadcast hash join.

Another popular performance tuning tip is to tune the configuration spark.sql.shuffle.partitions. The default value is a magic number, 200. The original default was 8; later it was increased to 200. I believe no one knows why it became 200 instead of 50, 400, or 2,000. It is very hard to tune, to be honest. Because it is a global configuration, it is almost impossible to pick the best value for every query fragment with a single setting, especially when your query plan is huge and complex.

If you set it to a very small value, each partition will be huge, and aggregations and sorts might need to spill data to disk. If the value is too big, each partition will be small, but the number of partitions will be large, which causes inefficient I/O, and the performance bottleneck could become the task scheduler. That slows everything down. Also, it is very hard to maintain over time.

Dynamically Coalesce Shuffle Partitions

Now we can solve it in a smart way. We first set the initial partition number to a big one. After we execute the leaf query stages, we know the actual size of each partition. Then we can automatically coalesce the nearby partitions and reduce the number of partitions to a smaller one. This example shows how we reduce the number of partitions from 50 to 5 at runtime by adding the coalesce operation at runtime.
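
A sketch of the configs behind this behavior (the values here are illustrative):

    # Let AQE coalesce shuffle partitions at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Start with a deliberately large partition count; AQE shrinks it after the leaf stages finish.
    spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")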

Data Skew

One more popular performance tuning tip is about data skew. Data skew is very annoying. You could see some long-running or frozen tasks, a lot of disk spilling, a very low resource utilization rate on most nodes, and even out-of-memory errors. The Spark community might tell you many different ways to solve such a typical performance problem. You can find the skewed values and rewrite the queries to handle them separately. You can also add extra keys, either new columns or some existing columns, to spread out the data skew. Either way, you have to manually rewrite your queries, which is annoying and sensitive to your workloads, which could change over time.

This is an example without the skew optimization. Because of data skew, after the shuffle, the shuffle partition A0 will be very large. If we do a join on these two tables, the whole performance bottleneck is joining the values for this specific partition, A0. For this partition, the cost of shuffle, sort, and merge is much bigger than for the other partitions. Everyone is waiting for partition A0 to complete, which slows down the execution of the whole query.

Adaptive query execution can handle the data skew case very well. After executing the leaf stages (stage one and stage two), we can optimize the query with a skew shuffle reader. Basically, it will split the skewed partitions into smaller subpartitions once we realize some shuffle partitions are too big.

Let us use the same example to show how to resolve it using adaptive query execution. After realizing the partition is too large, AQE will add a skew reader to automatically split table A's partition 0 into three segments: split 0, split 1, and split 2. It will also duplicate the other side for table B, so we have three copies of table B's partition 0.

After this step, we can parallelize the shuffle reading, sorting, and merging for the split partition A0. We avoid generating a very big partition for the sort merge join, so overall it will be much faster.
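
A sketch of the configs that control this skew handling (the threshold values shown are illustrative, not recommendations):

    # Let AQE split skewed shuffle partitions at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    # A partition is treated as skewed when it exceeds both this factor times the
    # median partition size and the byte threshold below.
    spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")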

Based on a terabyte-scale TPC-DS benchmark without statistics, Spark 3.0 can make Q7 eight times faster, achieve a two-times speedup for Q5, and deliver more than a 1.1x speedup for another 26 queries. This is just the beginning. In future releases, we will continue to improve the compiler and introduce more adaptive rules.

Dynamic Partition Pruning

The second performance feature I want to highlight is dynamic partition pruning. This is another runtime optimization rule. Basically, dynamic partition pruning avoids partition scanning based on the query results of the other query fragments. It is important for star-schema queries, and we can achieve a significant speedup in TPC-DS queries.

Here are the numbers: in the TPC-DS benchmark, 60 out of 102 queries show a significant speedup of between 2x and 18x. The idea is to prune the partitions the join reads from the fact table T1 by identifying the partitions that result from filtering the dimension table T2.

Let us explain it step by step. First, we do the filter pushdown on the left side. On the right side, we can generate a new filter on the partition column, because the join key is a partition column. Then, once we get the query results of the left side, we can reuse them to generate a list of constant values and build an IN filter. Now we can push down this IN filter on the right side, which avoids scanning all the partitions of the huge fact table T1. For this example, we can avoid scanning 90% of the partitions. With dynamic partition pruning, we can achieve a 33-times speedup.
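
To make this concrete, here is a sketch of a star-schema query that can benefit; the table and column names are hypothetical, and the flag shown is already on by default in Spark 3.0:

    # Dynamic partition pruning is enabled by default in Spark 3.0.
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

    spark.sql("""
        SELECT s.*, d.year
        FROM fact_sales s                      -- large fact table, partitioned by date_key
        JOIN dim_date d ON s.date_key = d.date_key
        WHERE d.year = 2020                    -- filtering the dimension prunes fact partitions at runtime
    """)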

JOIN Optimizer Hints

So the last performance feature is join hints. Join hints are very common optimizer hints; they influence the optimizer to choose the expected join strategy. Previously, we already had the broadcast hash join hint. In this release, we add hints for the other three join strategies: sort merge join, shuffle hash join, and shuffle nested loop join.

Please remember, this should be used very carefully. It is difficult to manage over time because it is sensitive to your workloads. If your workload patterns are not stable, a hint could even make your query much slower.

Here are examples of how to use these hints in SQL queries; you can also do the same thing in the DataFrame API (see the sketch after the list below). When we decide between the join strategies, the trade-offs are different:

  • A broadcast hash join requires one side to be small. There is no shuffle and no sort, so it performs very fast.
  • A shuffle hash join needs to shuffle the data, but no sort is needed. It can handle large tables but will still hit out-of-memory errors if the data is skewed.
  • A sort merge join is much more robust; it can handle any data size. It needs to shuffle and sort the data, so it is slower in most cases than a broadcast hash join when one of the tables is small.
  • A shuffle nested loop join does not require join keys, unlike the other three join strategies.
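
Here is a sketch of the hint syntax in both SQL and the DataFrame API (the tables t1, t2 and the DataFrames df1, df2 are hypothetical):

    # SQL join hints added in Spark 3.0 (BROADCAST already existed).
    spark.sql("SELECT /*+ MERGE(t1) */ * FROM t1 JOIN t2 ON t1.key = t2.key")                # sort merge join
    spark.sql("SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 JOIN t2 ON t1.key = t2.key")         # shuffle hash join
    spark.sql("SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 JOIN t2 ON t1.key = t2.key") # shuffle nested loop join

    # The DataFrame API equivalent uses hint() on one side of the join.
    df1.join(df2.hint("shuffle_hash"), "key")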

Richer APIs: New Features and Simplified Development

To enable new use cases and simplify Spark application development, this release delivers new capabilities and enhances existing features.

Pandas UDF

Let's first talk about Pandas UDFs. This is a pretty popular performance feature for PySpark users.

So let us talk about the history of UDF support in PySpark. In the first release of Python support, in 2013, we already supported Python lambda functions for the RDD API. Then in 2014, users could register Python UDFs for Spark SQL. Starting from Spark 2.0, Python UDF registration is session-based. The next year, users could register and use Java UDFs in the Python API. In 2018, we introduced Pandas UDFs. In this release, we redesigned the Pandas UDF interface using Python type hints and added more types of Pandas UDFs.

While keeping compatibility with the old Pandas UDFs from Apache Spark 2.3 and above, on Python 3.6 and above, Python type hints such as pandas.Series, pandas.DataFrame, Tuple, and Iterator can be used to express the new Pandas UDF types. For example, in Spark 2.3 we had the Scalar Pandas UDF, whose input is a pandas.Series and whose output is also a pandas.Series. In Spark 3.0, we do not require users to remember any UDF types; you just need to specify the input and output types. In Spark 2.3, we also had the Grouped Map Pandas UDF, whose input is a pandas.DataFrame and whose output is also a pandas.DataFrame.
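
For instance, here is a minimal sketch of the new type-hint-based Scalar Pandas UDF (assuming an active SparkSession named spark):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # The UDF type is inferred from the pandas type hints; no explicit PandasUDFType is needed.
    @pandas_udf("double")
    def plus_one(s: pd.Series) -> pd.Series:
        return s + 1

    df = spark.range(3).selectExpr("cast(id as double) as v")
    df.select(plus_one("v")).show()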

Old vs New Pandas UDF interface

This slide shows the difference between the old and the new interface; they do the same thing. The new interface can also be used for the existing Grouped Aggregate Pandas UDFs. In addition, the old Pandas UDFs were split into two API categories: Pandas UDFs and Pandas function APIs. You can treat Pandas UDFs in the same way that you use other PySpark column instances.

For example, here we calculate the values by calling the Pandas UDF named calculate on a column. We also support the new Pandas UDF types, from Iterator of Series to Iterator of Series, and from Iterator of multiple Series to Iterator of Series. These are useful for expensive state initialization in your Pandas UDFs and also useful for data prefetching.

However, you cannot use Pandas function APIs as column instances. Here are two examples: the map Pandas function API and the cogrouped map Pandas function API. These APIs are newly added in this release.
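
A sketch of both function APIs; the DataFrames, column names, and result schemas here are hypothetical:

    import pandas as pd

    # mapInPandas: iterate over batches of pandas DataFrames and yield transformed batches.
    def keep_large(batches):
        for pdf in batches:
            yield pdf[pdf["amount"] > 100]

    filtered = df.mapInPandas(keep_large, schema=df.schema)

    # Cogrouped map: process two grouped DataFrames together with pandas.
    def asof_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
        return pd.merge_asof(left, right, on="time", by="id")

    joined = df1.groupBy("id").cogroup(df2.groupBy("id")).applyInPandas(
        asof_join, schema="time long, id int, v1 double, v2 double")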

Back to Wenchen

So next, Wenchen will go over the remaining features and provide a deep dive into accelerator-aware scheduling. Please welcome Wenchen.

Thanks, Xiao, for the first half of the talk. Now, let me take over from here and introduce the remaining Spark 3.0 features.

Accelerator-aware Scheduling

I will start with accelerator-aware scheduling. At the 2018 Spark Summit, we announced the new project, Project Hydrogen. Accelerator-aware scheduling is part of this project and can be widely used for accelerating special workloads. In this release, we support the standalone, YARN, and Kubernetes scheduler backends. So far, users need to specify the required resources using application-level configs.

In the future, we will support the job, stage, and task levels. To understand this feature further, let's look at the workflow. Ideally, the cluster manager should be able to automatically discover resources, like GPUs. When the user submits an application with a resource request, Spark should pass the resource request to the cluster manager, and the cluster manager cooperates to allocate and launch executors with the required resources. After a Spark job is submitted, Spark should schedule tasks on the available executors, and the cluster manager should track the resource usage and perform dynamic resource allocation.

For example, when there are too many pending tasks, the cluster manager should allocate more executors to run more tasks at the same time. When a task is running, the user should be able to retrieve the assigned resources and use them in their code. Meanwhile, the cluster manager should monitor and recover failed executors. Now, let's look at how a cluster manager can discover resources and how users can request resources.

As an admin of the cluster, I can specify a script to auto-discover accelerators. The discovery script can be specified separately for the driver and the executors. We also provide an example script to auto-discover Nvidia GPU resources; you can adjust it for other kinds of resources. Then, as a user of Spark, I can request resources at the application level. I can use the config spark.executor.resource.{resourceName}.amount, and the corresponding config for the driver, to specify the resource amount on the driver and the executors. I can also use the config spark.task.resource.{resourceName}.amount to specify the resources required by each task. As I mentioned earlier, we will support more fine-grained levels later, like the job or stage level. Please stay tuned.
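
As a sketch, the application-level request looks like this (the amounts and the discovery script path are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.executor.resource.gpu.amount", "2")   # GPUs per executor
             .config("spark.task.resource.gpu.amount", "1")       # GPUs per task
             .config("spark.executor.resource.gpu.discoveryScript",
                     "/opt/spark/scripts/getGpusResources.sh")    # script that reports available GPUs
             .getOrCreate())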

Retrieve Assigned Accelerators

Next, we'll see how you can leverage the assigned accelerators to actually execute your workloads, which is probably the most important part for users. As a user of Spark, I can retrieve the assigned accelerators from the task context. Here is an example in PySpark. The resources call on the task context returns a map from resource name to resource info. In the example, we requested GPUs, so we can take the GPU addresses from the resource map and then launch TensorFlow to train the model with those GPUs. Spark takes care of the resource allocation and scheduling and also monitors the execution for failure recovery, which makes my life much easier.
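
A minimal sketch of that pattern in PySpark (the training body is omitted; it assumes GPUs were requested via the configs above):

    from pyspark import TaskContext

    def train_partition(rows):
        ctx = TaskContext.get()
        gpus = ctx.resources()["gpu"].addresses   # e.g. ["0", "1"]
        # ... pin TensorFlow to these GPU addresses and train on this partition ...
        return []

    spark.sparkContext.parallelize(range(8), 4).mapPartitions(train_partition).collect()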

Cluster Manager Support

As I mentioned earlier, accelerator-aware scheduling support has been added to the standalone, YARN, and Kubernetes cluster managers. You can check the Spark JIRA tickets for more details. Unfortunately, Mesos support is still not available. We'd really appreciate it if any Mesos expert has interest and is willing to help the Spark community add Mesos support. Please leave a comment in the JIRA ticket if you want to work on it. Thanks in advance.

Improved Spark Web UI for Accelerators

Last but not least, we also improved the Spark web UI to show all the discovered resources on the executors page. On this page, we can see that there are GPUs available on executor 1. You can check the web UI to see how many accelerators are available in the cluster, so you can better schedule your jobs.

32 New Built-in Functions

In this release, we also introduced 32 new built-in functions and added higher-order functions in the Scala API. The Spark community pays a lot of attention to compatibility. We have investigated many other ecosystems, like PostgreSQL, and implemented many commonly used functions in Spark.

Hopefully, these new built-in functions can make it faster to build your queries, as you don't need to waste time writing a lot of UDFs.

Due to the time limitations, I can't go over all the functions here, so let me just introduce some map type functions as an example.

When you deal with map-type values, it's common to get the keys and values of the map as an array. There are two functions, map_keys and map_values, that can do this for you. The example is from a Databricks Runtime notebook. Or you may want to do something more complicated, like creating a new map by transforming the keys and values of the original map. There are two functions, transform_keys and transform_values, that can do this for you; you just need to write a lambda function to specify the transformation logic.
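
A minimal sketch of these map functions through SQL expressions in PySpark (the data is made up):

    from pyspark.sql.functions import expr

    df = spark.createDataFrame([({"a": 1, "b": 2},)], "m map<string,int>")
    df.select(
        expr("map_keys(m)"),                            # ["a", "b"]
        expr("map_values(m)"),                          # [1, 2]
        expr("transform_keys(m, (k, v) -> upper(k))"),  # {"A": 1, "B": 2}
        expr("transform_values(m, (k, v) -> v * 10)"),  # {"a": 10, "b": 20}
    ).show(truncate=False)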

As I mentioned earlier, these functions also have Scala APIs besides the SQL API. Here is an example of how to do the same thing with the Scala API: you can just write a normal Scala function, which takes Column objects as input, to get the same effect as the SQL API.

Monitoring and Debuggability

This release also includes many enhancements that make monitoring more comprehensive and stable, and that make it easier to monitor and debug your Spark applications.

Structured Streaming UI

The first feature I will talk about is the new UI for Structured Streaming. Structured Streaming was initially introduced in Spark 2.0, and this release adds a dedicated Spark web UI for inspecting these streaming jobs. This UI offers two sets of statistics: one, summary information about completed streaming queries, and two, detailed statistics about each streaming query, including the input rate, process rate, input rows, batch duration, operation duration, and others. More specifically, the input rate and process rate tell you how many records per second the streaming source produces and the streaming engine processes. They give you a sense of whether the streaming engine is fast enough to process the continuous input data. Similarly, you can tell this from the batch duration as well: if many batches take more time than the micro-batch interval, the engine is not fast enough to process your data, and you may need to enable rate limiting to make the source produce data more slowly.

The operation duration is also a very useful metric. It tells you the time spent on each operation, so you can see where the bottleneck in your query is.

DDL and DML enhancements

We also have many enhancements to DDL and DML commands. Let me talk about the improvements to the EXPLAIN command as an example. This is a typical output of the EXPLAIN command: you have many operators in the query plan tree, and some operators have additional information. Reading plans is critical for understanding and tuning queries. The existing output is hard to read, and the string for each operator can be very wide or even truncated. It becomes wider and wider each release as we add more and more information to the operators to help debugging.

In this release, we enhance the EXPLAIN command with a new formatted mode and also provide the capability to dump plans to files. You can see it becomes much easier to read and understand. There is a very simple plan tree at the beginning, followed by a detailed section for each operator. This makes it very easy to get an overview of the query by looking at the plan tree, and also very easy to see the details of each operator, as the information is now stacked vertically. At the end, there is a section showing all the subqueries. In future releases, we will add more and more useful information for each operator.
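
A quick sketch of the new mode (the query itself is just illustrative):

    # DataFrame API: the mode argument was added in Spark 3.0.
    df = spark.range(100).where("id > 10").groupBy().count()
    df.explain(mode="formatted")

    # The equivalent SQL form:
    spark.sql("EXPLAIN FORMATTED SELECT count(*) FROM range(100) WHERE id > 10").show(truncate=False)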

In this release, we also introduced a new API to define your own metrics to observe data quality. Data quality is very important to many applications. It's usually easy to define metrics for data quality with an aggregate function, for example, but it used to be hard to calculate those metrics, especially for streaming queries.

For example, say you want to keep monitoring the data quality of your streaming source, and you simply define the metric as the percentage of error records. Then you do two things. One, call the observe method on the streaming DataFrame to define your metric with a name and a set of aggregate expressions. In this example, the name is "data quality", and the metric counts the error records and computes what percentage of the total records they are.

Two, you add a listener to watch the streaming progress events; when you catch your metric by name, do whatever you want with it, such as sending an email if more than 5% of the data is erroneous.

SQL Compatibility

Now, let's move to the next topic. SQL compatibility is also super critical for workloads migrated from other database systems to Spark SQL. In this release, we introduced the ANSI store assignment policy for table insertion, we added runtime overflow checking, we added support for ANSI reserved keywords in the parser, and we also switched to the widely used Proleptic Gregorian calendar, which follows the ISO and SQL standards.

Let's look at how the first two features can help you enforce data quality. Let me say more about store assignment. It's something like assigning a value to a variable in a programming language; in the SQL world, it is table insertion or upsert, which is kind of assigning values to a table's columns.

Now, let's see an example. Assume there is a table with two columns, i and j, of type int and type string. If we write an int value to the string column, it's totally okay, totally safe. However, if we write a string value to the int column, it's risky: the string value is very likely not in integer form, and Spark will fail and warn you about it.

If you do believe your string values are safe to be inserted into an int column, you can add a cast manually to bypass the type check in Spark.

We can also write long values to the int column, and Spark will do the overflow check at runtime. If your input data is invalid, Spark will throw an exception at runtime to tell you about it. In this example, the integer one is okay, but the larger value below can't fit into the integer type, and you'll receive an error about the overflow problem if you run this table insertion command.
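
A sketch of the relevant settings and the failing insert (the table name is hypothetical):

    # ANSI is the default store assignment policy in Spark 3.0; the ANSI flag adds further runtime checks.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
    spark.conf.set("spark.sql.ansi.enabled", "true")

    spark.sql("CREATE TABLE target (i INT, j STRING) USING parquet")
    spark.sql("INSERT INTO target VALUES (1, 'ok')")             # fits into INT, succeeds
    spark.sql("INSERT INTO target VALUES (2147483648, 'boom')")  # overflows INT, fails at runtime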

Built-in Data Source Enhancements

This release also enhances the built-in data sources. For example, for the Parquet data source, we now do nested column pruning and nested column filter pushdown. We also support filter pushdown for CSV files. This release also introduced a new binary file data source and a new no-op data source for testing and benchmarking. Let me further introduce the handling of nested columns in the Parquet and ORC data sources.

The first one is a kind of column pruning. For columnar file formats like Parquet and ORC, we can skip reading some column chunks in the blocks if they don't contain the columns we need. In Spark 3.0, this technique can be applied to nested columns as well. To check whether your query can benefit from this pruning or not, you can run the EXPLAIN command and see whether the read schema of the file scan node strips the unneeded nested columns. In this example, only one nested column is selected, so the read schema only contains A.

Filter pushdown is also a very popular technique. Similarly, you can check the EXPLAIN result and see whether the pushed filters of the file scan node contain the nested column filters. In this example, we do have a filter on the nested column A, and it does appear in the pushed filters, which means the pushdown happens in this query.
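
A sketch of how to check both in the formatted plan (the path and the nested column names are hypothetical):

    df = spark.read.parquet("/data/events")   # assume a struct column: payload struct<a: int, b: string>
    df.select("payload.a").where("payload.a > 0").explain(mode="formatted")
    # In the FileScan node of the output, check that the ReadSchema keeps only payload.a
    # and that the payload.a predicate appears under PushedFilters.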

Catalog plugin API

This release also expands other efforts on extensibility and the ecosystem, like the Data Source V2 API enhancements and Java 11, Hadoop 3, and Hive 2.3 support. Let's start with the catalog plugin API. This release extends the Data Source V2 API by adding the catalog plugin. The catalog plugin API allows users to register their own catalogs and take over the catalog and data operations from Spark. This gives end users a more seamless experience to access external tables: now end users can register a catalog once and manipulate the tables in it directly, whereas before, end users had to register each table. For example, let's say you have registered a MySQL connector catalog named mysql. You can use SELECT to get data from an existing MySQL table. You can also INSERT into a MySQL table with Spark SQL, and you can even create new tables in MySQL with Spark, which was just not possible before, because we didn't have the catalog plugin. Note that this example will be available in Spark 3.1, when we finish the JDBC data source v2.
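
A sketch of what plugging in such a catalog looks like; the implementation class and the catalog, database, and table names are all hypothetical:

    # Register a catalog implementation under the name "mysql".
    spark.conf.set("spark.sql.catalog.mysql", "com.example.spark.MySQLCatalog")

    spark.sql("SELECT * FROM mysql.shop.orders")                            # query through the catalog
    spark.sql("INSERT INTO mysql.shop.orders VALUES (1, 'widget')")
    spark.sql("CREATE TABLE mysql.shop.new_orders (id INT, name STRING)")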

When to use Data Source V2 API?

Some people may have a question: now Spark has both V1 and V2 APIs, so which one should I use? In general, we want everyone to eventually move to V2, but the V2 API is not ready yet, as we need more feedback to polish it. Here are some tips about when to pick the V2 API. If you want the catalog functionality, it has to be V2, because the V1 API doesn't have this ability. If you want to support both batch and streaming in your data source, then you should use V2, because in V1 the streaming and batch data sources are different APIs, which makes it harder to reuse code. And if you are sensitive to scan performance, you can try the V2 API, because it allows you to report the data partitioning to Spark to avoid unnecessary shuffles, and it also allows you to implement a vectorized reader for better performance.

If you don't care about this stuff and just want to build a data source once and never change it, please use V1, as V2 is not very stable yet.

Extensibility and Ecosystem

The ecosystem also evolves very fast. In this release, Spark can be better integrated into the ecosystem by supporting newer versions of common components, like Java 11, Hadoop 3, and Hive 2.3. I want to mention some breaking changes here.

Starting from this release, we only build Spark with Scala 2.12, so Scala 2.11 is no longer supported. And we deprecated Python 2, because it has reached end of life. On the download page, we provide builds of Spark with different Hadoop and Hive combinations. By default, it is Hadoop 2.7 with Hive 2.3 execution. There are another two combinations available: one is Hadoop 2.7 with Hive 1.2 execution, which is for people who can't upgrade their environments; the other is Hadoop 3.2 with Hive 2.3 execution, which is for people who want to try Hadoop 3. We also extended the support for different Hive metastore versions, from 0.12 to 3.1.

Documentation Improvements

Documentation improvements are the last exciting news I want to share with everyone. How to read and understand the web UI is a common question for many new Spark users. This is especially true for Spark SQL and Spark Structured Streaming users: they use the high-level APIs, so they usually don't know what the jobs and stages are. Also, the UI pages use many query and metric names, which are not very clear to many users. Starting from this release, we add a new documentation section about reading the web UI. It covers the job and stage pages, the executors page, and also the SQL and streaming pages. This is just a start; we will continue to enhance it. Next, the SQL reference.

Finally, this release adds a SQL reference for Spark SQL. Spark SQL is the most popular and important component in Spark; however, we did not have our own SQL reference to define the SQL semantics and detailed behaviors. Let me quickly go over the major chapters in the SQL reference. We have a page to explain the ANSI compliance of Spark. As I mentioned before, we improved SQL compatibility, but to avoid breaking existing queries, we made it optional: you can enable the ANSI compliance by turning on a flag.

There is also a page to explain the detailed semantics of each data type, so you know what it means and how it behaves. There is a page to explain the datetime patterns used by the formatting and parsing functions. There's also a page documenting each built-in function in Spark. We also have a page to explain the syntax of identifiers, that is, how to refer to a table or function. And there's a page to explain the syntax and semantics of each literal in Spark SQL.

Also, there's a page to explain the null semantics. Null is a very special value in Spark SQL and other SQL systems, so there must be a page explaining what null means in queries. We also have pages explaining the syntax for all the commands, like the DDL and DML commands; INSERT is also included in the documentation. In fact, SELECT has so many features that we have a page to explain all of them, and there are also pages for other auxiliary commands like SHOW TABLES.

Finally, the migration guides. This is another critical enhancement in the Spark 3.0 documentation. In this release, all the components have migration guides. When you upgrade your Spark version, you can read them carefully. You might be wondering why they are much longer than in previous versions; it's because we try to document all the important behavior changes you need to know about. If you run into an error or a confusing error message that is not documented, please open a ticket, and we will try to fix it in subsequent releases.

Spark is almost 10 years old now. The Spark community is very serious about behavior changes, and we try our best to avoid breaking changes. If you upgrade to Spark 3.0 and hit a behavior change, you may see an explicit error message about it. The error message also provides config names for you to either go back to the existing behavior or adopt the new behavior.

In this talk, we covered many exciting features and improvements in Spark 3.0. Due to the lack of time, there are still many other nice features not covered here. Please download Spark 3.0 and try it yourself.

You can also try it in the Databricks Runtime 7.0 beta, where all the new features are already available, and the Community Edition is free. Without the contributions of the whole community, it would be impossible to deliver such a successful release. Thanks to all the Spark committers all over the world. Thank you. Thank you, everyone.

Apache Spark Data Source V2—continues (Summit 2018)

As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements. 1) Generality: support reading/writing most data management/storage systems. 2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities. Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility. Session hashtag: #DDSAIS12

Apache Spark Data Source V2 (Summit 2018)

As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements. 1) Generality: support reading/writing most data management/storage systems. 2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities. Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility. Session hashtag: #DDSAIS12

Deep Dive into Spark SQL with Advanced Performance Tuning (Summit 2018)

Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance. Session hashtag: #Exp3SAIS

A Developer’s View into Spark’s Memory Model (Summit 2017)

As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark's backend execution and push performance closer to the limits of modern hardware. In this talk, we'll take a deep dive into Apache Spark's unified memory model and discuss how Spark exploits memory hierarchy and leverages application semantics to manage memory explicitly (both on and off-heap) to eliminate the overheads of JVM object model and garbage collection. Session hashtag: #SFdev25

A Developer’s View Into Spark’s Memory Model (Summit Europe 2017)

As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark’s backend execution and push performance closer to the limits of modern hardware. In this talk, we’ll take a deep dive into Apache Spark’s unified memory model and discuss how Spark exploits memory hierarchy and leverages application semantics to manage memory explicitly (both on and off-heap) to eliminate the overheads of JVM object model and garbage collection. Session hashtag: #EUdd2