You might have heard the famous saying, “Software is eating the world.” But if software is eating the world, you may ask, where does software come from?
From developers, naturally! Some software developers argue that “developers are eating the world.” A research report by Stripe indicates that “developers have the ability to raise global GDP by $3 trillion over the next 10 years.” Perhaps so.
But their dominance at data-driven companies in building data products that affect revenue is uncontested; their contributions to open-source projects on GitHub are unmatched; their presence at technical conferences is notable and influential; and their commitment to open-source community meetups is enduring.
In this blog, we highlight selected sessions by developers, for developers, that describe their work in combining the immense value of data and machine learning. The sessions span the Developer, Deep Dives, and Tutorials tracks.
Developer and Deep Dives
Naturally, let’s start with the Developer track. Messrs Martin Junghanns and Sören Reichardt of Neo4j will share a new contribution to Apache Spark 3.0: Extending Spark Graph for the Enterprise with Morpheus and Neo4j. This session introduces a new module for graphs in Spark and shows how to transform data into Property Graphs using Morpheus and Cypher APIs.
Related to Graphs in Apache Spark, Dr. Victor Lee and Songting Chen of TigerGraph will compare three options for using graphs in Spark: GraphX, Cypher for Apache Spark, and TigerGraph. Don’t miss their talk, Assessing Graph Solutions for Apache Spark.
Both contributions from the community enhance and extend Spark with graph capabilities.
Which brings us to Spark’s extensibility. Among the many features that attract developers to Spark, one is its extensibility through new language bindings, libraries, or extensions of its components. Messrs Terry Kim and Rahul Potharaju of Microsoft will explain how they extended Spark to include new .NET bindings in their talk: .NET bindings for Apache Spark.
Another session that shows Spark’s extensibility is a deep dive, Extending Spark SQL 2.4 with New Data Sources. Jacek Laskowski, an independent consultant and author of Apache Spark Internals, will show in a live coding session how developers can extend Spark SQL with new or customized data sources.
The new open-source project Delta Lake extends Apache Spark to add ACID reliability to Data Lakes. In the talk, Databricks Delta Lake and Its Benefits, Nitin Raj and Nagaraj Sengodan of Cognizant Worldwide Limited will share how Delta Lake APIs are completely compatible with Apache Spark and how its transactional capabilities bring reliability to Data Lakes.
For software developers interested in the internals and optimization of Apache Spark, a few sessions stand out. First, Apache Spark’s Built-in File Sources in Depth, from Databricks Spark committer Gengliang Wang. In Spark 3.0, all data sources are reimplemented using Data Source API v2. This session will explain what those are and how to optimally use them.
Second, Luca Canali, from CERN, will explain performance troubleshooting of distributed data processing and improvements in Apache Spark 3.0 in his talk, Performance Troubleshooting Using Apache Spark Metrics.
Third, Spark tuning and optimization require knowledge of which configurations to tweak for optimal resource utilization. Four sessions elaborate on the what, why, and how of Spark tuning: Apache Spark Core – Practical Optimization (Daniel Tomes of Databricks); Using Production Profiles to Guide Optimizations (Adam Barth of Facebook); The Parquet Format and Performance Optimization Opportunities (Boudewijn Braams of Databricks); and Internals of Speeding up PySpark with Arrow (Ruben Berenguel, big data consultant).
MLflow, Delta Lake, Koalas, and Morpheus Tutorials
First introduced as dedicated 90-minute hands-on sessions at Spark + AI Summit this year in San Francisco, the tutorials were a tremendous success in both attendance and technical content, so we want to make them part of the summit in Amsterdam too. Here are a few tutorials that are worth attending:
- Graph Features in Apache Spark 3.0: Integrating Graph Querying and Algorithms in Spark Graph
- Cosmos DB Real-time Advanced Analytics Workshop
- Koalas: pandas on Apache Spark
- Managing the Machine Learning Life Cycle with MLflow
- Building Reliable Data Pipelines with Delta Lake
You can also peruse and pick sessions from the schedule. If you have not registered for the summit, use the code “Jules20” for a 20% discount. In the next blog, we will share our picks from sessions in the Data Science, Deep Learning, Machine Learning, and AI Use Case tracks.