
Apache Spark continues to tackle new frontiers by unifying new workloads, enabling developers to combine data and AI to build intelligent applications. Developers come to this summit not only to hear about innovations from contributors; they also come to share their use cases and experiences and to absorb knowledge.

In this final blog, we shift our focus to the developers who make a difference, not only through their contributions to the Spark ecosystem but also through their use of Spark at scale in their respective industries.

Let’s start with large-scale feature aggregation using Apache Spark at Uber, which aggregates several thousand features that feed ML-based decision making and risk analysis. Developers Pulkit Bhanot and Amit Nene will trace the data’s journey through their architecture and the Spark ecosystem, and share how aggregated features reduce turnaround times for machine learning models.

As of 2018, Facebook has close to 2.18 billion users, roughly one-third of the world’s population. This global usage generates enormous volumes of data that demand a reliable processing architecture. Facebook software engineers Brian Cho and Ergin Seyfe will share how they handle shuffle reads when data reaches 300 TB, and what kind of architecture can support such scale. Find out in their talks: SOS: Optimizing Shuffle I/O and Taking Advantage of a Disaggregated Storage and Compute Architecture.

This session lives up to its über-scale billing: Efficiently Triaging CI Pipelines with Apache Spark: Mixing 52 Billion Events/Day of Streaming with 40 TB/Hour of Batch Processing. In this fascinating talk about Spark at scale, software engineer Ivan Jibaja from Pure Storage will share how to write a single application that serves both streaming and batch jobs at scale, and how to build state-of-the-art Continuous Integration (CI) pipelines, as his title suggests.
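To make that concrete, here is a minimal sketch (not Pure Storage’s actual code) of the pattern that lets one application serve both workloads: the same transformation function is applied to a static DataFrame for batch backfills and to a streaming DataFrame for live data. The paths, schema, and column names below are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("ci-triage-sketch").getOrCreate()

def triage_counts(events):
    # Count failing test events per suite in 5-minute windows; the logic is
    # identical whether `events` is a static or a streaming DataFrame.
    return (events
            .filter(col("status") == "FAIL")
            .groupBy(window(col("event_time"), "5 minutes"), col("suite"))
            .count()
            .withColumnRenamed("count", "failures"))

# Batch: one-off or backfill processing over historical event files.
batch_events = spark.read.parquet("/data/ci/events/")
triage_counts(batch_events).write.mode("overwrite").parquet("/data/ci/failure_counts/")

# Streaming: the same function applied to a continuously arriving source.
stream_events = (spark.readStream
                 .schema(batch_events.schema)
                 .parquet("/data/ci/events_incoming/"))
query = (triage_counts(stream_events)
         .writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("failure_counts_live")
         .start())
```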

The human genome and its sequencing have been at the forefront of work by scientists in the Health and Life Sciences (HLS), thanks to advances in big data analytics at scale, in particular Spark’s ability to process distributed data and the availability of cheap cloud storage. Software engineers Ram Sriharsha and Frank Austin Nothaft from Databricks have a novel solution for building genomic ETL pipelines in the cloud atop Apache Spark. Whether you are a biochemist, a molecular biologist, or a developer in the HLS industry, you will want to attend their session: Scaling Genomics Pipelines in the Cloud.

For developers interested in the design motivations behind the evolution of Spark’s DataSource v2 APIs in Apache Spark 2.3, this deep-dive session from Databricks Spark committer Wenchen Fan and contributor Gengliang Wang is for you. One notable use of the new source and sink APIs is Continuous Processing in Structured Streaming, which Jose Torres will cover in a separate session. Also related to Structured Streaming is another immersive talk, Deep Dive into Stateful Stream Processing in Structured Streaming, from Spark committer Tathagata Das.
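For a quick taste of what continuous processing looks like from the API side, here is a minimal sketch using Spark’s built-in rate source and console sink (the trigger interval and options are illustrative assumptions, not from the talks); the only change from a micro-batch query is the continuous trigger.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-processing-sketch").getOrCreate()

# The rate source generates (timestamp, value) rows and, along with Kafka,
# is one of the sources supported by continuous processing in Spark 2.3.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Only map-like operations (select, filter, etc.) are allowed in continuous
# mode; the continuous trigger sets the checkpoint epoch, not a batch size.
query = (events.selectExpr("value * 2 AS doubled", "timestamp")
         .writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/continuous-demo-ckpt")
         .trigger(continuous="1 second")
         .start())
```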

One of the big community contributions to Apache Spark 2.3 was the ability to run Spark natively on Kubernetes, using a Kubernetes-native scheduler built into Spark. In their session, Apache Spark on Kubernetes Clusters, Anirudh Ramanathan and Sean Suchter will not only discuss how to build modern data pipelines in a Kubernetes-native way but also unravel the future roadmap for the native scheduler within Apache Spark.

Another community contribution in Spark 2.3 is the Pandas UDF in PySpark, which developer Jin Lin will cover in his session: Vectorized UDF: Scalable Analysis with Python and PySpark.
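For a flavor of the API, here is a minimal sketch of a scalar Pandas UDF; the column names and the Celsius-to-Fahrenheit conversion are illustrative, and PyArrow must be installed. The UDF operates on pandas.Series in Arrow batches rather than one row at a time, which is where the speedup over row-at-a-time Python UDFs comes from.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()

@pandas_udf("double", PandasUDFType.SCALAR)
def celsius_to_fahrenheit(c):
    # `c` is a pandas.Series covering a whole Arrow batch, so this arithmetic
    # is vectorized instead of being invoked once per row.
    return c * 9.0 / 5.0 + 32.0

df = spark.range(0, 100).withColumn("celsius", (col("id") % 40).cast("double"))
df.withColumn("fahrenheit", celsius_to_fahrenheit(col("celsius"))).show(5)
```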

Processing data at scale with Spark is the underlying theme of these sessions. Consider Apple’s case: the need to handle data at speed and scale led them to replace and augment traditional MapReduce workloads with Spark. In their talk, Apache Spark at Apple, software developers Sam Maclennan and Vishwanath Lakkundi will cover the challenges of working at scale and the lessons learned from managing large multi-tenant clusters with exabytes of storage and millions of cores.

Finally, to understand what blockchain is and why it matters, MIT Technology Review dedicated a quarterly issue to the subject. In the essay In Blockchain We Trust, its authors argue that you have to look beyond the wild speculation and focus on what’s being built underneath. Even better, you can find out how to analyze that underlying technology and its transactions with Apache Spark in the session Analyzing Blockchain Transactions in Apache Spark from software developer Jiri Kremser: a fascinating talk, to say the least!

What’s Next

Take advantage of this promo code JulesPicks for a $300 discount and register now. Come find out what’s new in Spark, Data, and AI, and see you in San Francisco!
