
“Dublin is now a truly cosmopolitan capital, with an influx of people, energy, and ideas infusing the ever-beguiling, multi-layered city with fresh flavors and kaleidoscopic colors,” writes the Lonely Planet. Add to this multi-layered city three days of Spark Summit, held at the Guinness barrel-shaped Dublin Convention Center, and you get an infusion of visionary ideas about the future of big data and Artificial Intelligence (AI) from speakers and attendees crossing oceans and traversing lands.

Despite the specters of Hurricane Ophelia and Storm Brian, the third Spark Summit in Europe drew a record attendance of over 1,100 Spark enthusiasts from across the globe, allowing attendees to meet, mingle, and mentor. With 102 track sessions and 3 Apache Spark courses, conference attendees had a spectrum of Spark topics to select from; over 320 attendees learned about deep learning, machine learning, and tuning and configuring Spark by enrolling in the Spark training courses offered by the Databricks training team.

As a Spark community advocate, I was heartened to see such a huge turnout at the pre-summit Spark Meetup, too. With 620 Meetup groups globally and 437K members, we are undoubtedly a global Spark community, and such mass gatherings at summits are a testament to Spark’s universal growth, adoption, maturity, and usage across many industries around the globe.

One Spark meetup organizer, Jean-Georges Perrin, captured the summit’s essence and convergence with this tweet:

In this blog, we have selected a few favorite voices from the Spark community and Databricks and identified trends emerging for the future of Spark.

Simplifying AI with Deep Learning Pipelines

From the outset, the Databricks founders were committed to the vision of making big data simple by providing developers with high-level APIs that make difficult things easier and possible: first with the structured Spark APIs, then with Structured Streaming, and now with Deep Learning Pipelines.

“Our philosophy has always been to make Spark simple and a unified engine with composable, high-level APIs so that other fast-emerging workloads can be easily integrated into the engine,” said Matei Zaharia, Databricks co-founder and the creator of Spark.

He explained why streaming and deep learning workloads are complex and elaborated on how Databricks, working with the community, has simplified them in Apache Spark. Instead of stringing together a myriad of streaming engines for their workloads, developers can use high-level APIs to build end-to-end streaming applications, making them far more productive and freeing them from the shackles of configuring and managing clusters. The same is true of the Deep Learning Pipelines APIs.
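To make that concrete, here is a minimal sketch (not code from the keynote) of what an end-to-end Structured Streaming job looks like in PySpark; the Kafka broker, topic, and column names below are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Read an unbounded stream of events from Kafka (broker and topic are placeholders).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Declare the computation as ordinary DataFrame operations;
# Spark executes it incrementally as new data arrives.
counts = events.groupBy(window(events.timestamp, "5 minutes")).count()

# Write the running counts out; swap "console" for a real sink in production.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```

Running this requires the spark-sql-kafka package on the classpath, but the application code itself stays at the DataFrame level; there is no separate streaming engine to configure.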

To show how easy the integration is, Sue Ann Hong, software engineer and co-author of Deep Learning Pipelines, wrote a deep learning application live in 7 lines of code, in under 10 minutes, and with 0 labels. Sue Ann Hong and Tim Hunter, creators of Deep Learning Pipelines (DLP), further elaborated on the what, why, and how of DLP in a deep-dive session to a packed house.
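As a rough illustration of that simplicity (a sketch, not the exact demo code), applying a pre-trained image model with the sparkdl package needs no labels at all; the image directory path below is a placeholder:

```python
from sparkdl import DeepImagePredictor, readImages

# Load a directory of images into a Spark DataFrame (path is a placeholder).
images_df = readImages("/data/images")

# Apply a pre-trained InceptionV3 model directly -- no labels, no training.
predictor = DeepImagePredictor(inputCol="image",
                               outputCol="predicted_labels",
                               modelName="InceptionV3",
                               decodePredictions=True,
                               topK=5)

predictions = predictor.transform(images_df)
predictions.select("predicted_labels").show(truncate=False)
```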

In addition, Matei highlighted community efforts and contributions in Spark 2.2 and 2.3, including the cost-based optimizer, major performance and packaging improvements for PySpark, and Kubernetes support.

Simplifying Data Architectures with Databricks Delta

Continuing with the philosophy of unification and all the merits it affords to big data practitioners, Ali Ghodsi, CEO and co-founder of Databricks, announced Databricks Delta, a new unified data management system for real-time data, as part of the Unified Analytics Platform.

Ghodsi emphasized in his keynote that Databricks Delta achieves three goals: (a) the reliability and query performance of a data warehouse; (b) the ability to query structured data at the speed of streaming systems; and (c) the scalability and cost-efficiency of a data lake.

“Because Delta is a unified data management system that handles both low-latency streaming data and batch processes, it allows organizations to dramatically simplify their data architectures,” Ghodsi explained.
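As a hedged sketch of what that unification implies (illustrative only; the schema, paths, and exact Delta syntax here are assumptions rather than code from the keynote), a single Delta table can take low-latency streaming appends while batch jobs query the same data:

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

# Schema for incoming records (an assumption for this sketch).
schema = StructType([
    StructField("device", StringType()),
    StructField("ts", TimestampType()),
    StructField("reading", DoubleType()),
])

# Streaming ingest: continuously append incoming JSON records into a Delta table.
stream = spark.readStream.schema(schema).json("/incoming/events")
(stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/delta/events/_checkpoint")
 .start("/delta/events"))

# Batch analytics: the very same table is queryable with ordinary Spark code.
(spark.read.format("delta").load("/delta/events")
 .groupBy("device").avg("reading")
 .show())
```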

Michael Armbrust, lead software engineer for Delta, showed the power and potential of this unified data management system with a real-world use case: Databricks Delta is currently used at a Fortune 100 company for an information security application that processes trillions of records a day, dramatically simplifying its data architecture.

Highlighting Spark Community Voices

At CERN, the birthplace of the web, where Tim Berners-Lee created it, the data generated by the particle accelerator are voluminous, explained Jakub Wozniak, senior software engineer in CERN’s Beams department. The magnets, devices, controls, and colliding protons generate over 1 PB of raw physics event data per second as particles race around the world’s largest accelerator. How do you capture, analyze, and visualize data at this scale?

After trying and comparing other systems, Wozniak said, CERN chose Spark for data extraction and analysis because of its distributed processing capabilities at scale. In his keynote, he explained why.

Closely related to CERN’s need to analyze voluminous data at scale and speed is the need to monitor the performance of Spark jobs. Luca Canali, lead engineer for CERN’s Hadoop, Spark, and database services, gave a dense, information-packed talk on troubleshooting Spark performance at scale and the challenges, tools, and methodologies employed at CERN. This theme of monitoring and tuning Spark workloads was echoed in another talk, by Jacek Laskowski.

For Jim Downing, this summit in Dublin was a homecoming. A native of Ireland and a graduate of Trinity College, Dublin, Downing presented on TensorFlow and Spark workflows. He analyzed different frameworks for integrating Spark with TensorFlow, from TensorFrames to TensorFlowOnSpark to Databricks’ Deep Learning Pipelines, and suggested how AI will evolve over time.

But what struck me most was his speculation about which AI event will not stand out when the history of AI is written a few decades from now, much as one event stands out in the history of gravity: Edmond Halley’s 1684 visit to Isaac Newton, when he posed the question, “What type of curve does a planet describe in its orbit around the sun, assuming an inverse-square law of attraction?”

In AI’s case, Downing speculated that one event everyone is talking about today that won’t make it into that history is Facebook’s account of “How do we scale AI?”

The Women in Big Data lunch, presented by Jessica McCarthy and Marina B. Alekseeva, charted the evolution of this global and pivotal movement to inspire and encourage women’s journeys into the field of big data. And in her second-day keynote, Low Touch Machine Learning, Leah McGuire reminded us how important the human factor is in the AI world of big data applications.

Finally, one emerging theme that was quite apparent at this Spark Summit was scale, scale, and scale: whether your use case is streaming and storing 1 trillion records a day with Databricks Delta, examining 1 PB of raw physics event data per second at CERN, or building complex deep learning and machine learning models at Salesforce. Add to this thematic story of scale the fascinating account of building a scalable global geopolitical news scanner, scraper, and aggregator using Structured Streaming, K-means machine learning, Apache Kafka, and Apache NiFi.

What’s Next

Since nobody can attend all the talks at the summit, we have posted all the videos and slides on the Spark Summit website. Please peruse them at your leisure.

If you missed your Spark moment in Dublin, you can catch the next one in San Francisco in June 2018. Even better, consider submitting a talk: the call for presentations (CfP) for the next Spark Summit will be announced shortly, so stay tuned.