Hossein Falaki is a tech lead at Databricks, working on machine learning infrastructure. Prior to joining Databricks, he was a data scientist on Apple's personal assistant, Siri. He holds a Ph.D. in Computer Science from UCLA and a Master's in Computer Science from the University of Waterloo.
We all know the unprecedented potential impact of machine learning. But how do you take advantage of the myriad data and ML tools now available? How do you streamline processes, speed up discovery, share knowledge, and scale up implementations for real-life scenarios? In this talk, we'll cover some of the latest innovations brought into the Databricks Unified Analytics Platform for Machine Learning. In particular, we will show you how to:

- Get started quickly with the Databricks Runtime for Machine Learning, which provides pre-configured Databricks clusters including the most popular ML frameworks and libraries, Conda support, performance optimizations, and more.
- Get started with the most popular deep learning frameworks within a few minutes, and go deep with state-of-the-art DL model diagnostics tools.
- Scale up deep learning training workloads from a single machine to large clusters for the most demanding applications, with ease, using the new HorovodRunner.
- Expose all of these ML frameworks to large, distributed data using the Databricks Runtime for Machine Learning.
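The HorovodRunner workflow mentioned above can be sketched roughly as follows. This is a hedged illustration, not the talk's actual code: the `train` function and its `learning_rate` parameter are placeholders, and the `HorovodRunner` calls (shown in comments) assume a Databricks ML Runtime cluster, where `sparkdl` is available.

```python
def train(learning_rate=0.1):
    # Placeholder single-machine training loop. A real job would build a
    # Keras or PyTorch model here and wrap its optimizer with Horovod so
    # gradients are averaged across workers.
    loss = 1.0
    for _ in range(10):
        loss *= (1.0 - learning_rate)
    return loss

# On a Databricks cluster, the same function scales out with one wrapper:
#   from sparkdl import HorovodRunner
#   hr = HorovodRunner(np=2)          # np = number of worker processes
#   hr.run(train, learning_rate=0.1)  # runs `train` under Horovod/MPI
```

The appeal of this pattern is that the training function is written once and tested on a single machine; only the launcher changes when moving to a cluster.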
Visualizing large data is challenging: there are more data points than available pixels, and manipulating distributed data can take a long time. One way to address these challenges is to use custom rendering engines. But we think that, with Spark, we can apply off-the-shelf, open source visualization tools such as D3, Matplotlib, and ggplot to very large data. This approach has several benefits. First, data scientists are already familiar with these tools. Second, the output of these tools can be readily shared with others on the web. Finally, separating data manipulation from rendering lets users freely choose the best tool for the job; for example, if a graph needs to be interactive, D3 is a better choice than Matplotlib.

Apache Spark comes ready for this task. It enables interactive analysis of big data by reducing query latency to the range of human interaction times through caching. Additionally, Spark's unified programming model and diverse programming interfaces enable smooth integration with popular visualization tools. We can use these to perform both exploratory and expository visualization over large data. In this talk we will introduce the relevant Spark APIs for sampling and manipulating large data, and demonstrate how they can be integrated with D3 and Matplotlib for end-to-end data visualization.
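The sample-then-render pattern the abstract describes can be sketched as below. This is an illustration, not the speaker's code: the `downsample` helper mimics the Bernoulli semantics of Spark's `DataFrame.sample(fraction=...)` with a plain Python list so the pattern is clear without a cluster, and the PySpark/Matplotlib calls it stands in for are shown in comments.

```python
import random

def downsample(rows, fraction, seed=42):
    """Bernoulli sampling: keep each row independently with probability
    `fraction`, the same semantics as Spark's DataFrame.sample()."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

# On a Spark cluster, the equivalent data-manipulation step would be:
#   small = df.sample(fraction=0.001, seed=42).toPandas()
#   small.plot.scatter(x="x", y="y")   # rendered by Matplotlib via pandas
# or the sampled result could be serialized to JSON and handed to D3.

big = list(range(1_000_000))            # stand-in for a distributed dataset
small = downsample(big, fraction=0.001)  # ~1,000 rows: plottable anywhere
```

The point is the separation of concerns: Spark reduces the data to something a single machine can hold, and any off-the-shelf tool handles the rendering.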