Lawrence Spracklen leads engineering at Alpine Data, where he is responsible for the development of Alpine's advanced analytics platform, which makes extensive use of Spark. Prior to joining Alpine, Lawrence worked at Ayasdi as both VP of Engineering and Chief Architect. Before that, Lawrence spent over a decade at Sun Microsystems, Nvidia, and VMware, where he led teams focused on the intersection of hardware architecture and software performance and scalability. Lawrence holds a Ph.D. in Electronic Engineering from the University of Aberdeen and a B.Sc. in Computational Physics from the University of York, and has been issued over 40 US patents.
Topological data analysis (TDA) has been a hot topic in the data science community for the last 10 years. The TDA Mapper algorithm presents a mathematically elegant way to investigate the structure of complex high-dimensional datasets and isolate localized patterns that are amenable to statistical modeling. While open-source implementations of the core TDA Mapper algorithm exist, they are written in languages such as R, MATLAB, and Python, and as a result are limited to small datasets, single-node execution, and modest performance. In this talk, we present the first open-source scalable implementation of the Mapper algorithm for topological data analysis using Spark. We present the results of our tests on enterprise-scale datasets and highlight why a Spark implementation is a prerequisite for widespread adoption in enterprise data science. Finally, we discuss the challenges faced and the key insights gleaned from this exercise in scaling algorithms to enterprise data.

Key takeaways:
- An open-source, scalable implementation of TDA Mapper
- Key learnings associated with scaling complex analytics to terabyte-scale datasets
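The abstract does not include the Spark implementation itself, but the core Mapper pipeline it refers to — apply a filter function, cover the filter range with overlapping intervals, cluster the points falling in each interval, then build a graph whose nodes are clusters and whose edges join clusters sharing points — can be sketched in plain Python. The filter (x-coordinate), interval count, overlap fraction, and single-linkage threshold `eps` below are illustrative choices for a minimal sketch, not the talk's actual parameters:

```python
from collections import defaultdict

def mapper(points, num_intervals=4, overlap=0.25, eps=1.5):
    """Minimal Mapper sketch over 2-D points.

    Filter: the x-coordinate. Cover: overlapping intervals spanning the
    filter range. Clustering: single-linkage with distance threshold eps.
    Returns (nodes, edges), where each node is a frozenset of point
    indices and an edge joins two nodes that share at least one point.
    """
    xs = [p[0] for p in points]
    lo, hi = min(xs), max(xs)
    length = (hi - lo) / num_intervals
    pad = length * overlap  # widen each interval so neighbors overlap

    nodes = []
    for i in range(num_intervals):
        a = lo + i * length - pad
        b = lo + (i + 1) * length + pad
        idx = [j for j, p in enumerate(points) if a <= p[0] <= b]

        # Single-linkage clustering via union-find over close pairs.
        parent = {j: j for j in idx}

        def find(j):
            while parent[j] != j:
                parent[j] = parent[parent[j]]
                j = parent[j]
            return j

        for u in idx:
            for v in idx:
                if u < v:
                    d2 = (points[u][0] - points[v][0]) ** 2 \
                       + (points[u][1] - points[v][1]) ** 2
                    if d2 <= eps * eps:
                        parent[find(u)] = find(v)

        clusters = defaultdict(set)
        for j in idx:
            clusters[find(j)].add(j)
        nodes.extend(frozenset(c) for c in clusters.values())

    edges = {(m, n)
             for m in range(len(nodes))
             for n in range(m + 1, len(nodes))
             if nodes[m] & nodes[n]}
    return nodes, edges
```

The structure of the overlap is what makes the output a graph rather than a partition: a point lying in two intervals can belong to two clusters, and that shared membership becomes an edge. A scalable version distributes the per-interval clustering, which is the naturally parallel part of the algorithm.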
While the performance delivered by Spark has enabled data scientists to undertake sophisticated analyses of big and complex data in actionable timeframes, the process of manually configuring the underlying Spark jobs (including the number and size of the executors) is too often a significant and time-consuming undertaking. Not only does this configuration process typically rely heavily on repeated trial and error, it also requires that data scientists have a low-level understanding of Spark and detailed cluster sizing information. At Alpine Data we have been working to eliminate this requirement and to develop algorithms that automatically tune Spark jobs with minimal user involvement. In this presentation, we discuss the algorithms we have developed and illustrate how they leverage information about the size of the data being analyzed, the analytical operations used in the flow, and the cluster's size, configuration, and real-time utilization to automatically determine the optimal Spark job configuration for peak performance.
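The abstract does not spell out Alpine's tuning algorithms, but the kind of sizing rule it describes — deriving executor count, executor memory, and parallelism from the input size and cluster resources — can be illustrated with a simple heuristic. The function below is a sketch under common rules of thumb (roughly five cores per executor, memory overhead reserved off-heap, around 128 MB per partition), not Alpine's actual algorithm; the parameter names and defaults are assumptions:

```python
def suggest_spark_config(input_gb, node_mem_gb, node_cores, num_nodes,
                         cores_per_executor=5, mem_overhead=0.10):
    """Illustrative heuristic (not Alpine's actual algorithm) for sizing
    Spark executors from the input data size and cluster resources."""
    # Leave one core and ~1 GB per node for the OS and cluster daemons.
    usable_cores = max(node_cores - 1, 1)
    usable_mem = max(node_mem_gb - 1, 1)

    executors_per_node = max(usable_cores // cores_per_executor, 1)
    total_executors = executors_per_node * num_nodes

    # Split node memory across executors, reserving off-heap overhead.
    mem_per_executor = usable_mem / executors_per_node
    heap_gb = int(mem_per_executor * (1 - mem_overhead))

    # Target ~128 MB per partition, with at least two tasks per core
    # so stragglers do not leave the cluster idle.
    partitions = max(int(input_gb * 1024 / 128),
                     total_executors * cores_per_executor * 2)

    return {
        "spark.executor.instances": total_executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{heap_gb}g",
        "spark.default.parallelism": partitions,
    }
```

For example, on a 10-node cluster of 16-core, 64 GB machines processing 100 GB of input, this heuristic proposes 30 executors of 5 cores and 18 GB each, with 800 partitions. A production tuner would additionally weigh the operations in the analytic flow and the cluster's real-time utilization, as the talk describes.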