Anshuman is a senior data scientist at Alpine Data; his professional experience and interests lie at the intersection of machine learning, computer science, and financial markets. Prior to joining Alpine, Anshuman worked as a product manager at Ayasdi and at MathWorks, and as a trader at Barclays Capital and Futures First, where he built statistical models to trade US and European equity index futures and G10 currencies. Anshuman holds a BTech and an MTech in Computer Science and Engineering from IIT, and an MBA from UNC Kenan-Flagler.
Topological data analysis (TDA) has been a hot topic in the data science community for the last 10 years. The TDA Mapper algorithm offers a mathematically elegant way to investigate the structure of complex, high-dimensional datasets and to isolate localized patterns that are amenable to statistical modeling. Open-source implementations of the core TDA Mapper algorithm exist, but they are written in languages such as R, MATLAB, and Python. As a result, they are limited to small datasets, single-node execution, and modest performance. In our talk, we present the first open-source, scalable implementation of the Mapper algorithm for topological data analysis using Spark. We present the results of our tests on enterprise-scale datasets and explain why a Spark implementation is a prerequisite for widespread adoption in enterprise data science. Finally, we discuss the challenges we faced and the key insights we gleaned from scaling algorithms to enterprise data.

Key takeaways:
- An open-source, scalable implementation of TDA Mapper
- Key learnings from scaling complex analytics to terabyte-scale datasets