Everyone knows how hard it is to recruit engineers and data scientists in Silicon Valley. At Alpine Data Labs, we think what we’re up to is pretty fun and challenging, but we still have to compete with other start-ups as well as the big internet companies to attract the best talent. One thing that can help is to be able to say that you’re working with the most innovative and powerful technologies.
Last year, I was interviewing a talented engineer with a strong background in machine learning. He said that the one thing he wanted to do above all was to work with Apache Spark. “Will I get to do that at Alpine?” he asked.
If it had been even a year earlier, I would have said “Sure…at some point.” But in the meantime I’d met several of the members of the AMPLab research team at Berkeley, and been impressed with their mature approach to building a platform and ecosystem. And I’d seen enough companies installing Spark on their dev clusters that it was clear this was a technology to watch. In a remarkably short time, it went from experimental to very real. And now prospects in the Alpine pipeline were asking me if it was on the roadmap. So yes, I told my candidate. “You’ll be working on Spark from day one.”
Last week, Alpine announced at GigaOM that it’s one of the first analytics companies to leverage Spark for building predictive models. We demonstrated the Alpine engine running on Pivotal’s Analytics Workbench, where it ran an iterative classification algorithm (logistic regression) on 50 million rows in less than 50 seconds.
Furthermore, we were officially certified on Spark by the team at Databricks. It’s been an honor to work with them and the research team at Berkeley. We think their technology will be a serious contender for the leading platform for data science.
Spark is more to us than just speed. It’s really the entire ecosystem that represents such an exciting paradigm for working with data.
Still, the core capability of caching data in memory was our primary consideration, and we have seen our iterative algorithms speed up by one or even two orders of magnitude (thanks again to that Pivotal cluster).
We’ve always had this mantra at Alpine: “Avoid multiple passes through the data!” And we’ve designed many of our machine learning algorithms to avoid scanning the data too many times, packing as many calculations as we can into each MapReduce job, like a waiter piling up plates to try to clear a table in one go. But it’s rare that we can avoid repeated passes entirely, because iterative algorithms need to see the data again on every iteration. With Spark, it’s incredibly satisfying to watch the progress bar zip along as the system re-uses data it’s already seen before.
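To make the point about repeated passes concrete, here is a minimal sketch in plain Python. It is not Alpine's code and it does not use the Spark API. The `ScanCountingSource` class, the `train` function, and the toy dataset are all hypothetical names invented for this illustration. It runs a simple gradient-descent logistic regression twice: once re-reading the source on every iteration, and once reading it a single time and keeping the rows in memory.

```python
import math
import random


class ScanCountingSource:
    """Hypothetical stand-in for a dataset on disk; counts full scans."""

    def __init__(self, rows):
        self._rows = rows
        self.scans = 0

    def scan(self):
        # Every call models one full (expensive) pass over the stored data.
        self.scans += 1
        return iter(self._rows)


def gradient(rows, weights):
    """One pass over the rows: average gradient of the logistic loss."""
    grad = [0.0] * len(weights)
    count = 0
    for features, label in rows:
        z = sum(w * x for w, x in zip(weights, features))
        pred = 1.0 / (1.0 + math.exp(-z))
        for j, x in enumerate(features):
            grad[j] += (pred - label) * x
        count += 1
    return [g / count for g in grad]


def train(source, n_features, iterations, step, cache):
    """Batch gradient descent; `cache=True` reads the source only once."""
    cached = list(source.scan()) if cache else None
    weights = [0.0] * n_features
    for _ in range(iterations):
        rows = cached if cache else source.scan()
        grad = gradient(rows, weights)
        weights = [w - step * g for w, g in zip(weights, grad)]
    return weights


def make_rows(n, seed=0):
    """Toy data (an assumption for the demo): label is 1 when x > 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        rows.append(((1.0, x), 1.0 if x > 0 else 0.0))
    return rows


if __name__ == "__main__":
    rows = make_rows(200)
    uncached = ScanCountingSource(rows)
    cached = ScanCountingSource(rows)
    train(uncached, 2, 20, 0.5, cache=False)
    train(cached, 2, 20, 0.5, cache=True)
    print("scans without cache:", uncached.scans)
    print("scans with cache:", cached.scans)
```

With 20 iterations, the uncached run scans the source 20 times and the cached run scans it once, while both arrive at the same weights. On a real cluster each avoided scan is a trip to disk, which is where this kind of speedup for iterative algorithms comes from.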
Another thing that’s getting our engineers excited is Spark’s MLlib, the machine learning library written on top of the Spark runtime. Alpine has long thought that machine learning algorithms should be open source. (I helped to kick off the MADlib library of analytics functions for databases, and Alpine now uses it extensively.) So we’re now beginning to contribute some of our code back into MLlib. Moreover, we think MLlib and MLI have the potential to be a more general repository for open-source machine learning.
So I’ll congratulate the Alpine team for helping to bring the power of Spark to our users, and I’ll also congratulate the Spark team and Databricks for making it possible!