Christopher Nguyen is CEO and co-founder of Arimo, the leader in enterprise big apps. Previously, he served as engineering director of Google Apps and co-founded two successful startups. As a professor, Christopher co-founded the Computer Engineering program at HKUST. He earned his B.S. degree from the University of California, Berkeley, summa cum laude, and a Ph.D. from Stanford, where he created the first standard-encoding Vietnamese software suite, authored RFC 1456, and contributed to Unicode 1.1. He is a co-creator of the open-source Distributed DataFrame project (http://ddf.io).
In this talk we will discuss how Adatao has successfully built a full-featured, powerful enterprise analytics solution with Spark. Features include web-based reporting/visualization/publishing (“basic analytics”) as well as real-time, interactive data mining and machine learning (“advanced analytics”) on large data sets. What used to take hours is now routinely accomplished in seconds. We will present the architecture that made this possible, built on Spark/Shark/HDFS and other subsystems, with Python- and R-scriptable front-ends. We will also discuss some use cases where large enterprises are successfully deploying this solution, and lessons learned.
In this talk we will present the underlying abstraction called Distributed DataFrame (DDF) that powers the rapid construction of applications like Adatao pInsights and pAnalytics, directly on top of Spark RDDs. This has enabled Adatao to provide easy interfaces such as natural language, R, and Python into the underlying Spark/Shark engine. DDF’s goal is to make the big-data API as simple and accessible to scientists and engineers as the equivalent “small-data” RDBMS API. The core idea behind DDF is to combine decades of wisdom in (a) RDBMS, (b) R data science, and (c) distributed computing, and provide the API user with a simple yet rich set of idioms: friendly SQL queries, easy data-table filtering and projections, transparent handling of missing data, and quick access to machine-learning algorithms, while retaining direct access to the underlying Spark RDD representation as needed. DDFs bring huge benefits to their users: many of the well-established idioms of RDBMS and data science are accessible within one or two lines of code, yielding high analytic application-development productivity. DDF’s architecture is componentized and pluggable by design, even at run-time, making it easy for users to replace or extend any component (“handler”) at will without having to modify the API or ask for permission.
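To make the pluggable-handler idea concrete, here is a minimal Python sketch of the pattern described above: each capability (here, missing-data handling) lives behind a handler that can be swapped at run time without touching the public API. All class and method names below are illustrative assumptions, not the actual DDF API.

```python
class MissingDataHandler:
    """Default handler: drop any row that contains a missing value."""
    def handle(self, rows):
        return [r for r in rows if None not in r]


class FillZeroHandler(MissingDataHandler):
    """Alternative handler: replace missing values with 0 instead of dropping."""
    def handle(self, rows):
        return [tuple(0 if v is None else v for v in r) for r in rows]


class DDF:
    """Tiny stand-in for a Distributed DataFrame wrapping raw rows."""
    def __init__(self, rows):
        self.rows = rows
        # Components are looked up by name, so they can be replaced at run time.
        self.handlers = {"missing_data": MissingDataHandler()}

    def set_handler(self, name, handler):
        """Swap out a component ("handler") without changing the API."""
        self.handlers[name] = handler

    def drop_na(self):
        """Delegate missing-data treatment to whichever handler is installed."""
        return DDF(self.handlers["missing_data"].handle(self.rows))


ddf = DDF([(1, 2.0), (2, None), (3, 4.0)])
print(ddf.drop_na().rows)   # default handler drops the row with None

ddf.set_handler("missing_data", FillZeroHandler())
print(ddf.drop_na().rows)   # same API call, new behavior: None becomes 0
```

The point of the pattern is that the caller's code (`drop_na()`) never changes; only the installed handler does, which is how a componentized design allows run-time replacement without API modifications.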
The Application Spotlight will highlight selected “Certified on Spark” applications that leverage Spark to help their users derive greater value from their data. For each application there will be a brief demo of key functionality, followed by a fireside chat discussing the developers’ experience with Spark, lessons learned, and their wish list for the future.