Distributed DataFrame (DDF) on Apache Spark: Simplifying Big Data For The Rest of Us

In this talk we will present the underlying abstraction called Distributed DataFrame (DDF) that powers the rapid construction of applications like Adatao pInsights and pAnalytics, directly on top of Spark RDDs. This has enabled Adatao to provide easy interfaces such as Natural Language, R, and Python into the underlying Spark/Shark engine.

DDF’s goal is to make the Big-Data API as simple and accessible to scientists and engineers as the equivalent “small-data” RDBMS API. The core idea behind DDF is to combine decades of wisdom in (a) RDBMS, (b) R Data Science, and (c) Distributed Computing, providing the API user with a simple yet rich set of idioms: friendly SQL queries, easy data-table filtering and projection, transparent handling of missing data, and quick access to machine-learning algorithms, while still allowing direct access to the underlying Spark RDD representation when needed.
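To make the “simple yet rich idioms” concrete, here is a schematic, library-free sketch of what such one-liner operations look like. The `MiniDDF` class and its method names are purely illustrative stand-ins, not DDF’s actual API:

```python
# Toy, in-memory stand-in for a distributed DataFrame; all names here
# are illustrative only, not DDF's real interfaces.

class MiniDDF:
    def __init__(self, rows):
        self.rows = rows  # list of dicts, one dict per row

    def project(self, *cols):
        """Projection: keep only the named columns."""
        return MiniDDF([{c: r[c] for c in cols} for r in self.rows])

    def filter(self, pred):
        """Filtering: keep only rows matching a predicate."""
        return MiniDDF([r for r in self.rows if pred(r)])

    def drop_na(self):
        """Missing-data handling: drop rows with any missing value."""
        return MiniDDF([r for r in self.rows
                        if all(v is not None for v in r.values())])


rows = [
    {"name": "a", "score": 9, "group": "x"},
    {"name": "b", "score": None, "group": "x"},
    {"name": "c", "score": 4, "group": "y"},
]

# Each idiom is a single chained call on the frame object.
result = MiniDDF(rows).drop_na().filter(lambda r: r["score"] > 5).project("name")
print(result.rows)  # [{'name': 'a'}]
```

The point of the sketch is the shape of the API, not the implementation: each common data-science operation is one method call, and the chain reads like the analyst’s intent.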

DDFs bring huge benefits to their users: many of the well-established idioms of RDBMS and data-science are accessible within one or two lines of code, yielding high analytic application-development productivity.

DDF’s architecture is componentized and pluggable by design, even at run-time, making it easy for users to replace or extend any component (“handler”) at will without having to modify the API or ask for permission.
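The run-time pluggable handler design can be sketched as follows; the class and method names are hypothetical illustrations of the pattern, not DDF’s actual component interfaces:

```python
# Sketch of a run-time pluggable "handler" architecture: the core
# object delegates each concern to a swappable component, so users can
# replace a handler without touching the public API. Names are
# illustrative, not DDF's real ones.

class SqlHandler:
    def run(self, query):
        return f"default engine: {query}"


class LoggingSqlHandler:
    """A drop-in replacement handler, e.g. one that adds logging."""
    def run(self, query):
        return f"logged + executed: {query}"


class Frame:
    def __init__(self):
        self.handlers = {"sql": SqlHandler()}

    def set_handler(self, name, handler):
        # Swap a component at run time; callers of sql() are unaffected.
        self.handlers[name] = handler

    def sql(self, query):
        return self.handlers["sql"].run(query)


f = Frame()
before = f.sql("SELECT 1")               # uses the default handler
f.set_handler("sql", LoggingSqlHandler())  # replace at run time
after = f.sql("SELECT 1")                # same API, new behavior
print(before, "->", after)
```

Because callers only ever see the stable `sql()` method, any handler that honors the same contract can be dropped in, which is what makes the architecture extensible without API changes.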

About Christopher Nguyen

Christopher Nguyen is CEO and co-founder of Arimo, the leader in enterprise big apps. Previously, he served as engineering director of Google Apps and co-founded two successful startups. As a professor, Christopher co-founded the Computer Engineering program at HKUST. He earned his B.S. degree summa cum laude from the University of California, Berkeley, and a Ph.D. from Stanford, where he created the first standard-encoding Vietnamese software suite, authored RFC 1456, and contributed to Unicode 1.1. He is a co-creator of the open-source Distributed DataFrame project http://ddf.io.