IBM Watson Analytics for Social Media uses a pipeline for deep text analytics and predictive analytics based on Apache Spark. This session describes our journey from our predecessor product, which used Hadoop in environments dedicated per tenant, to a system based on Apache Spark (both “core” and streaming), Kafka and ZooKeeper that serves more than 3000 tenants. We will describe our thought process, our current architecture, and the lessons we’ve learned since we put the environment into production in December 2015. Key takeaways:
– Changes to design, development and operations thinking required when moving from single-tenancy to multi-tenancy
– Architecture of a multi-tenant Spark solution in production
– Orchestration of several Spark apps within a common data pipeline
– Benefits of Apache Spark, Kafka and ZooKeeper in a multi-tenant data pipeline architecture
Rubén is a Software Engineer at the IBM Research & Development Lab in Germany, currently working on Watson Analytics for Social Media using Scala, Java, Spark, Kafka, MongoDB and ZooKeeper. He is passionate about building large-scale distributed systems and applying machine learning to social media data. Previously he focused on improving the integration between data mining, warehousing and BI systems, authoring several patents and publications. Rubén holds Master’s degrees in Information Technology from the University of Stuttgart and in Telecommunication Engineering from the Polytechnic University of Madrid.
Behar holds a degree in Computer Science from universities in Karlsruhe and Edinburgh. He is currently working as a Software Engineer on IBM Watson Analytics for Social Media using Scala, Java, Spark, Kafka, ZooKeeper and MongoDB. He previously worked for one of the world's largest web hosting companies, building a cloud storage backend on top of Red Hat's Ceph distributed storage platform and Apache Cassandra. During his studies he interned at CERN and co-founded holidu, the world's largest search engine for vacation rentals.