Migrating from RDBMS Data Warehouses to Apache Spark – Databricks

Many companies are migrating their data warehouses from traditional RDBMS to big data platforms, and in particular to Apache Spark. This usually takes considerable time and effort: most developers are used to working with RDBMS and need to ramp up quickly on big data technologies to reach that goal. Having faced this problem multiple times at DBS Bank, we implemented a Spark-based application that helps with the migration process. The application embeds the Spark engine and offers a web UI that lets users create, run, test, and deploy jobs interactively. Jobs are primarily written in native SparkSQL or in other flavours of SQL (e.g. TDSQL).

In the latter case, an intermediate layer translates vendor-specific SQL constructs into Dataset operations (whenever possible) in order to leverage the features of the Catalyst engine. To offer RDBMS-like operations, the software is integrated with CarbonData as a storage layer, allowing users to perform update and delete operations on data. Among other things, the UI lets users validate procedures and run data comparison tasks between different datasets. To simplify deployment, each job can be packaged and released individually. The software produces a metadata file that drives batch execution, in a production environment, of the same transformations defined in the UI.
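
The idea of a metadata file driving batch execution could look roughly like the following. This is a minimal sketch under stated assumptions, not the DBS implementation: the JSON layout, the `run_job` function, and the table names are all hypothetical, and Python's built-in `sqlite3` stands in for the Spark engine so the example runs without a cluster.

```python
import json
import sqlite3
from typing import Callable, List

# Hypothetical metadata format: an ordered list of named SQL steps.
# The real file layout used by the application is not described in this abstract.
JOB_METADATA = """
{
  "job": "daily_balance",
  "steps": [
    {"name": "stage", "sql": "CREATE TABLE staged AS SELECT account, amount FROM txn WHERE amount > 0"},
    {"name": "aggregate", "sql": "CREATE TABLE balance AS SELECT account, SUM(amount) AS total FROM staged GROUP BY account"}
  ]
}
"""


def run_job(metadata_text: str, run_sql: Callable[[str], object]) -> List[str]:
    """Run every step of a job in order and return the names of the steps that ran."""
    job = json.loads(metadata_text)
    executed = []
    for step in job["steps"]:
        # The engine is injected, so the same metadata can be replayed
        # interactively or in a scheduled batch run.
        run_sql(step["sql"])
        executed.append(step["name"])
    return executed


# Stand-in engine (assumption): an in-memory SQLite database with sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (account TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO txn VALUES (?, ?)",
    [("a", 10), ("a", 5), ("b", -3), ("b", 7)],
)

executed = run_job(JOB_METADATA, conn.execute)
result = conn.execute("SELECT account, total FROM balance ORDER BY account").fetchall()
print(executed, result)
```

With Spark, the `run_sql` argument would be the `sql` method of an active `SparkSession`, and the steps would create views or tables in the warehouse rather than in SQLite. Passing the engine in as a callable keeps the job definition independent of where it runs.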

During the talk we will showcase all of the above features and explain how each of them helps ETL developers at DBS Bank migrate traditional RDBMS SQL code to Spark.

Session hashtag: #SAISExp16

About Matteo Pelati

Matteo is the Head of Data Engineering at DBS Bank, overseeing the design and development of the entire DBS big data software platform. Matteo has more than 15 years of experience in software engineering. In recent years he has focused on scalable big data platforms and machine learning, specifically using Hadoop and Spark. Matteo has previously held different roles in startups and MNCs: he has led engineering teams at DataRobot, Bubbly, Microsoft, and Nokia.

About Chandra Sekhar Saripaka

Chandra Sekhar Saripaka is a product developer, big data professional, and data scientist. He has deep experience in financial products, CMS, and identity management and is an expert in data crunching at terabyte scale on graphs and Hadoop. Previously, Chandra carried out research on image search indexing and retrieval and built many architectures for enterprise integration and portals, a cloud search engine for e-commerce, and a framework for real-time news recommendation systems. Chandra is currently a Principal Engineer at DBS Bank.