A few months ago, we held a live webinar — Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read — which covered how to build a Just-in-Time Data Warehouse on Databricks with a focus on performing Change Data Capture from a relational database and joining that data to a variety of data sources.
The webinar is available on-demand, and its slides and sample notebooks can be downloaded as attachments. Join the Databricks Community Edition beta to get free access to Apache Spark and try out the notebooks.
We have answered the common questions raised by webinar viewers below. If you have additional questions, please check out the Databricks Forum.
Common webinar questions and answers
Click on a question to see its answer:
- Replacing ETL would be great. Costs for my enterprise data warehouse are killing me (both Oracle and Teradata). Could I take it a step further and use Spark, along with a NoSQL DB like Mongo or Cassandra and an underlying Hadoop layer for storage, to completely replace both my ETL layer and EDW?
- Regarding JSON: if I had a series of individual JSON files in an S3 bucket, could I apply a "SQL" query using schema-on-read across multiple JSON files at once? (A sketch follows this list.)
- On one of the first CDC slides, a record was shown with a date of 1/2 and an amount of $250. Then an update in the source database changed the amount to $350 on 1/5, and a second row was added to the target database, so there were two rows: one with $250 and one with $350. Both rows in the target database showed the updated date of 1/5. Was that intentional, updating the last-updated date on the original row in the target database? I would have assumed the $250 row shouldn't have had its last-updated date changed. (A sketch of a typical versioning pattern follows this list.)
- Can you share some ideas on how to handle column renames as well? (See the sketch after this list.)
- Can Parquet on S3 and Spark actually replace an MPP data warehouse such as Teradata or Redshift and still deliver the same MPP performance?
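For the schema-on-read question above, the short answer is yes: Spark can infer a schema across many JSON files and expose them to SQL as a single table. Here is a minimal PySpark sketch, assuming a hypothetical s3a://my-bucket/events/ prefix and hypothetical customer_id and amount fields:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-json").getOrCreate()

# Read every JSON file under the prefix; Spark infers the schema at read time
events = spark.read.json("s3a://my-bucket/events/*.json")

# Register a temporary view so a single SQL query spans all of the files
events.createOrReplaceTempView("events")

spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM events
    GROUP BY customer_id
""").show()
```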
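For the CDC question, one common pattern is to append each change captured from the source as its own row, each carrying its own update timestamp, and then derive the "current" view by keeping only the latest version per key; earlier versions retain their original dates. A minimal sketch of that idea, assuming hypothetical id, amt, and updated_date columns:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cdc-latest-version").getOrCreate()

# Each change captured from the source lands as its own row with its own date
changes = spark.createDataFrame(
    [(1, 250, "2016-01-02"),   # original insert
     (1, 350, "2016-01-05")],  # later update from the source database
    ["id", "amt", "updated_date"],
)

# The "current" view keeps only the most recent version of each key;
# the older $250 row retains its original updated_date
latest_first = Window.partitionBy("id").orderBy(F.col("updated_date").desc())
current = (changes
           .withColumn("rn", F.row_number().over(latest_first))
           .where("rn = 1")
           .drop("rn"))
current.show()
```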
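On column renames, one approach is to normalize the old name to the new one as the data is read, so downstream queries only ever see the current schema. A minimal sketch, assuming a hypothetical legacy cust_nm column being folded into customer_name; note that when schema inference spans old and new files, both columns can appear and need to be coalesced:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("handle-column-rename").getOrCreate()

df = spark.read.json("s3a://my-bucket/customers/*.json")

if "cust_nm" in df.columns and "customer_name" in df.columns:
    # Old and new files were read together: merge the two variants into one
    # column, then drop the legacy name
    df = (df.withColumn("customer_name",
                        F.coalesce(F.col("customer_name"), F.col("cust_nm")))
            .drop("cust_nm"))
elif "cust_nm" in df.columns:
    # Only legacy files were read: a straight rename is enough
    df = df.withColumnRenamed("cust_nm", "customer_name")

df.createOrReplaceTempView("customers")
```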