Two weeks ago we held a live webinar – Databricks’ Data Pipeline: Journey and Lessons Learned – to show how we at Databricks used Apache Spark to simplify our own log ETL pipeline. The webinar describes an architecture in which you develop your pipeline code in notebooks, create Jobs to productionize those notebooks, and use the REST APIs to turn all of this into a continuous integration workflow.
We have answered the common questions raised by webinar viewers below. If you have additional questions, please check out the Databricks Forum.
Common webinar questions and answers
Click on a question to see its answer:
- How can I use jstack to debug what the threads are doing if I do not have SSH access to the machines?
- The recommended approach was to reduce the number of partitions by dropping the log date column. However, when optimizing output, we were told to include the date in the directory path to distribute data evenly across partitions. Are these recommendations contradictory?
- What would happen to the persistence of data in Parquet files if we worked with streaming instead of the current solution? Do you foresee issues with things like ACID guarantees when writing the Parquet tables?
- Can we use Databricks and Apache Spark for an “Operational Data Store”? That is, data is ingested in batches, with incremental updates when users change previously loaded data.
- Is there a way to get the content of the logs (the driver log, for example) as data is being appended to the files while a job is running, rather than waiting for the job to finish?