
Last week, we held a live webinar, Databricks for Data Engineers, to provide an overview of the data engineering role, common challenges in building ETL pipelines, and how Databricks helps data engineers easily build production-quality data pipelines with Apache Spark.

Prakash Chockalingam, product manager at Databricks, also gave a live demonstration of Databricks, highlighting features that benefit data engineers, such as:

  • Advanced cluster management functionalities that suit any workload requirements.
  • The ability to interactively build an ETL pipeline via an integrated workspace.
  • Simplified troubleshooting of jobs with monitoring alerts.
  • Job scheduling with helpful features like alerting, custom retry policies, and parallel runs.
  • Notebook workflows, which allow you to build multi-stage production Spark pipelines directly from Databricks notebooks (a minimal sketch follows this list).
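
To make the last item concrete, here is a minimal sketch of a notebook workflow. It assumes it runs inside a Databricks notebook, where `dbutils` is available; the notebook paths and parameters are hypothetical.

```python
# Minimal sketch of a notebook workflow, run from inside a Databricks notebook
# (dbutils is provided by the notebook environment). Paths and parameters below
# are hypothetical placeholders.

# Stage 1: run an ingest notebook; it reports its result via dbutils.notebook.exit(...).
ingest_status = dbutils.notebook.run(
    "/pipelines/01_ingest",            # hypothetical notebook path
    3600,                              # timeout in seconds
    {"run_date": "2016-05-01"},        # parameters passed to the called notebook
)

# Stage 2: only run the transform notebook if ingestion succeeded.
if ingest_status == "OK":
    dbutils.notebook.run("/pipelines/02_transform", 3600)
else:
    raise Exception("Ingest stage failed with status: %s" % ingest_status)
```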

The webinar is now available on-demand, and the slides presented are downloadable as attachments to it.

We have also answered the common questions raised by webinar viewers below. If you have additional questions, check out the Databricks Forum or the new documentation resource.

If you’d like to try Databricks for free, you can start your trial here.

Common webinar questions and answers


How would you integrate an ETL pipeline in production with tools like Chef or Puppet, automated testing tools for continuous integration, and other services?

Do you have any recommendations on the best architecture for integrating IoT data into Databricks using Apache NiFi to S3?

Can you please explain a scenario where Spark with YARN or Spark with Mesos is a justified choice?

Can you please clarify R as a component of Spark?

Does your analytic layer include Spotfire?

Can you SSH into your EC2 instances?

How does Spark compare to Sqoop in transferring data from Oracle to HDFS?
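
For context on this question, a Spark-based equivalent of a Sqoop-style import, reading an Oracle table over JDBC and landing it on HDFS as Parquet, might look roughly like the sketch below; the connection details, table name, and output path are hypothetical, and the Oracle JDBC driver would need to be available on the cluster.

```python
# Hypothetical sketch of a Sqoop-style import done with Spark instead: read a
# table from Oracle over JDBC and land it on HDFS as Parquet. Connection details,
# table name, and output path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-to-hdfs").getOrCreate()

orders = (
    spark.read.format("jdbc")
         .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")    # hypothetical
         .option("dbtable", "SALES.ORDERS")                        # hypothetical
         .option("user", "etl_user")
         .option("password", "...")
         # Parallel reads, roughly analogous to Sqoop's mappers:
         .option("partitionColumn", "ORDER_ID")
         .option("lowerBound", "1")
         .option("upperBound", "10000000")
         .option("numPartitions", "8")
         .load()
)

orders.write.mode("overwrite").parquet("hdfs:///warehouse/orders")  # hypothetical path
```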

Is it possible to restart a job from the failed notebook?

Does Databricks provide any APIs for notebook execution monitoring?

Is Spark SQL the only component used to build ETL pipelines?

Can we implement Type 2 (slowly changing dimension) logic using Spark and do inserts and updates to a target RDBMS?
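
As background on what Type 2 logic involves, one rough way to sketch it with DataFrames is to expire the current row for each changed key, append new current rows, and push the result to a staging table in the target RDBMS; all table layouts, column names, and connection details below are hypothetical.

```python
# Rough sketch of slowly changing dimension (Type 2) handling with DataFrames.
# Assumes `updates` carries the same business columns as the dimension table
# minus the bookkeeping columns (is_current, start_date, end_date). All paths,
# column names, and the JDBC target are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd-type2-sketch").getOrCreate()

dim = spark.read.parquet("hdfs:///warehouse/customer_dim")          # existing dimension
updates = spark.read.parquet("hdfs:///staging/customer_updates")    # incoming changes

changed_keys = updates.select("customer_id").distinct()

# Keep rows for unchanged keys as they are.
unchanged = dim.join(changed_keys, "customer_id", "left_anti")

# For changed keys, close out the current version and keep older history as-is.
expired = (
    dim.join(changed_keys, "customer_id")
       .withColumn("end_date",
                   F.when(F.col("is_current"), F.current_date())
                    .otherwise(F.col("end_date")))
       .withColumn("is_current", F.lit(False))
)

# Incoming rows become the new current versions.
new_rows = (
    updates.withColumn("is_current", F.lit(True))
           .withColumn("start_date", F.current_date())
           .withColumn("end_date", F.lit(None).cast("date"))
)

result = unchanged.unionByName(expired).unionByName(new_rows)

# Spark's JDBC writer appends or overwrites rather than doing row-level updates,
# so one option is to rewrite a staging table and swap it in on the database side.
(result.write
       .format("jdbc")
       .option("url", "jdbc:postgresql://dbhost:5432/warehouse")    # hypothetical target
       .option("dbtable", "customer_dim_staging")
       .option("user", "etl_user")
       .option("password", "...")
       .mode("overwrite")
       .save())
```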

What’s the main difference between Storm and Spark? Can data be processed in real time using Spark?
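
On the second part of this question, Spark processes streams as a series of small micro-batches rather than event by event as Storm does. A minimal Spark Streaming sketch might look like the following; the socket source and port are placeholders.

```python
# Hypothetical sketch of near-real-time processing with Spark Streaming: count
# words arriving on a socket in 10-second micro-batches. The source host and
# port are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-word-count")
ssc = StreamingContext(sc, batchDuration=10)      # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)   # hypothetical source
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```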