Announcing the Delta Lake 0.3.0 Release
We are excited to announce the release of Delta Lake 0.3.0, which introduces new programmatic APIs for manipulating and managing data in Delta tables. The key features in this release are: Scala/Java APIs for DML commands - you can now modify data in Delta tables using programmatic APIs for Delete (#44), Update (#43), and Merge...
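As a rough illustration of what these DML APIs look like, here is a minimal Scala sketch in the style of the 0.3.0 `DeltaTable` API. The table path, predicates, and column names are placeholders rather than examples from the post, and an active `SparkSession` named `spark` (as in spark-shell) is assumed:

```scala
import io.delta.tables.DeltaTable

// Load an existing Delta table by its storage path (placeholder path).
val deltaTable = DeltaTable.forPath(spark, "/data/events")

// Delete: remove all rows matching a SQL predicate.
deltaTable.delete("date < '2017-01-01'")

// Update: rewrite columns on rows matching a predicate,
// using SQL expression strings for the new values.
deltaTable.updateExpr(
  "eventType = 'clck'",
  Map("eventType" -> "'click'"))
```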
Getting Data Ready for Data Science: On-Demand Webinar and Q&A Now Available
On June 25th, our team hosted a live webinar — Getting Data Ready for Data Science — with Prakash Chockalingam, Product Manager at Databricks. Successful data science relies on solid data engineering to furnish reliable data. Data lakes are a key element of modern data architectures. Although data lakes afford significant flexibility, they also face...
Open Sourcing Delta Lake
Build reliable data lakes effortlessly at scale. We are excited to announce the open sourcing of the Delta Lake project. Delta Lake is a storage layer that brings reliability to data lakes built on HDFS and cloud storage, providing ACID transactions through optimistic concurrency control between writes, and snapshot isolation for consistent reads...
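For a concrete feel of the basic usage, here is a minimal, hedged quickstart sketch. The path is a placeholder, and the snippet assumes a Spark build with the Delta Lake package on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-quickstart")
  .getOrCreate()

// Write a DataFrame as a Delta table; each write is an ACID transaction
// recorded in the table's transaction log.
spark.range(0, 5).write.format("delta").save("/tmp/delta-table")

// Readers see a consistent snapshot of the table, even while writes proceed.
spark.read.format("delta").load("/tmp/delta-table").show()
```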
Efficient Upserts into Data Lakes with Databricks Delta
Simplify building big data pipelines for change data capture (CDC) and GDPR use cases. Databricks Delta, the next-generation engine built on top of Apache Spark™, now supports the MERGE command, which allows you to efficiently upsert and delete records in your data lakes. MERGE dramatically simplifies how a number of common data pipelines can be...
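To illustrate the shape of an upsert with MERGE, here is a hedged Spark SQL sketch. The table names (`customers`, `updates`) and columns are hypothetical, not taken from the post:

```scala
// Upsert: update matching rows and insert the rest in a single atomic MERGE.
// Table and column names below are illustrative placeholders.
spark.sql("""
  MERGE INTO customers AS t
  USING updates AS s
  ON t.customerId = s.customerId
  WHEN MATCHED THEN
    UPDATE SET t.address = s.address
  WHEN NOT MATCHED THEN
    INSERT (customerId, address) VALUES (s.customerId, s.address)
""")
```

The same pattern covers CDC pipelines (apply a batch of changes to a target table) and GDPR deletes (a `WHEN MATCHED THEN DELETE` clause instead of the update).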
Introducing Delta Time Travel for Large Scale Data Lakes
Data versioning for reproducing experiments, rolling back, and auditing data. We are thrilled to introduce time travel capabilities in Databricks Delta, the next-gen unified analytics engine built on top of Apache Spark, for all of our users. With this new feature, Delta automatically versions the big data that you store in your data lake, and...
Apache Spark™ Clusters in Autopilot Mode
Apache Spark™ is a unified analytics engine that lets users address many different use cases with a single distributed computing framework. With the advent of cloud computing, setting up your own Apache Spark platform is relatively easy, and cloud providers offer tools and services that make setup easier still. However, the real hard...
Introducing Databricks Optimized Autoscaling on Apache Spark™
Databricks is thrilled to announce our new optimized autoscaling feature. The new Apache Spark™-aware resource manager leverages Spark shuffle and executor statistics to resize a cluster intelligently, improving resource utilization. When we tested long-running big data workloads, we observed cloud cost savings of up to 30%. What’s the problem with current state-of-the-art autoscaling approaches? Today,...
Transparent Autoscaling of Instance Storage
Big data workloads require access to disk space for a variety of operations, generally when intermediate results will not fit in memory. When the required disk space is not available, jobs fail. To avoid these failures, data engineers and scientists typically waste time trying to estimate the necessary amount of disk space via trial and...
What AWS Per-Second Billing Means for Big Data Processing
Databricks, the Unified Analytics Platform, has always been cloud-first. We believe in the scalability and elasticity of the cloud, and that customers should be able to run large production workloads easily and pay for exactly what they use. Hence, we have been charging our customers at per-second granularity. Until last month, billing on AWS...
Access Control for Databricks Jobs
Secure your production workloads end-to-end with Databricks’ comprehensive access control system. Databricks offers role-based access control for clusters and the workspace to secure infrastructure and user code. Today, we are excited to announce role-based access control for Databricks Jobs as well, so that users can easily control who can access the job output and control the...