Writing a Faster Jsonnet Compiler
This blog post is part of our series of internal engineering blog posts on the Databricks platform, infrastructure management, integration, tooling, monitoring, and provisioning. See our previous post on Declarative Infrastructure for background on some of the technologies discussed here. At Databricks, we use the Jsonnet data templating language to declaratively manage configurations for services running...
Top 5 Reasons for Choosing S3 over HDFS
At Databricks, our engineers guide thousands of organizations to define their big data and cloud strategies. When migrating big data workloads to the cloud, one of the most commonly asked questions is how to evaluate HDFS versus the storage systems provided by cloud providers, such as Amazon’s S3, Microsoft’s Azure Blob Storage, and Google’s Cloud...
Introducing Redshift Data Source for Spark
This is a guest blog from Sameer Wadkar, Big Data Architect/Data Scientist at Axiomine. The Spark SQL Data Sources API was introduced in Apache Spark 1.2 to provide a pluggable mechanism for integration with structured data sources of all kinds. Spark users can read data from a variety of sources such as Hive tables, JSON files, columnar Parquet...
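The "pluggable mechanism" the teaser describes can be illustrated with a minimal sketch: a registry maps format names to reader implementations, so new sources plug in without changing the consuming code. All names below are illustrative, not the actual Spark SQL Data Sources API.

```python
# Sketch of a pluggable data-source registry (illustrative names only,
# not the real Spark SQL Data Sources API).
import csv
import io
import json

_READERS = {}

def register_reader(fmt):
    """Decorator that registers a reader function under a format name."""
    def wrap(fn):
        _READERS[fmt] = fn
        return fn
    return wrap

@register_reader("json")
def read_json(text):
    # Treat input as JSON-lines: one record per line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]

@register_reader("csv")
def read_csv(text):
    # First row is the header; remaining rows become dicts.
    return list(csv.DictReader(io.StringIO(text)))

def load(fmt, text):
    """Dispatch to whichever reader is registered for `fmt`."""
    try:
        return _READERS[fmt](text)
    except KeyError:
        raise ValueError(f"no reader registered for format {fmt!r}")

rows = load("json", '{"id": 1}\n{"id": 2}')
print(rows)  # -> [{'id': 1}, {'id': 2}]
```

Adding support for a new source (Redshift, Parquet, Hive tables) then amounts to registering one more reader, which is the spirit of the API the post goes on to describe.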
Project Tungsten: Bringing Apache Spark Closer to Bare Metal
In a previous blog post, we looked back and surveyed performance improvements made to Apache Spark in the past year. In this post, we look forward and share with you the next chapter, which we are calling Project Tungsten. 2014 witnessed Spark setting the world record in large-scale sorting and saw major improvements across the...