Recent performance improvements in Apache Spark: SQL, Python, DataFrames, and More

In this post, we look back and cover recent performance efforts in Spark. In a follow-up blog post next week, we will look forward and share with you our thoughts on the future evolution of Spark's performance. 2014 was the most active year of Spark development to date, with major improvements across the entire engine.…

Read

Big Graph Analytics with LynxKite & Spark

This is a guest blog from one of our partners, Lynx Analytics. About Lynx Analytics: Lynx Analytics is a data analytics consultancy firm with a focus on graph analytics and proprietary big graph analytics software development. We augment classical data mining methods with our expertise in graph analytics, and apply these methods against large datasets…

Read

Analyzing Apache Access Logs with Databricks Cloud

Databricks Cloud provides a powerful platform to process, analyze, and visualize big and small data in one place. In this blog, we will illustrate how to analyze access logs of an Apache HTTP web server using Notebooks. Notebooks allow users to write and run arbitrary Spark code and interactively visualize the results. Currently, notebooks support three…

Read

New MLlib Algorithms in Spark 1.3: FP-Growth and Power Iteration Clustering

This is a guest blog post from Huawei’s big data global team. Huawei, a Fortune Global 500 private company, has put together a global team since 2013 to work on Spark community projects and contribute back to the community. This blog post describes two new MLlib algorithms contributed by Huawei in Spark 1.3 and their…

Read

The Easiest Way to Run Spark Jobs

Recently, Databricks added a new feature, Jobs, to our cloud service. You can find a detailed overview of this feature here. This feature allows one to programmatically run Spark jobs on Amazon’s EC2 more easily than ever before. In this blog, I will provide a quick tour of this feature. What is a Job? The job…

Read

Celtra Scales Big Data Analysis Projects Six-Fold with Databricks Cloud

We are thrilled to announce that Celtra selected Databricks Cloud to scale its big data analysis projects, increasing the amount of ad-hoc analysis done six-fold. Press release: http://www.marketwired.com/press-release/celtra-scales-big-data-analysis-projects-six-fold-with-databricks-cloud-2009995.htm Celtra provides agencies, media suppliers and brand leaders alike with an integrated, scalable HTML5 technology for brand advertising on smartphones, tablets and desktop. The platform, AdCreator 4, gives clients such as MEC,…

Read

Running Spark GraphX algorithms on Library of Congress subject heading SKOS

This is a guest post from Bob DuCharme. Original article appeared in: http://www.snee.com/bobdc.blog/2015/04/running-spark-graphx-algorithm.html Well, one algorithm, but a very cool one. Last month, in Spark and SPARQL; RDF Graphs and GraphX, I described how Apache Spark has emerged as a more efficient alternative to MapReduce for distributing computing jobs across clusters. I also described how Spark's GraphX…

Read

Deep Dive into Spark SQL’s Catalyst Optimizer

Spark SQL is one of the newest and most technically involved components of Spark. It powers both SQL queries and the new DataFrame API. At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala's pattern matching and quasiquotes) in a novel way to build an extensible query…

Read

A Look Back at Spark Summit East

We are delighted about the success of the first Spark Summit East, held in New York City on March 18th. The summit was attended by a sold-out crowd of over 900 people from more than 300 organizations. Databricks is proud to make all talk videos, slides, training talk videos, and training materials available online for…

Read

Timeful Chooses Databricks Cloud to Enable Intelligent Time Management

We are thrilled to announce that Timeful chose Databricks Cloud to enable intelligent time management with data analytics. Press release: http://www.marketwired.com/press-release/timeful-chooses-databricks-cloud-enable-intelligent-time-management-with-data-2006609.htm Timeful helps its users manage their time better by tracking commitments, categorizing to-do list items and assisting in the development of good lifestyle habits. Deployed as an application on smartphones, Timeful utilizes machine learning to recommend…

Read