NTT DATA: Operating Spark clusters at thousands-core scale and use cases for Telco and IoT

This is a guest blog from one of our partners, NTT DATA Corporation. About NTT DATA Corporation: NTT DATA Corporation is a Japanese IT solution provider and the global IT services arm of NTT (Nippon Telegraph and Telephone Corporation), which ranks among the top 10 telecommunication companies in the world by revenue. At NTT…

Spark Summit 2015 in San Francisco is just around the corner!

We’re proud to announce that the new Spark Summit website is live! This includes the full list of community talks along with the first set of keynotes. With over 260 submissions this year, the Program Committee had its work cut out narrowing the list to 54 talks. We would like to thank everyone who submitted…

Project Tungsten: Bringing Spark Closer to Bare Metal

In a previous blog post, we looked back and surveyed performance improvements made to Spark in the past year. In this post, we look forward and share with you the next chapter, which we are calling Project Tungsten. 2014 witnessed Spark setting the world record in large-scale sorting and saw major improvements across the entire…

Recent performance improvements in Apache Spark: SQL, Python, DataFrames, and More

In this post, we look back and cover recent performance efforts in Spark. In a follow-up blog post next week, we will look forward and share with you our thoughts on the future evolution of Spark's performance. 2014 was the most active year of Spark development to date, with major improvements across the entire engine.…

Big Graph Analytics with LynxKite & Spark

This is a guest blog from one of our partners, Lynx Analytics. About Lynx Analytics: Lynx Analytics is a data analytics consultancy firm with a focus on graph analytics and proprietary big graph analytics software development. We augment classical data mining methods with our expertise in graph analytics, and apply these methods against large datasets…

Analyzing Apache Access Logs with Databricks Cloud

Databricks Cloud provides a powerful platform to process, analyze, and visualize big and small data in one place. In this blog, we will illustrate how to analyze access logs of an Apache HTTP web server using Notebooks. Notebooks allow users to write and run arbitrary Spark code and interactively visualize the results. Currently, notebooks support three…

New MLlib Algorithms in Spark 1.3: FP-Growth and Power Iteration Clustering

This is a guest blog post from Huawei’s big data global team. Huawei, a Fortune Global 500 private company, has put together a global team since 2013 to work on Spark community projects and contribute back to the community. This blog post describes two new MLlib algorithms contributed by Huawei in Spark 1.3 and their…

The Easiest Way to Run Spark Jobs

Recently, Databricks added a new feature, Jobs, to our cloud service. You can find a detailed overview of this feature here. This feature makes it easier than ever to run Spark jobs programmatically on Amazon’s EC2. In this blog, I will provide a quick tour of this feature. What is a Job? The job…

Celtra Scales Big Data Analysis Projects Six-Fold with Databricks Cloud

We are thrilled to announce that Celtra selected Databricks Cloud to scale its big data analysis projects, increasing the amount of ad-hoc analysis done six-fold. Press release: http://www.marketwired.com/press-release/celtra-scales-big-data-analysis-projects-six-fold-with-databricks-cloud-2009995.htm Celtra provides agencies, media suppliers and brand leaders alike with an integrated, scalable HTML5 technology for brand advertising on smartphones, tablets and desktop. The platform, AdCreator 4, gives clients such as MEC,…

Running Spark GraphX algorithms on Library of Congress subject heading SKOS

This is a guest post from Bob DuCharme (original article: http://www.snee.com/bobdc.blog/2015/04/running-spark-graphx-algorithm.html ). Well, one algorithm, but a very cool one. Last month, in Spark and SPARQL; RDF Graphs and GraphX, I described how Apache Spark has emerged as a more efficient alternative to MapReduce for distributing computing jobs across clusters. I also described how Spark's GraphX…
