
Today we are excited to announce Brickchain, the next-generation technology for zettabyte-scale analytics, which harnesses all the compute power on the planet. Brickchain is the most scalable, secure, and collaborative data technology ever invented.

As you may know, Databricks was founded by the original creators of Apache Spark, a unified analytics engine that uses massive parallelism to provide unparalleled performance for Big Data workloads. Until today, this parallel processing has been confined to data centers, limiting scalability to mere petabytes of data. As our customers become more and more data-driven, we see this becoming a severe bottleneck in scaling their businesses. Three years ago, we asked ourselves this question: how can we design the next-generation data technology for zettabyte scale?

Hundreds of prototypes later, a team in our European R&D center in Amsterdam came up with this working design: take parallelism to the next level by unlocking compute resources across the globe! It uses secure blockchain technology, commonly known as the technology behind Bitcoin and other cryptocurrencies, to securely distribute work across many independent providers of compute resources. We believe that by unlocking the compute resources available across the globe, Brickchain can tackle Big Data analytics workloads that are one million times larger.

In exchange for their efforts, the providers of compute resources automatically receive credits in a new cryptocurrency called Databrickcoins, or simply Datas. From the perspective of the compute resource partners, they are simply Data mining, much as they would currently be Bitcoin mining. The incentives are exactly the same, and we expect that the intrinsic value of our new cryptocurrency will draw large numbers of compute resource providers to our platform, from data centers to mobile phones and smart microwaves.

So how does this all work? Well, remember that an Apache Spark job consists of a number of stages that are organized as a directed acyclic graph, much like a tree. Within each stage, there are many tasks that can be executed in parallel. Normally these tasks would run on a Spark cluster in a single data center. Brickchain instead distributes these tasks by creating a new blockchain and exposing the work through Brickchain computation brokers. Interested compute resource owners contact the broker to fetch the work items. They perform the requested calculations and, as an integral part of the calculation, also compute a reward in the form of a Databrickcoin token. This token serves as a proof of work, much as in Bitcoin. The outcomes of the calculations are then shared through the Brickchain broker and picked up by tasks from the next stage of the Spark job, and so on until the entire calculation is complete.
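The flow described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not Brickchain's implementation: the names `Broker`, `mine_token`, `token_is_valid`, `provider_step`, and `run_job` are hypothetical, stages are simplified to a linear sequence with no shuffle between them, and the "Databrickcoin token" is modeled as a plain SHA-256 hash puzzle.

```python
import hashlib
from collections import deque


def mine_token(payload: str, difficulty: int = 2) -> int:
    """Search for a nonce whose SHA-256 digest starts with `difficulty` zero hex digits."""
    nonce = 0
    while not token_is_valid(payload, nonce, difficulty):
        nonce += 1
    return nonce


def token_is_valid(payload: str, nonce: int, difficulty: int = 2) -> bool:
    """Check the (hypothetical) proof-of-work token attached to a result."""
    digest = hashlib.sha256(f"{payload}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


class Broker:
    """Hypothetical computation broker: hands out tasks and collects results."""

    def __init__(self):
        self.pending = deque()
        self.results = {}
        self.credits = {}

    def publish(self, task_id, func, partition):
        self.pending.append((task_id, func, partition))

    def fetch(self):
        return self.pending.popleft()

    def submit(self, provider, task_id, result, nonce):
        # Reject results that do not carry a valid proof of work.
        if not token_is_valid(f"{task_id}:{result}", nonce):
            raise ValueError("invalid proof of work")
        self.results[task_id] = result
        self.credits[provider] = self.credits.get(provider, 0) + 1


def provider_step(provider, broker):
    """One compute provider fetches a task, runs it, mines a token, and submits."""
    task_id, func, partition = broker.fetch()
    result = func(partition)
    nonce = mine_token(f"{task_id}:{result}")
    broker.submit(provider, task_id, result, nonce)


def run_job(stages, partitions, providers=("phone", "microwave")):
    """Run stages in order; tasks within a stage go to providers round-robin."""
    broker = Broker()
    for stage_no, func in enumerate(stages):
        for part_no, partition in enumerate(partitions):
            broker.publish((stage_no, part_no), func, partition)
        turn = 0
        while broker.pending:
            provider_step(providers[turn % len(providers)], broker)
            turn += 1
        # Outputs of this stage become the inputs of the next one.
        partitions = [broker.results[(stage_no, p)] for p in range(len(partitions))]
    return partitions, broker.credits
```

For example, `run_job([lambda p: [x * x for x in p], sum], [[1, 2], [3, 4]])` squares each partition in the first stage and sums it in the second, crediting each of the two toy providers for the tasks it completed.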

Now, you may ask, what about security? If the computations are done by arbitrary computers across the globe, then how is your valuable data protected against the prying eyes of the owners of those computers? You can rest assured that we have thought this through, and we have a novel solution. All of the data is encrypted using homomorphic encryption, which means that the data stays encrypted not only when it is transferred but also during the computation. As a result, the Brickchain compute providers have no way to actually look at the data.
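To give a feel for what computing on encrypted data means, here is a toy sketch of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts. This is an assumption made purely for illustration. The post does not say which scheme Brickchain uses, Paillier supports only addition rather than arbitrary computation, and the tiny primes and the `random` module used here are not secure. The function names are hypothetical.

```python
import math
import random


def keygen(p: int = 1009, q: int = 1013):
    """Toy Paillier key generation with tiny, insecure primes (illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; requires Python 3.8+
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt m (0 <= m < n) as (n + 1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, private_key, c: int) -> int:
    """Recover the plaintext using the private key (lam, mu)."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Add two plaintexts by multiplying their ciphertexts; no key is needed."""
    return (c1 * c2) % (n * n)
```

A compute provider holding only `n` and the two ciphertexts can call `add_encrypted` without ever seeing the inputs; only the holder of the private key can decrypt the sum.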

We have successfully tested Brickchain on the industry-standard TPC-DS benchmark, running at scale factor 1,000,000,000,000. To the best of our knowledge, no other system has been able to complete the TPC-DS benchmark even at a scale factor one million times smaller. In our experiments, we also observed some inefficiencies on smaller-scale data, where Brickchain was sometimes up to one million times slower than the current generation of technology, due to the use of blockchain and homomorphic encryption. We also noticed that it occasionally caused widespread Internet outages due to overheating of networking equipment. We have raised this with major cloud providers and telcos so that they can take it into consideration in their 5G network design.

We hope that we have captured your interest in Brickchain, the most scalable, secure, and collaborative data technology ever invented. We will be providing more details about it at the upcoming Spark+AI Summit in San Francisco, and we will be making the product available for private preview in the second half of 2019. For the sake of not bringing down the Web and the Internet, and after congenial conversations with Vint Cerf and Sir Tim Berners-Lee, we will make the product generally available at the beginning of Q2 2020, as 5G networks are being rolled out across the globe.

Join us if you are excited about defining the future of data technologies, without crashing the Web and the Internet.