
Apache Hadoop was created more than 15 years ago as an open source, distributed storage and compute platform designed for large data sets and large-scale batch processing. Early on, it was cheaper than traditional data storage solutions, and businesses didn't need to run it on particular hardware. The Hadoop ecosystem consists of multiple open source projects and can be deployed both on-premises and in the cloud, but it is complex.

But 15-year-old technology isn’t designed for the workloads of today. When it comes down to it, Hadoop is a highly engineered system with a zoo of technologies. It’s resource-intensive and requires highly skilled people to manage and operate the environment. Despite data growth and the need for more advanced analytics like AI/ML, we've seen very few advanced analytics projects deployed in production on Hadoop. It also failed to support the fundamentals of analytics. In a previous blog, we explored the high financial and resource taxes of running Hadoop: the environment is fixed, services run 24/7, capacity is sized for peak processing, upgrades can be costly, and the system is maintenance-intensive. Organizations need dedicated teams to keep the lights on, and the system’s fragility affects their ability to get value from all their data.

Effectively tapping into AI/ML and the value of all your data requires a modernized architecture. This blog will walk through how to do just that and the top considerations when organizations plan their migration off of Hadoop.

Importance of modernizing the data architecture 

An enterprise-ready modern cloud data and AI architecture provides seamless scale and high performance, which go hand in hand with the cloud in a cost-effective way. Performance is often underestimated as a criterion, but the shorter the execution time, the lower the cloud costs, because cloud compute is typically billed for the time it runs.

It also needs to be simple to administer so that data teams can focus more on building out use cases, not managing infrastructure. And the architecture needs to provide a reliable way to deal with all kinds of data to enable predictive and real-time analytics use cases to drive innovation. Enter the Databricks Lakehouse Platform, built from the ground up on the cloud and supporting AWS, Azure, and GCP. It's a managed collaborative environment that unifies data processing, analytics via Databricks SQL, and advanced analytics like data science and machine learning (ML) with real-time streaming data. This removes the need to stitch together multiple tools, worry about disjointed security, or move data around, because data resides in the organization’s cloud storage within Delta Lake. Everything is in an open format accessed by open source tooling, enabling organizations to maintain complete control of their data and code.

Top considerations when planning your migration off of Hadoop

Internal questions

Let's start by talking about planning the migration. As with any journey, there are several things data teams, CIOs, and CDOs need to go through. Most will start with two questions: Where am I now? Where do I need to go? They then assess the composition of the current infrastructure and plan for the new world along the way. There will be a lot of new learning and self-discovery at this point, and data teams will test and validate some assumptions. Finally, they can execute the migration itself. Questions organizations should ask before starting the migration include:

  • Why do we want to migrate? Perhaps the value is no longer there, the organization is not innovating as fast as its competition, or the promise of Hadoop has not materialized. Perhaps there is a costly license renewal coming up, a particular version of the Hadoop environment is reaching end of life, or a hardware refresh is on the horizon that the CIO and CFO want to avoid. Possibly all of the above and more.
  • What are the desired start and end dates?
  • Who are the internal stakeholders needed for buy-in?
  • Who needs to be involved in every stage? This will help map what resources will be required.
  • Lastly, how does the migration fit into the overall cloud strategy? Is the organization going to AWS, Azure, or GCP?

Migration assessment

Organizations must start by taking an inventory of everything that needs to be migrated. Take note of the environment and its various workloads, and then prioritize the use cases that need to be migrated. While a big bang approach is possible, a more realistic approach for most will be to migrate project by project. Furthermore, organizations will need to understand what jobs are running and what the code looks like. In most scenarios, organizations also have to build a business justification for the migration, which includes calculating the existing total cost of ownership and forecasting the cost of Databricks itself. Lastly, by completing the migration assessment, organizations will have a better sense of their migration timeline and how it aligns with the originally planned schedule.
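The assessment steps above (inventory, prioritization, and a cost comparison) can be sketched in a few lines of Python. This is only an illustration: the `Workload` fields, the 1-to-5 scoring scale, the ranking rule, and every dollar figure are hypothetical placeholders, not values from any real environment or a Databricks methodology.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    business_value: int  # hypothetical scale: 1 (low) to 5 (high)
    complexity: int      # hypothetical scale: 1 (simple) to 5 (hard to migrate)
    jobs: int            # number of scheduled jobs in this use case


def prioritize(workloads):
    """Order use cases so high-value, low-complexity ones migrate first.

    The score (value minus complexity) is an assumed ranking rule;
    ties are broken alphabetically by name.
    """
    return sorted(
        workloads,
        key=lambda w: (-(w.business_value - w.complexity), w.name),
    )


def annual_tco(licenses, hardware, staff, other=0.0):
    """Sum the yearly cost components of one environment."""
    return licenses + hardware + staff + other


# Placeholder inventory; replace with the real list of use cases.
inventory = [
    Workload("nightly_etl", business_value=4, complexity=2, jobs=40),
    Workload("clickstream_reports", business_value=3, complexity=4, jobs=15),
    Workload("fraud_scoring", business_value=5, complexity=3, jobs=8),
]
plan = prioritize(inventory)

# Placeholder cost figures, for illustration only.
current_tco = annual_tco(licenses=500_000, hardware=300_000, staff=400_000)
forecast_tco = annual_tco(licenses=0, hardware=0, staff=200_000, other=600_000)
estimated_savings = current_tco - forecast_tco

for position, workload in enumerate(plan, start=1):
    print(position, workload.name, f"({workload.jobs} jobs)")
print("Estimated annual savings:", estimated_savings)
```

With these placeholder scores, the migration order is `fraud_scoring`, `nightly_etl`, then `clickstream_reports`. A real assessment would likely weigh more factors, such as data volume, dependencies between jobs, and deadlines tied to license renewals.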

Technical planning phase

The technical phase carries a significant amount of weight when it comes to Hadoop migration. Here, organizations need to think through their target architecture and ensure it will support the business for the long term. The general data flow will be similar to what is already there. In many cases, the process includes mapping older technologies to new ones or simply optimizing them. Organizations must also assess how to move their data to the cloud along with the workloads. Will it be a lift and shift, something more transformative that leverages the new capabilities within Databricks, or a hybrid of both? Other considerations include data governance and security, and the introduction of automation where possible, which helps ensure a smooth migration because automated steps are less prone to error and introduce repeatable processes. Here, organizations should also ensure that existing production processes are carried forward to the cloud and tie into existing monitoring and operations.

Evaluation and enablement

It’s essential to understand what the new platform has to offer and how things translate. Databricks is not Hadoop, but it provides similar functionality for data processing and data analytics, with greater performance and scale across all the data. It’s also recommended to conduct some form of evaluation, such as targeted demos or workshops, or to jointly plan a production pilot to vet an approach for the environment.

Migration execution

The last consideration is executing the migration. Migration is never easy. However, getting it right the first time is critical to the success of the modernization initiative and to how quickly the organization can start to scale its analytics practices, cut costs, and increase overall data team productivity. The organization should first deploy an environment, then migrate use case by use case, moving the data first and then the code. To ensure business continuity, the organization should consider running workloads on both Hadoop and Databricks in parallel. Validation is required to ensure that results in the new environment are identical. Once validation passes, the decision can be made to cut over to Databricks and decommission the use case from Hadoop. Organizations will rinse and repeat across the remaining use cases until they are all transferred, after which the entire Hadoop environment can be decommissioned.
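One way to approach the validation step during a parallel run is to compare a fingerprint of each table's output from both environments. The sketch below is a minimal, hypothetical illustration in plain Python: `table_fingerprint` and `outputs_match` are invented helper names, and the rows are sample data. In practice the rows would come from queries against each platform, and large tables would call for a distributed comparison rather than an in-memory one.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    Each row is hashed with SHA-256 and the digests are summed modulo
    2**256, so the result does not depend on row order. Values must
    have the same types on both sides (for example, 10 and 10.0 hash
    differently because their repr() differs).
    """
    count = 0
    total = 0
    for row in rows:
        digest = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        total = (total + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, total


def outputs_match(hadoop_rows, databricks_rows):
    """True when both outputs have the same row count and checksum."""
    return table_fingerprint(hadoop_rows) == table_fingerprint(databricks_rows)


# Sample rows standing in for the same job's output in each environment.
hadoop_output = [(1, "a", 10.0), (2, "b", 20.5)]
databricks_output = [(2, "b", 20.5), (1, "a", 10.0)]  # same rows, different order
drifted_output = [(1, "a", 10.0), (2, "b", 20.0)]     # one value differs

print(outputs_match(hadoop_output, databricks_output))
print(outputs_match(hadoop_output, drifted_output))
```

Here the first comparison succeeds because only the row order differs, while the second fails because one value changed. A use case would be a candidate for cutover only after checks like this pass consistently across several parallel runs.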

Migration off of Hadoop is not a question of ‘if’ but ‘when’

A lot of credit goes to Hadoop for the innovation it fueled from the time of its inception to even a few years ago. However, as organizations look to do more with their data, empower their data teams to do more analytics and AI, and less infrastructure maintenance and data management, the world of data and AI is in need of a Hadoop alternative. Organizations worldwide have realized that it’s no longer a matter of if migration is needed to stay competitive and innovate, but a matter of when. The longer organizations wait to evolve their data architecture to meet the growing customer expectations and competitive pressures, the further behind they fall while incurring increasing costs. As organizations begin their modernization journey, they need a step-wise approach that thoroughly explores each of the five considerations across the entire organization and not only within silos of the business. To learn more about the Databricks migration offerings, visit databricks.com/solutions/migration.
