The leading enterprises continue to drive digital transformation and are modernizing their data architecture to take advantage of the many economic and functional benefits enabled by the cloud. While the move to the cloud is making companies more competitive, lean and nimble, many technical teams are concerned about the complexities and business risks associated with large scale data migrations.
Join technical experts from Infosys and WANdisco as they share technical insights about the risks and costs associated with large scale data migrations. Learn how technical teams can avoid these business risks by leveraging a LiveData approach using WANdisco solutions, and how Infosys and WANdisco have recently worked together on behalf of a global retailer on a successful 3.5 petabyte business-critical data migration project, completing it in 72 days with minimal business disruption and zero data loss.
– Hello everybody, this is Paul Scott-Murphy from WANdisco. I’m gonna be presenting to you today with Mada from Infosys on Managing the Business Risks of Large-scale Hadoop Migrations.
So the agenda for the conversation will really be talking through some of the drivers that organizations are faced with in cloud migration. We’ll be talking specifically about the challenges that emerge from that with large-scale data migrations, what we refer to as the data migration gap. The risks and technical challenges that emerge at scale differ from those for small-scale environments, so we really wanna give a clear understanding of what those are. We’ll make reference to industry studies, and the technical approaches that WANdisco takes to resolving those challenges with what we call our LiveData platform. Mada will talk in specific detail about the application of this technology, along with the processes and blueprints that Infosys provides to lend their expertise to large-scale data migration, and make reference as well to a case study for a U.S. retailer that’s taken this approach for a hugely successful migration of multiple petabytes of data in the Azure cloud environment. So we’ll talk through the technical details behind that and give you a really clear understanding of how this applies in environments where platforms such as Databricks are in constant use for analytics infrastructure. So with that, I’ll hand over to Mada to explain Infosys’ role in our partnership.
– Let me start with reinforcing the importance of migration and modernization in any enterprise’s journey.
You know, our vision for any enterprise which is data oriented is for them to become what we call a live enterprise where, you know, everything is alive. It feels like an entity which is like any other living thing. The foundation for this is absolutely the modernization journey, including the migration to the cloud that takes place, and it’s an imperative for any enterprise. The challenge, though, is the fact that data applications are no longer tier-two applications. In most enterprises they are tier one, and they have huge volumes, even compared to a few years ago. That makes it a little challenging for us to do migration the old-fashioned way.
What happens when you have a large volume of data is that the processes and tools that we used for this when the scale was smaller, in the gigabytes and low terabytes, fail.
We have done a lot of migrations over the years. As the scale grew, the processes and technologies that we use for this had to keep pace with it, and that is due to the fact that outages are unacceptable. You know, if we have to be down for days at a time, that wouldn’t be acceptable to most of our customers. Applications need to keep on running while the migration is taking place. And the scale of the migration further compounds that challenge, and this is where we have brought together the technology from WANdisco’s LiveData platform and the processes and techniques that we use at Infosys, to do this in a seamless, painless way for our customers, launching them toward success in their journey to become a live enterprise. With that, I’ll hand it over to Paul to talk more about the technology we use under the hood.
– Thanks very much, Mada.
WANdisco plays a role in large-scale data migrations that’s very specific to the challenges that emerge when data exists at scale. Mada’s already made reference to the fact that the continued operation of applications is critical for many organizations looking to migrate to cloud infrastructure, perhaps to modernize their application and analytics platform, maybe from an on-premises Hadoop environment to using Spark and Databricks in the cloud. But there are many other reasons for migration as well. So really what we refer to as the data migration gap is how you go about answering that question of the challenges that emerge with data at scale. How do you migrate potentially petabytes or exabytes of what’s increasingly becoming business-critical information, without causing disruption to the applications and platforms that take advantage of those data sets? Ensuring data consistency between environments, so that you can guarantee your data are available in the form you need at the outcome of your migration, can be important for that as well. So WANdisco refers to this, as I said before, as the data migration gap.
I wanna make reference to an industry study that we’ve conducted in order to determine, you know, what’s important for organizations looking to migrate to more modern analytic platforms in the cloud, and really the focus there was around the business risks associated with large-scale data migration. In querying organizations through this study, it was very clear from the responses that there were a few key challenges that most businesses saw as critical to their migration to the cloud. In the first instance, the majority of organizations were unwilling to accept business disruption throughout migration. In fact, the vast majority indicated that a matter of minutes or hours of down time at most, if not no down time at all, was all that was acceptable for the purpose of large-scale data migration, and if you think about what that requires when you’re conducting a migration, it really means that applications need to continue operating throughout the migration itself. It takes time to move data at scale, and you can’t afford the option of ceasing operations while that migration is underway. The second key result that came out of that study was around data consistency. Of course, organizations using large-scale data platforms, Hadoop environments, Apache Spark environments, cloud object storage, and the like, are increasingly doing so for what’s becoming business-critical data. They’re running their organizations on it. They’re integrating with their partners, and they depend on access to the analytic platforms and the data itself being available with confidence that the data represent what the business needs to use and consume. So requiring data consistency throughout migration, and being able to guarantee that data are not lost or modified through migration, is very important as well.
The third key element that we saw in response to the survey was the concern around the level of effort involved in actually performing a migration. Application migration and data migration itself can be time-consuming because of the effort involved in dealing with the details of the migration throughout. Any chance of that being automated or optimized, either through technology or process, then becomes hugely important for how organizations address what we refer to as the data migration gap.
So I’ll talk a little bit about WANdisco’s approach to this which we term LiveData. WANdisco’s technology platform is geared specifically towards answering those questions around how you can migrate to modern analytics infrastructure, particularly in cloud environments without incurring the overheads associated with the expended IT effort, or risking data consistency, or needing to incur down time in applications while a large-scale migration is conducted. So LiveData is what we refer to as the ability to guarantee that your critical data remain continuously available throughout migration, regardless of environments, regardless of platform, and regardless of the scale of data, even while those data continue to change.
So to talk a little bit about how WANdisco does this, how we employ a LiveData approach to achieving this outcome, conceptually it’s quite straightforward. We productize this through a technology offering we refer to as LiveData Migrator, which is available as a standalone product and as cloud services as well. Firstly, we intercept activity between clients and their use of underlying storage. The reason we intercept and interject in that activity is so that we can use consensus across a distributed set of deployments using data at a global scale to obtain an agreement on any changes that those applications want to make to the data itself. We replicate those changes between environments in order to take the outcome of that consensus-led approach and maintain data consistency by replicating data between environments prior to it being changed at the source. So we agree on the replication. Applications make changes to their source environment, and WANdisco’s technology uses this approach of employing distributed consensus to guarantee a consistent outcome across each replica of data regardless of the scale. Now, for pre-existing data we also introduce a scan of content at the source. We only need to do that once. We don’t need to repeatedly scan content. We don’t need to schedule migration, or schedule small batches of the movement of data from one place to another. We conduct a single scan of the source, and we replicate ongoing changes with global consistency by employing this consensus-based approach.
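To make the idea of a single scan plus ongoing change replication concrete, here is a toy, single-process sketch in Python. It is not WANdisco's implementation and does not model distributed consensus at all: an in-memory, ordered change log stands in for the agreed ordering of changes, and the names `Store`, `InterceptingClient`, and `Migrator` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Store:
    """A stand-in for a storage system: path -> file contents."""
    files: Dict[str, bytes] = field(default_factory=dict)


@dataclass
class InterceptingClient:
    """Hypothetical proxy: every change is recorded in one agreed order
    (a plain list here, standing in for consensus) before it is applied
    to the source."""
    source: Store
    log: List[Tuple[str, Optional[bytes]]] = field(default_factory=list)

    def write(self, path: str, data: bytes) -> None:
        self.log.append((path, data))      # agree on the change first
        self.source.files[path] = data     # then apply it at the source

    def delete(self, path: str) -> None:
        self.log.append((path, None))      # None marks a deletion
        self.source.files.pop(path, None)


class Migrator:
    def __init__(self, client: InterceptingClient, target: Store) -> None:
        self.client = client
        self.target = target
        self.applied = 0  # number of log entries already replayed

    def replicate_pending(self) -> None:
        """Replay, in order, any logged changes not yet on the target."""
        while self.applied < len(self.client.log):
            path, data = self.client.log[self.applied]
            if data is None:
                self.target.files.pop(path, None)
            else:
                self.target.files[path] = data
            self.applied += 1

    def scan_once(self, on_each: Optional[Callable[[str], None]] = None) -> None:
        """Copy pre-existing data with one pass over a listing taken up front."""
        for path in sorted(self.client.source.files):
            self.replicate_pending()
            data = self.client.source.files.get(path)
            if data is not None:           # skip files deleted since the listing
                self.target.files[path] = data
            if on_each is not None:
                on_each(path)              # simulated application activity
        self.replicate_pending()


source = Store({"/data/a": b"1", "/data/b": b"2", "/data/c": b"3"})
client = InterceptingClient(source)
target = Store()
migrator = Migrator(client, target)


def app_activity(path: str) -> None:
    # The application keeps working while the scan is in progress.
    if path == "/data/a":
        client.write("/data/a", b"1-updated")
        client.delete("/data/c")
        client.write("/data/new", b"4")


migrator.scan_once(on_each=app_activity)
print(target.files == source.files)
```

In this sketch the source is listed only once, and files that change, appear, or disappear during the scan are picked up from the change log rather than by rescanning.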
So the benefits that result from that are firstly it allows us to automate data migration. If you’re moving from Hadoop, or some other platform on-premises, into a platform like Spark running in the cloud, we can automate the data migration by simply selecting data sets to be replicated, initiating migration, and leaving applications to operate as they would have before, eliminating the down time and business disruption that would result from other approaches. We can guarantee that data won’t be lost or modified during transition because we’re employing a globally capable, consensus-based approach to replicating changes to data. This improves the overall efficiency of the data replication, and in the end enables you to achieve migration more rapidly with minimal or zero down time and disruption to business operations. So to give you an example of that, I’ll hand it back to Mada to talk through a case study referencing a U.S.-based retailer that has taken advantage of this technology, along with the process-led approach that Infosys adopts, to migrate a multiple-petabyte cloud-based storage environment with Azure Databricks as the primary workload. Thanks.
– Thanks, Paul. So the customer that we are referring to is a U.S.-based retailer who has an Azure and Databricks based environment.
The customer in question is a large retailer based in the U.S. The environment that they have is Azure and Databricks heavy.
The environment had been built over a number of years, having migrated from (mumbles) into Databricks. It still used a large number of legacy components, namely ADLS Gen1, and Parquet for, you know, tables and organization. The target was to go to the latest generation, ADLS Gen2, for the target platform, which enables a better security model and also unlocks a number of capabilities with respect to analytics and AI. It would also convert to the Delta Lake format for storage and organization. The challenges are pretty straightforward: multiple petabytes of data that need to be migrated, with a large number of application workloads with interdependencies between them, which cannot really be done in one big-bang approach. The approach we took was to deploy WANdisco’s LiveData platform, combined with our processes and methodologies around inventory and analysis at the beginning, and validation from an independent perspective after the migration, at the end, before cutting over.
This provides a lifecycle view of the approach we took. The analysis and design of the entire inventory takes place at the very beginning of the engagement. We slice and dice this into multiple waves.
In this case, we had about five different waves that tackled different workloads. For example, some of the data engineering workloads also had near-real-time data ingestion happening which couldn’t really be stopped, and some of the workspaces belonged to teams and consumers who had their own lifecycle with respect to how they develop, deploy, code, and deal with the data.
The approach that we took split this into five different waves, and each wave looked something like what’s in the middle of your screen. The historical data was migrated using the one-scan approach that Paul talked about, at the very beginning of that wave. Then we turn on LiveData replication, which keeps the data in sync. We cut over when we are ready, once the code is remediated and tested, which happens in a parallel track. We validated this using what we call a parallel run, which is right at the bottom of the bottom track over there: ingestion keeps running on both platforms, and the ingested data kind of flows across both platforms running in parallel for a period of time. Typically, it’s about a few days, maybe a week, and then that gets validated by the business during that period and gets cut over.
That process allowed us to kind of seamlessly migrate this, minimizing the down time, over a shorter period of time than expected.
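As one illustration of what an automated check during a parallel run could look like, here is a minimal Python sketch that compares two environments by content digest. The talk only says the business validated the data during the parallel run; the specific check, the in-memory dictionaries standing in for the two platforms, and the names `ValidationReport`, `digest`, and `validate` are all assumptions made for this example.

```python
import hashlib
from typing import Dict, List, NamedTuple


class ValidationReport(NamedTuple):
    missing: List[str]      # on the source but not on the target
    unexpected: List[str]   # on the target but not on the source
    mismatched: List[str]   # on both, but with different contents

    @property
    def ok(self) -> bool:
        return not (self.missing or self.unexpected or self.mismatched)


def digest(data: bytes) -> str:
    """Content fingerprint used to compare files without a byte-by-byte diff."""
    return hashlib.sha256(data).hexdigest()


def validate(source: Dict[str, bytes], target: Dict[str, bytes]) -> ValidationReport:
    missing = sorted(p for p in source if p not in target)
    unexpected = sorted(p for p in target if p not in source)
    mismatched = sorted(
        p for p in source
        if p in target and digest(source[p]) != digest(target[p])
    )
    return ValidationReport(missing, unexpected, mismatched)


# Hypothetical snapshot of one wave on both platforms during the parallel run.
old_platform = {"/sales/day1": b"a,b,c", "/sales/day2": b"d,e,f", "/sales/day3": b"g"}
new_platform = {"/sales/day1": b"a,b,c", "/sales/day2": b"d,e,X", "/tmp/scratch": b""}

report = validate(old_platform, new_platform)
print(report.ok, report.missing, report.unexpected, report.mismatched)
```

A wave would only be a candidate for cutover once a report like this comes back clean; anything listed in it points at data to re-examine before the business signs off.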
This kind of gives you an overview of the migration architecture that was deployed. As you can see, the source was ADLS Gen1 and the Databricks environment associated with that, and Gen2 and the Databricks environment associated with that are on the right side, which is our target environment. The LiveData platform provides all the necessary toolkits both to migrate this data and to keep this data globally consistent once the data is initially migrated.
And the outcomes were a testament to the technology as well as the processes that we deployed here. Over 3 1/2 petabytes of data was migrated over five phases, and it took about 10 weeks to complete this migration. We targeted a release every couple of weeks, you know, to happen over the weekend to minimize the impact to the business, and the security requirements, which were stringent, were met by the process.
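For a sense of scale, the figures quoted for this project (3.5 petabytes, and the 72 days given in the session description) imply an average sustained transfer rate that can be worked out directly. The Python sketch below assumes decimal petabytes (10^15 bytes) and treats the whole period as continuous transfer time; it ignores data that changed during the migration and any idle time between waves, so it is only a rough average, not a measured rate. The function name `average_throughput` is made up for this example.

```python
PETABYTE = 10**15  # assumption: decimal petabytes, not pebibytes (2**50)


def average_throughput(data_bytes: float, days: float) -> float:
    """Average bytes per second needed to move data_bytes in the given days."""
    seconds = days * 24 * 60 * 60
    return data_bytes / seconds


data_bytes = 3.5 * PETABYTE
days = 72

bytes_per_second = average_throughput(data_bytes, days)
megabytes_per_second = bytes_per_second / 1e6
gigabits_per_second = bytes_per_second * 8 / 1e9
terabytes_per_day = data_bytes / days / 1e12

print(f"{megabytes_per_second:.0f} MB/s")    # about 563 MB/s
print(f"{gigabits_per_second:.1f} Gbit/s")   # about 4.5 Gbit/s
print(f"{terabytes_per_day:.1f} TB/day")     # about 48.6 TB/day
```

Roughly 4.5 Gbit/s sustained around the clock for ten weeks gives a feel for why stopping applications for the duration of the copy was not an option.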
The scale it took to migrate this was delivered by the underlying LiveData platform, and the risk was minimized to near zero, and that basically makes the migration process easy and painless for the customer. And we of course migrated the Databricks workloads to Delta Lake as well.
– Thanks very much, Mada. Hopefully that was a good presentation for all on the business risks and technical approaches that can be taken to large-scale data migration. We’ve made reference here to some industry analysis of the business risks as seen by organizations wanting to adopt cloud and migrate large-scale analytic platforms to cloud infrastructure, reviewed the technical approach that can be taken with WANdisco’s LiveData platform, as well as the process that our partner Infosys puts around that, and made reference to our successful case study of, again, large-scale data migration to Azure Data Lake Storage Gen2 for a predominantly Azure Databricks environment for multiple petabytes of data. So thanks very much for listening. If you’re interested in any further information about WANdisco’s LiveData platform, you can contact firstname.lastname@example.org, or go to our website, or the Infosys website, to hear more about what we do jointly.