5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

May 27, 2021 04:25 PM (PT)


In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

In this session watch:
Ron Guerrero, Solutions Architect, Databricks

 

Transcript

Ron Guerrero: Hello, everyone. My name is Ron Guerrero. I’m a Senior Solutions Architect at Databricks. I’ve been with the company for over two and a half years, and for the last 10 years I’ve spent most of my time in the data warehousing and big data space. Prior to Databricks, I worked for a Hadoop vendor in both pre- and post-sales, helping customers design and implement data engineering workflows.
Today, I’m going to talk about the importance of modernizing your data architecture and how to embark on the migration journey. Then we’ll cover some topics we get asked about frequently when it comes to migrations.
Why migrate off Hadoop to Databricks? Let’s start with some background on Hadoop. It was created more than 15 years ago as an open-source distributed storage and compute platform. Early on, it was cheaper than the traditional solutions of the time, and you didn’t need to run it on special hardware. It allowed for big data processing, with HDFS as the storage component and MapReduce, and later Apache Spark, as the processing piece. Hadoop also consists of multiple open-source projects, and you can deploy it both on-premises and in the cloud. But it’s complex. It’s a highly engineered system with a zoo of technologies, and you need highly skilled people to manage and operate the environment, when ultimately people just want to process their data, then analyze and query the results. Unfortunately, it’s this complexity that has deterred people from using the system. What we’ve seen is very few advanced analytics projects being deployed in production on Hadoop.
In addition, the environment is fixed: services are operating 24/7, the environment is sized for peak processing, and it can be very expensive to upgrade. It’s also a maintenance-intensive solution. You need dedicated teams to keep the lights on, and the fragility of the system keeps users from getting value from their data.
Enterprises need a modern data analytics architecture. It has to provide both scale and performance in the cloud in a cost-effective way. Simplicity, scale, and performance go hand in hand. Now, why is performance important? Well, the shorter the execution time, the lower the cloud costs. It also needs to be simple to administer, so that the focus can be on building out use cases. And the architecture needs to provide a reliable way to deal with all kinds of data so that it can enable predictive and real-time use cases to drive innovation.
So enter the Databricks Lakehouse Platform, built from the ground up for the cloud; we support AWS, Azure, and GCP. It’s a managed, collaborative environment that unifies data processing, analytics via SQL, advanced analytics (data science and machine learning), and real-time streaming. You don’t need to stitch multiple tools together, worry about disjointed security, or move your data around. Data resides in your cloud storage within the Delta Lake. Everything is in an open format, accessible by open-source tooling. You have full control of your data and your code.
Let’s talk about planning your migration. Before you even start, there are several things you’ll need to go through. As with any journey, you start with questions: where am I now, and where do I need to go? You then assess what you have and plan for the new world. Along the way you’re going to learn a few things and test and validate some assumptions. And finally, you can execute on the migration itself. There is a set of questions you should ask yourself. Why do you want to migrate? Is a license renewal coming up, maybe end of life for a particular version of your Hadoop environment, or a hardware refresh you want to avoid? This typically drives the required timeline. You’ll also want to consider what resources will be needed and what your cloud strategy is. Are you going to Azure, Google, or AWS, or perhaps you have a multi-cloud strategy in place?
Then you go through an assessment. You’re going to take an inventory of all the migration items, take note of the environment and the various workloads, and then prioritize the use cases that need to be migrated. While a big bang approach is possible, and I’ve been working with some customers that are moving all their data and workloads in one shot, a more realistic approach for most will be to migrate project by project. You’ll need to understand what jobs are running and what the code looks like. In some scenarios you’ll also have to build a business justification for the migration, which could include calculating the existing total cost of ownership and forecasting what the costs will be in Databricks.
We then move to the technical planning phase. What should the target architecture look like? The general data flow will be similar to what you already have. In many cases it’s a matter of mapping older technologies to new ones, or perhaps there’s an opportunity to consolidate and optimize. You’ll think about how you can move the data to the cloud and what to do with the workloads. Will it be a lift and shift, or perhaps something more transformative that leverages the new capabilities within Databricks, or maybe a hybrid of both? You’ll also consider data governance and security, and introduce automation where possible, because that ensures a smoother migration that is less prone to error and introduces repeatable processes.
You also want to make sure that your existing production processes are carried forward to the cloud, tying into your existing monitoring and operations. There’s also enablement and evaluation: you need to understand what the new platform has to offer and how things translate. Databricks is not Hadoop, but it provides similar functionality in areas around data processing and data analytics. We also recommend some form of evaluation, such as targeted demos or workshops, or we may jointly plan a proof of technology to help you vet an approach within your environment.
Finally, the last step is executing the migration. You’ll deploy an environment, then migrate use case by use case, first moving across the data, then the code, and for some period of time you’ll continue to run workloads on both Hadoop and Databricks. Validation is required to ensure everything is identical in the new environment. And when things are great, you can cut over to Databricks and decommission the use case from Hadoop. You’re going to rinse and repeat across all the remaining use cases until they’re all transferred across, after which you can decommission the entire Hadoop environment. Databricks, along with our ISV and SI partners, can help you across all of these areas.
But today, we’re going to focus on enablement. Specifically, we’re going to discuss five key areas of migrating to Databricks: how you will administer and operate the new environment, how you move data across, how you will process your data, what the security and governance controls on the platform are, and the ways to interface with Databricks.
Let’s start with administration. First, I want to quickly review some concepts you normally see in Hadoop and how they compare to Databricks. I’ll then give you a walkthrough of the Databricks environment.
Hadoop is a monolithic distributed storage and compute platform. It consists of multiple nodes or servers, each with their own storage, CPU, and memory, and work is distributed across these hosts. There’s a resource management system, YARN, that attempts to ensure workloads share compute accordingly. Hadoop also has metadata and security controls: the Hive metastore contains structure information about data stored in HDFS, and Sentry or Ranger is used to control access to the data. Users and applications have a few ways to access data. They can access data directly from HDFS using the HDFS APIs, or data can be extracted to downstream systems. Data can also be accessed via SQL using a JDBC or ODBC connection, with Hive used for general SQL and in some cases ETL scripts, and Impala or Hive LLAP for interactive scenarios.
When we look at the Databricks side, I want to point out some key differences. You’ll see in the diagram that you can create multiple clusters within a single Databricks environment, each for a specific use case, maybe a cluster per project, per development group, or for batch versus streaming. The main point is you’re not working with a single cluster. The clusters are meant to be ephemeral: the lifespan can be just the duration of the workflow that needs to execute, where the cluster is started when needed and terminated when idle for some time. Databricks does not provide data storage services like HDFS, HBase, or Solr; data resides on your object storage. HBase or Solr have equivalent technologies in the cloud that can be leveraged, whether cloud native or an ISV solution.
If we dig into the details of the Databricks environment, you’ll see that each compute node maps to a Spark driver or a worker. Again, you’ll have multiple clusters, each completely isolated from the others, and this ensures that strict SLAs can be met. You can truly isolate streaming and real-time use cases from each other, and you don’t have to worry about long-lived jobs taking away resources in perpetuity from other workloads. Databricks also has a managed Hive metastore, but you can have Databricks leverage an external one, for example pointing to something like AWS Glue or a metastore service that you spin up somewhere in the cloud. You can specify access controls at the table level, or rely on object store permissions via credential passthrough; we’ll talk about that later in the presentation. As far as endpoints, you can access Databricks via SQL for ad hoc and advanced analytics, or data can be accessed directly from cloud storage.
I’ll now provide a demo of the Databricks environment. This is the Databricks user interface. Databricks is a managed environment where storage is decoupled from compute, so you’re not managing storage, you are provisioning compute, and Databricks has a collaborative notebook environment for development. You can define and deploy workflows in Databricks. While I am showing you the UI, you can also interact with Databricks using REST APIs or the command line interface.
Back to the discussion of a monolithic cluster versus multiple clusters: you’ll see in Databricks you can define multiple clusters, for streaming, for data ingest, for general analytics, or for model inference. Under the runtime column here you can see the software stack that gets deployed for each cluster, and each cluster can have a different version of the runtime. Likewise, you can choose different VM instances for each cluster to use.
Some clusters may be memory optimized, storage optimized, or GPU ready. Let me walk you through an example of how you create a cluster. You click on the Create Cluster button, provide a name for your cluster, and then select a particular runtime. As I mentioned, this is the software stack that gets deployed in the cluster. It’s not just Spark; it also includes additional frameworks that data engineers, data scientists, and analysts would like to use. Some have machine learning capabilities around deep learning; others have GPU libraries precompiled into the software stack.
When it comes to upgrades, it’s as simple as modifying an existing cluster definition and selecting the latest and greatest Databricks runtime. Compare this to Hadoop, where an upgrade may take weeks or even months to execute. You have autoscaling options with our clusters, so you can specify the minimum number of workers that you want the cluster to shrink down to, as well as the maximum. There’s a wide range of VM instance types that you can choose from: storage optimized, memory optimized, compute optimized, as well as GPU-accelerated ones.
You can also customize the cluster by providing your own Spark configuration parameters and environment variables, as well as an initialization script, a bash script that you can use to load libraries or set up the environment on every single node when the cluster starts up. And finally, you can add tagging, which allows you to attribute the VMs or the Databricks spend to a particular line of business, mostly used for chargeback-type scenarios.
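For reference, here is a minimal sketch of defining a cluster like this programmatically with the Databricks Clusters API 2.0 instead of the UI. The workspace URL, token, runtime version, instance type, and tag names are placeholder assumptions; substitute values from your own environment.

```python
# A minimal sketch of creating an autoscaling cluster via the Clusters API 2.0.
# All values below are illustrative placeholders, not a definitive configuration.
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder

cluster_spec = {
    "cluster_name": "ingest-cluster",
    "spark_version": "8.2.x-scala2.12",           # pick a runtime available in your workspace
    "node_type_id": "i3.xlarge",                  # example AWS instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,                # terminate when idle
    "custom_tags": {"cost-center": "analytics"},  # for chargeback reporting
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```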
Now, a general approach to compute in Databricks is to use what you need, when you need it; these won’t be long-lived clusters unless it’s a streaming use case. And it’s possible for a job to provision compute based on a job definition, which we’ll see shortly.
Let me walk you through our notebook environment. Here’s where you’re going to define your code for execution. This is not the only way that you can code against Databricks; you can also do local development and use a feature called Databricks Connect that allows you to develop on your local workstation but have execution run on the cloud in Databricks.
A notebook has a default language; here you’ll see that it’s Python, but you can intermix languages. In this particular notebook, a data engineer is doing some work in Python, ingesting the data and doing some data transformations. At some point, as part of the collaborative capabilities within our platform, you may have an analyst who wants to do some additional processing of that information, and you can see that we change the language from Python to SQL, and now you’re able to interact with your data using standard SQL commands. Further down in your pipeline, you may want to do some data science workloads, and perhaps the data scientists would like to use R. So again, you can change the language to something like R, and data scientists can do their statistical work against the data.
You can also have Git integration with our notebooks, so you can check your code into a source control repository. We have revision history to allow you to see who made what change, when, and what those changes were. And you can also add comments to the notebooks. Again, we’re fostering collaboration across the various personas who are developing pipelines, and as part of the development process they might want to comment on certain pieces of the code being written.
And finally, when the notebook is completed, you’ll want to execute it, perhaps on some scheduled basis, so you can create a job that will execute it. Which leads us to our next and final topic for the walkthrough of our platform: jobs.
I can create a job based on a notebook that I’ve just created, or a JAR file that might have been provided to Databricks; you can also submit Spark jobs using the spark-submit syntax that you might already be familiar with, or a Python script. As part of the job definition, you also specify a cluster definition, and you’ll notice that it’s very similar to the interface for creating a new standalone cluster, so the same options are available. The key thing here is that the definition within this particular job specifies that the cluster is going to be ephemeral: when the job executes, it will create the cluster on the fly, run the code that you provided, and once the code completes, it will tear down the cluster. You can specify the schedule that you want for this particular job, as well as set up notifications and alerts.
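As an illustration, this is roughly what such a job definition looks like through the Jobs API 2.0, with an ephemeral "new_cluster" that is created for each run. The notebook path, cron expression, cluster sizing, and notification address are placeholder assumptions.

```python
# A minimal sketch of defining a scheduled notebook job with an ephemeral cluster
# via the Jobs API 2.0. Values are illustrative placeholders.
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder

job_spec = {
    "name": "nightly-etl",
    "new_cluster": {                              # created when the job runs,
        "spark_version": "8.2.x-scala2.12",       # torn down when it finishes
        "node_type_id": "i3.xlarge",
        "num_workers": 4,
    },
    "notebook_task": {"notebook_path": "/Repos/etl/ingest_orders"},
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # 2:00 AM daily
        "timezone_id": "America/Los_Angeles",
    },
    "email_notifications": {"on_failure": ["data-eng@example.com"]},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```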
There are a lot more features that I could show, but I just wanted to focus on infrastructure management, workflow orchestration, and development. We’ll now talk about data migration.
So in Hadoop, you’re dealing with fixed storage capacity in HDFS. You’re scaling data with compute, so they’re tightly coupled: if you need more storage, you have to buy more servers, which means you also get more CPU that you may or may not need, and there’s housekeeping required to rebalance that data. When you go to the cloud, you are looking at near-limitless capacity, you’re paying for what you use, and there’s no need for a lot of the maintenance normally associated with HDFS. You get a high level of durability and a wealth of options available for migration.
We do recommend moving your data to the Delta Lake in the cloud. There’s a litany of reasons why you’d want to do this, but I’ll highlight a few key points. Delta Lake is open source and an open format, so you’re not locked into a proprietary format. Delta Lake is also performant, allowing for faster queries, which means a better end-user experience with lower cloud costs. It also supports ACID transactions, so you can ensure that the data is reliable. I can have producers of information writing to a dataset at the same time that consumers are reading from it, accessing only committed data, ensuring that consumers get access to the latest and greatest version of that data.
Delta Lake also provides schema enforcement and evolution. You can have workloads that validate that the incoming structure is what is expected. Or, in cases where you know there’s going to be an upgrade to an application on the source system that may add fields, you can let the new fields propagate down into the Delta Lake; this is also possible.
The last thing that we’ll talk about is time travel, which allows you to see a version of your data based on a particular point in time or a version number of the dataset. This allows for auditing and rollback, as well as reproducibility for machine learning experiments.
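To make that concrete, here is a minimal sketch of Delta time travel from a Databricks notebook (where a `spark` session is already available). The table path is a hypothetical example.

```python
# A minimal sketch of Delta Lake time travel, assuming a Delta table at a
# hypothetical path /mnt/lake/gold/orders.
path = "/mnt/lake/gold/orders"

# Inspect the table's history (versions, operations, timestamps).
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)

# Read the table as of a specific version number...
v5 = spark.read.format("delta").option("versionAsOf", 5).load(path)

# ...or as of a point in time, e.g. to reproduce an ML training run.
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2021-05-01 00:00:00")
            .load(path))

# One way to roll back: overwrite the current table with an earlier version.
v5.write.format("delta").mode("overwrite").save(path)
```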
How do we begin with data migration? I like to talk about dual ingestion: you’ve got a current feed to Hadoop, so just add another feed to cloud storage. What I like about this is that it opens the door for new use cases. In many cases, I’ve talked to customers that couldn’t do advanced analytics projects on their current Hadoop environment, and just providing the data to the cloud unlocked the capability to at least work with that data using Databricks. Dual ingestion can also provide an option for backups, so data is available for access in case the primary site is no longer available.
After dual ingestion, you also want to look at how you migrate historical data. How do you migrate the data itself? There are a few options here. Many times there’s already a framework in place to ingest data, whether third-party ETL tools like Informatica or Talend, or an in-house framework. In either case, you’re simply forking the target to write to Hadoop as well as to cloud storage. In other cases, you may need to whip up something simple. The key thing is to just land the data in the cloud; it’s fine to land it in its raw form, and the new data flows can be rather simple. There are options when it comes to moving the data: either a push or a pull. Pushing the data is normally the easiest way to get data into the cloud. It has simple network security, and your data owners or security admins have more control; they can dictate how and when data gets moved to the cloud. As far as technologies, you can use native solutions like DistCp, the distributed copy within Hadoop, where HDFS blocks are copied in parallel to cloud storage; most Hadoop admins or operations folks already know how to use it. Alternatively, a tool like WANdisco allows you to synchronize HDFS with cloud storage, as well as replicate metadata information.
And as mentioned earlier, other traditional ETL tools can move data from one place to another. Cloud solutions exist, like AWS Snowmobile, Azure Data Box, or Google Transfer Appliance, for cases where you’re dealing with petabytes or even exabytes worth of data. The other option is to pull data from Hadoop using Databricks. Perhaps you’re already landing data in a messaging system like Kafka, so it’s just a matter of creating a consumer in Databricks using Structured Streaming. But Databricks can also pull data directly from HDFS or make a connection to on-premises databases via JDBC. Since it’s the outside coming into your enterprise network, a pull approach is more involved from a security perspective.
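For illustration, here is a minimal sketch of the pull approach, assuming network connectivity back to the data center is already in place. Hostnames, paths, the secret scope, and the database details are placeholders, and the appropriate JDBC driver would need to be installed on the cluster.

```python
# A minimal sketch of a Databricks cluster pulling data from on-premises sources
# and landing it in cloud storage as Delta. All names are illustrative.

# Pull raw files straight out of HDFS.
hdfs_df = spark.read.parquet("hdfs://onprem-namenode:8020/data/orders")
hdfs_df.write.format("delta").mode("append").save("/mnt/lake/raw/orders")

# Or pull a table from an on-premises database over JDBC, partitioned for parallelism.
# (Assumes the matching JDBC driver is installed on the cluster.)
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://onprem-db:5432/sales")
           .option("dbtable", "public.orders")
           .option("user", "etl_user")
           .option("password", dbutils.secrets.get("onprem", "db-password"))
           .option("partitionColumn", "order_id")
           .option("lowerBound", 1)
           .option("upperBound", 10000000)
           .option("numPartitions", 16)
           .load())
jdbc_df.write.format("delta").mode("append").save("/mnt/lake/raw/orders_db")
```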
The push option is well understood; it’s the pull option that is more nuanced, and folks often ask me how this approach works. First, you need connectivity from the cloud into your on-premises environment. You can use cloud-native offerings to facilitate that connectivity, and in many cases I’ve seen it already set up for other use cases. For Kerberized Hadoop environments, you will need to create an initialization script. Remember that piece that I showed you when you were creating a cluster? Yes, you’ll need it here so that the Databricks cluster can set itself up for Kerberos authentication.
The clusters are ephemeral, so the script is going to install a Kerberos client every time. The krb5.conf and keytab just need to be accessible on cloud storage, where they can be protected, and the script will run the kinit command, which will grab a Kerberos ticket. Now Databricks will be able to authenticate to any of the Hadoop services. You also need to copy some of the relevant Hadoop configuration files. And the last point that I want to highlight here is that you can have Databricks point to the Hadoop metastore, which is important if you want to access Hadoop tables, Hive tables, using Databricks. Let’s look at a demo that shows data being pulled from an HDP cluster into Databricks.
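As a rough sketch of what such an init script might look like, the snippet below writes one to DBFS with dbutils. The realm, principal, keytab location, package name, and destination paths are all assumptions to adapt for your environment, and the keytab storage location should be locked down.

```python
# A minimal sketch of a Kerberos init script for a Databricks cluster.
# Everything here (paths, realm, principal) is a placeholder assumption.
init_script = r"""#!/bin/bash
set -e
# Install a Kerberos client on every node when the cluster starts.
apt-get update && apt-get install -y krb5-user

# krb5.conf and the keytab have been staged on protected, mounted cloud storage.
cp /dbfs/mnt/secure/krb5.conf /etc/krb5.conf
cp /dbfs/mnt/secure/etl_user.keytab /tmp/etl_user.keytab

# Obtain a Kerberos ticket so Spark can authenticate to the Hadoop services.
kinit -kt /tmp/etl_user.keytab etl_user@EXAMPLE.COM

# Stage the relevant on-prem Hadoop client configs where your Spark config expects them
# (destination path is illustrative and depends on how you wire up the configuration).
mkdir -p /tmp/hadoop-conf && cp /dbfs/mnt/secure/hadoop-conf/*.xml /tmp/hadoop-conf/
"""

dbutils.fs.put("dbfs:/databricks/init/kerberos-init.sh", init_script, overwrite=True)
# Reference dbfs:/databricks/init/kerberos-init.sh as an init script in the
# cluster definition so it runs on every node at startup.
```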
So first, I’m going to bring your attention to the cluster that I’m going to use for this particular demo. It has an init script, where part of the initialization is to point this particular Databricks cluster to use the HDP Hive metastore. Let’s look at the corresponding notebook. It’s going to be extracting data found in Hadoop, and rather than switching between platforms, this notebook is going to pretend to do some of the processing steps that Hadoop would have done. I’m going to highlight the step where Databricks pulls the data.
So Databricks is sharing the same metastore as HDP. To isolate where the data resides, I am going to create a database called Delta Hadoop, which will store the Hadoop artifacts, and you’ll notice in the location parameter that it is pointing to an HDFS location. I want to pull that data into the cloud in a different database; we’re calling it Delta on ADB, or Delta on Azure Databricks. You’ll notice that we don’t specify a location, which means that Databricks will use a default cloud location. If we describe the databases, as I mentioned earlier, the Delta Hadoop database will have its data stored on HDFS, while the cloud version will point to default cloud storage.
The next few cells are the prep work that would have been done in Hadoop: ingesting the data in its raw form, doing some lightweight transformations, and then converting that raw version of the data to a bronze layer, where some additional transformations are done. From that bronze layer, there are some final modifications to create a gold copy, and this is the copy that we want to pull into Databricks. How do we pull it into Databricks? We’re going to issue a CREATE OR REPLACE TABLE with a DEEP CLONE option.
There are two types of clones, deep clone and shallow clone. With a deep clone, you’re copying the underlying data plus the metadata; a shallow clone only copies the metadata. Because we’re doing a data migration, we obviously want a deep clone, and this particular command will also copy partition information from the source table. If you look at the definition of the table, you’ll notice that both the source and the target on the cloud have matching structures. And if we look at the underlying folders where the data resides, you’ll have similar information, except this one is on cloud storage while the source was on HDFS.
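A minimal sketch of that clone step, using the database names mentioned in the demo (delta_hadoop on HDFS, delta_on_adb on cloud storage); the table name orders_gold is a placeholder.

```python
# Deep clone the gold table from the shared-metastore Hadoop database into the
# cloud database, copying both data and metadata (table and database names are
# illustrative).
spark.sql("""
  CREATE OR REPLACE TABLE delta_on_adb.orders_gold
  DEEP CLONE delta_hadoop.orders_gold
""")

# Audit what was cloned and sanity-check row counts on both sides.
spark.sql("DESCRIBE HISTORY delta_on_adb.orders_gold").show(truncate=False)
spark.sql("SELECT COUNT(*) FROM delta_hadoop.orders_gold").show()
spark.sql("SELECT COUNT(*) FROM delta_on_adb.orders_gold").show()
```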
Within Databricks and Delta, you can actually see an audit of the operations that were performed on a given table; a clone operation was performed in this particular scenario. So this is one way to audit and track what has been migrated across. And finally, we can do a check to make sure at least the row counts match, which they do. For this demo, I wanted to highlight that you can have Databricks and Hadoop point to a shared metastore, you can access data assets from Hadoop once you’ve configured the cloud environment to be able to talk to on-prem, and data can be copied into the cloud using Spark.
Let’s move on to data processing. From a mapping perspective, things are pretty straightforward: all data processing in Databricks is going to be done in Spark. I’ll talk about what this means with respect to existing non-Spark workflows. Workflow automation can be tackled in a few ways. Databricks has a job scheduler, which I showed, but some of the features in Oozie, like forking, joining, and decision points, are upcoming features we’ll be releasing later this year; at the moment, this can be done programmatically. You could also use native integrations with Airflow and Azure Data Factory. If you’re currently using Zeppelin or Jupyter notebooks, these will map to Databricks notebooks. But as I mentioned earlier, you can also point a local development environment like IntelliJ or Eclipse to Databricks using Databricks Connect.
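As an example of the Airflow integration, here is a minimal sketch of triggering a Databricks notebook run from an Airflow DAG, assuming the Databricks provider package is installed and an Airflow connection to the workspace exists. The connection ID, notebook path, schedule, and cluster settings are placeholders.

```python
# A minimal sketch of orchestrating a Databricks notebook run from Airflow with
# DatabricksSubmitRunOperator (apache-airflow-providers-databricks).
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="hadoop_migration_etl",
    start_date=datetime(2021, 5, 1),
    schedule_interval="0 2 * * *",   # nightly at 2:00 AM
    catchup=False,
) as dag:

    run_ingest = DatabricksSubmitRunOperator(
        task_id="run_ingest_notebook",
        databricks_conn_id="databricks_default",   # Airflow connection to the workspace
        new_cluster={                              # ephemeral cluster for this run
            "spark_version": "8.2.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 4,
        },
        notebook_task={"notebook_path": "/Repos/etl/ingest_orders"},
    )
```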
Migrating Spark jobs is relatively straightforward. Some Hadoop environments are running an older version of Spark, Spark 1.6 as an example. There are some minor differences between Spark versions, and the Apache Spark docs will indicate what you need to account for. Some older Spark code also uses RDDs. While it’s possible to use RDDs in the newer versions of Spark, you will not be able to leverage the full potential of the Spark optimizer, so we do recommend changing over to DataFrames: it’s going to simplify your code, and your code is going to run faster.
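To illustrate the kind of change involved, here is a small sketch converting RDD-style logic to the DataFrame API. The file layout and column names are illustrative assumptions.

```python
# Legacy RDD-style code: parse CSV lines and sum amounts per customer.
rdd = (sc.textFile("/mnt/lake/raw/orders.csv")
       .map(lambda line: line.split(","))           # assumes customer_id in col 0, amount in col 2
       .map(lambda f: (f[0], float(f[2])))
       .reduceByKey(lambda a, b: a + b))

# DataFrame equivalent: simpler code, and the Catalyst optimizer can prune and push down.
from pyspark.sql import functions as F

df = (spark.read.option("header", "true").csv("/mnt/lake/raw/orders.csv")
      .groupBy("customer_id")
      .agg(F.sum(F.col("amount").cast("double")).alias("total_amount")))
```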
The way you submit jobs is slightly different. You can use spark-submit semantics, but there are more performant ways to submit workloads; specifically, submitting a JAR job is the recommended way in Databricks. And things like the --files option are not necessary, as cloud storage is already shared across all worker nodes.
Finally, something that I’ve seen across the board is hard-coded references to the Hadoop environment, and this would just be a search and replace. I wouldn’t consider any of these items large bodies of work, but they’re something you need to consider with your Spark code. In practice, I’ve seen many customers just run their code in Databricks with minimal tweaks. Where things get tricky is non-Spark workloads; these will be code rewrites. MapReduce is not commonly used anymore, but we do come across it once in a while. While the general framework will have to be rewritten, you may be able to leverage some of the logic in a shared Java library that Spark can invoke.
Sqoop can be converted to Spark using the JDBC source, which has options similar to Sqoop’s, like customized queries or the various parameters needed for parsing incoming data. And Flume is usually converted to a Spark Structured Streaming job; many Flume jobs are pulling data from Kafka and landing it in HDFS. Moving to Spark, you’re moving from a config-based setup to something more programmatic, which is easier for some.
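The Sqoop-to-JDBC conversion looks much like the JDBC pull sketch shown earlier; for the Flume case, here is a minimal sketch of a Structured Streaming job consuming from Kafka and landing to Delta instead of HDFS. Broker addresses, the topic name, and paths are placeholder assumptions.

```python
# A minimal sketch of replacing a Flume-style Kafka-to-HDFS flow with Spark
# Structured Streaming writing to Delta. All names are illustrative.
from pyspark.sql import functions as F

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "clickstream")
          .option("startingOffsets", "latest")
          .load()
          .select(F.col("key").cast("string"),
                  F.col("value").cast("string"),
                  "timestamp"))

(events.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/lake/checkpoints/clickstream")  # exactly-once bookkeeping
 .outputMode("append")
 .start("/mnt/lake/bronze/clickstream"))
```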
NiFi is an interesting technology, as you normally see it outside of Hadoop; Hortonworks, and now Cloudera, sells it as a separate SKU. While there is some overlap with Databricks capabilities, it serves a different purpose: it’s a drag-and-drop, self-service ingest tool. Some customers may continue to use it in the cloud; others look at alternatives like StreamSets.
Hive is highly compatible with Spark SQL; most queries are just going to run as-is in Databricks. Where there are differences, it’s more so in the DDL; for example, Spark has the USING clause versus Hive’s format clauses. We do recommend converting your DDL to the Spark SQL format, as the Spark optimizer will recognize it and you’ll be able to leverage its optimizations. Spark SQL can also use Hive SerDes, UDFs, and UDTFs. Really, the only minor difference that I’ve come across is how map types are treated internally, and look for a future blog post on this. Oozie workflows can be complex, and these may need to be rewritten to an equivalent workflow in Airflow, Data Factory, or some other technology.
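A small sketch of that DDL difference, with illustrative table and column names:

```python
# Hive-style DDL (still runs, but not the recommended form in Databricks):
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales_hive (id INT, amount DOUBLE, region STRING)
  STORED AS PARQUET
""")

# Spark SQL DDL with the USING clause -- typically USING DELTA in the lakehouse,
# which lets the optimizer and Delta features kick in:
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales_delta (id INT, amount DOUBLE, region STRING)
  USING DELTA
""")
```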
Databricks can be interacted with via REST APIs, so pretty much any scheduler can be tied into Databricks. And while it’s possible for environments to have a large number of Oozie workflows, in practice there’s usually a common set of templates, a handful, and once you’ve converted them, they apply to the hundreds or thousands that are out there.
To make the migration easier, you can use a third-party tool such as MLens. It offers the capability to migrate PySpark code and HiveQL, and can convert Oozie workflows to an Airflow equivalent; I understand that Azure Data Factory is on the roadmap.
Let’s talk about security and governance. In Hadoop, you would be managing LDAP for some services, like the admin console or potentially Impala, and for other services you’re using Kerberos for authentication. Authorization is going to be done via either Sentry or Ranger, and for data governance or metadata management, Atlas or Navigator.
In Databricks, authentication can be done through single sign-on with SAML 2.0-supporting corporate directories. Some supported identity providers include Windows Active Directory, Azure Active Directory, Okta, Google Workspace single sign-on, AWS SSO, as well as many others. For authorization, ACLs are available for Databricks objects: you can put permissions on notebooks, jobs, and clusters. In scenarios where users are presented solely with a SQL interface to Databricks, you can define table ACLs, and with table ACLs you can define views to introduce row and column filtering. Built-in functions that return the current user, or the groups that user belongs to, allow you to create dynamic views that can mask or encrypt certain fields. In scenarios where the data is being processed as files, you can leverage cloud-native security with IAM federation or AAD credential passthrough.
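For illustration, a minimal sketch of such a dynamic view, using the built-in is_member() and current_user() functions; the table, view, column, and group names are placeholders.

```python
# A sketch of a dynamic view for table-ACL scenarios: mask a column for users
# outside a hypothetical 'finance' group and filter rows for non-members of
# 'global_sales'. All names are illustrative.
spark.sql("""
  CREATE OR REPLACE VIEW sales_secure AS
  SELECT
    id,
    region,
    CASE WHEN is_member('finance') THEN amount ELSE NULL END AS amount,  -- column masking
    current_user() AS accessed_by                                        -- who is querying
  FROM sales_delta
  WHERE is_member('global_sales') OR region = 'EMEA'                     -- row-level filtering
""")
```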
Here you configure permissions at the cloud storage level, and Databricks passes credential information to the storage service, which then determines whether access should be granted or denied. For example, if I log into Databricks and run my code, and it’s trying to access data in ADLS, my credentials are passed through to ADLS to confirm that I have the necessary permissions. From a metadata management perspective, we integrate with many third-party tools like AWS Glue, Informatica, Alation, and Collibra.
For customers that have a requirement for attribute-based access controls, there are options. Customers that are deeply invested in Ranger will typically look into Privacera, so I’ll talk about that tool today; their founders are the creators of Apache Ranger. The challenge you might be presented with in your Hadoop environment is the myriad of security policies that are already defined in Ranger. In some organizations, there’s hesitation to move away from that security model, so we get asked how you can carry it over into Databricks.
With Privacera and Databricks, you bring the data across from Hadoop into cloud storage. Metastore information can be migrated to the Databricks metastore or moved to something like AWS Glue, and Databricks will then leverage the Ranger policies defined in Privacera. These policies would be extracted from an on-premises Ranger deployment; Privacera provides seamless integration to transfer policies from one system to another.
When you marry the two technologies, you enable Databricks to have row- and column-level data access without the need for views. You can have dynamic or static data masking, as well as attribute-based access controls. Privacera also supports enforcing file-level permissions, including read and write operations.
To dig into the details, the way the products work together is that the Databricks cluster runs a special init script that installs the Ranger plugin. The plugin talks to the Privacera deployment, so when Spark is going to process a query, the query is sent to Privacera for inspection by the Ranger policy manager. If the query is authorized to access the data, the query is returned to Spark for execution. Privacera may inject a query rewrite based on the policies that are defined; for example, if there is row or column filtering set, Privacera may change the WHERE clause that gets executed by Databricks. The query is also audited within Privacera.
What about the SQL community? Well, in Hadoop you had Hue, the Hadoop User Experience, where business analysts can browse through HDFS, issue SQL queries, and get some lightweight visualizations. You have the same capabilities, and more, in the Databricks SQL Analytics workspace. This is a public preview feature that you can try out now.
Our vision in the SQL and BI space is to give SQL users a home in Databricks. While our notebook environment provides a premium experience for data engineers and data scientists, SQL-focused users are traditionally more comfortable with SQL editors and dashboarding capabilities. These are things that we’ve added to our platform, and by adding them, companies can now work with the data where it resides rather than moving it around. It requires minimal setup, with great price performance.
Now, the focus of today’s session is on Hadoop migrations, so I’m not going to do a demo of this feature, but I did want to highlight some of the capabilities. You get a familiar SQL editor with auto-completion, built-in visualizations, and the ability to browse through data. Alerts can be set up based on query results, with notifications, and you can also build dashboards.
There are built-in connectors for many existing BI tools. For example, we have connectors for Power BI and Tableau that have SSO support.
As far as performance, our Delta engine includes a new MPP engine built from the ground up in C++. It leverages SIMD, or single instruction, multiple data extensions in the CPU, where we’ve seen a 20x performance gain compared to vanilla Apache Spark. We’ve also added some metadata performance updates that improve the interactive experience when creating a large number of tables. And our new JDBC/ODBC drivers provide lower latencies and higher data transfer speeds so that query results return faster.
Finally, SQL Analytics allows you to scale out clusters of clusters, accommodating a high level of user concurrency. In closing, your Hadoop journey starts with a plan, and we’re happy to help you with this along with our partner community. Let’s figure out where you’re at and how to get from point A to point B, validate and address the corner cases that may exist in your environment, and help execute the migration. As you think about the migration, keep in mind the topics discussed today: the simplified administration in Databricks, the data and workload migration options, the security controls available in the platform and in the cloud, and the ability for users to access the data in the ways that they want.
We have sample reference architectures for all cloud providers, each outlining some of the typical integration points with cloud-native services. As an example, for Azure we integrate with Event Hubs, Data Factory, Cosmos DB, Synapse, and Azure ML, just to name a few.
For AWS, we have integrations with Glue, Kinesis, Redshift, DynamoDB, and SageMaker, as some examples. And finally for GCP, we integrate with Google Pub/Sub, BigQuery, AI Platform, and others.
For more information, please visit us at databricks.com/migrate. And we look forward to your feedback on today’s session. Thank you very much.

Ron Guerrero

Ron is a Databricks Senior Solutions Architect. He has been working with Big Data platforms and solutions for the last 10 years. For 4 years, Ron worked for a Hadoop Vendor both in technical sales and...