
The 3Ds of Migrating Teradata Workloads to the Databricks Lakehouse Platform

Best practices for migrating Teradata workloads to Databricks
Suresh Matlapudi
Mendelsohn Neil Chan
Abhishek Dey

Many large enterprises have used Teradata data warehouses for years, but the storage and processing costs of on-premises infrastructure severely restricted who could use the resource and for what purposes. On top of that, upgrades are a lengthy process, and in the event of an outage Teradata must ship hardware to the customer and install it in the customer's data center.

Migrating your legacy Teradata data warehouse to the Databricks Lakehouse Platform can accelerate your data modernization journey. Still, it may seem complex and daunting, requiring a well-orchestrated and planned effort. During the initial scoping phase, you may discover that your organization has accumulated hundreds of jobs and thousands of SQL scripts over the years.

However, fear not! Enterprise customers like Walgreens, Sam's Club and many others have successfully migrated their legacy Teradata data warehouses to Databricks, saving millions in infrastructure costs while accelerating innovation and productivity. This blog post presents a systematic strategy to accelerate your journey toward adopting the lakehouse, encapsulated in a framework we call the 3Ds: Discovery, Development, and Deployment.

If you are wondering how Databricks is different from Teradata, the summary matrix below illustrates how the Databricks Lakehouse Platform compares to a traditional data warehouse like Teradata:

Capabilities comparison

| Capability | Databricks | On-Prem Teradata |
|---|---|---|
| Data format | Delta (open source) | Proprietary |
| Data types | Structured, semi-structured, unstructured | Structured, semi-structured |
| Languages supported | SQL, Python, Scala, R | SQL only |
| Use cases | BI, SQL, machine learning/data science, real-time streaming | BI, SQL |
| Reliability | High-quality, reliable data with ACID transactions | High-quality, reliable data with ACID transactions |
| Governance | Fine-grained access control for tables, rows, columns with Unity Catalog | Fine-grained access control for tables, rows, columns |
| Architectural paradigm | Lakehouse Platform | Traditional data warehouse |
| Licensing model | Consumption based | Annual subscription + additional feature(s) + support/maintenance + upgrade costs |

Step 1: Discovery

Profile the Teradata Environment

The first step in the migration process is to comprehensively understand your Teradata environment so you can determine the overall scope and effort required for the initiative. Below are two key questions you'll want to answer about your Teradata deployment:

Question #1: What does my Teradata topology look like in terms of hardware specs and usage? (e.g., storage utilization, system utilization, warehouse object information, query types)

To answer this question, extracting and analyzing system logs from your Teradata warehouse environment is a great place to start. To accelerate this process, the Databricks migrations team has developed a Teradata Profiler tool. The profiler uses Teradata's system tables and a service called PDCR (Performance Data Collection and Reporting), a data application that provides an understanding of system performance, workload utilization, and management. This migration assessment utility helps you automatically discover and profile the current Teradata warehouse environment. In addition, the tool helps with DBU (Databricks Unit) forecasting. The screenshot below illustrates the key insights generated by the Teradata Profiler dashboard (for more information on running the profiler, please get in touch with your Databricks representative):

A sample Teradata Profiler dashboard
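Even without the profiler, you can get a rough sense of scale by querying Teradata's system views directly. The snippet below is a minimal sketch (not the Databricks profiler itself), assuming the open source teradatasql Python driver and placeholder connection details; it sums permanent space per database from DBC.TableSizeV.

```python
# Minimal sketch: rough storage footprint per database before a migration.
# Assumes the open source `teradatasql` driver and placeholder credentials.
import teradatasql

with teradatasql.connect(host="td-host.example.com", user="dbc", password="****") as con:
    with con.cursor() as cur:
        cur.execute("""
            SELECT DatabaseName,
                   SUM(CurrentPerm) / (1024 * 1024 * 1024.0) AS used_gb
            FROM   DBC.TableSizeV
            GROUP  BY DatabaseName
            ORDER  BY used_gb DESC
        """)
        for database_name, used_gb in cur.fetchall():
            print(f"{database_name}: {used_gb:.1f} GB")
```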

Question #2: How many workloads do I need to migrate to Databricks? How easy or complex are my environment's jobs and code patterns?

To answer this second question, you need to review the actual code, scripts, data warehouse objects, and jobs accumulated in your Teradata warehouse environment and create a summary inventory of these artifacts. To accelerate this analysis, Databricks highly recommends using a code profiler or analyzer (e.g., BladeBridge, LeapLogic, Datametica) or a solution accelerator built by one of our BrickBuilder Migration Solution partners. These tools typically provide detailed reports of what's inside your environment and break down data pipelines and jobs into buckets based on complexity. This allows you to scope out the effort required for the migration initiative and any code refactoring that may be necessary along the way.

In addition to analyzing job complexity, these analyzer tools produce several useful summaries, including an inventory of the assets and artifacts in the Teradata environment; examples include:

  • SQL scripts
  • Dynamic SQL
  • Functions
  • Referenced Objects
  • Program-Object Cross Reference
  • Total Lines of Code

Sample summary output of the BladeBridge Code Analyzer

By auto-analyzing the code and scripts, the analyzer gives you a good understanding of your Teradata warehouse environment, so you can produce a detailed migration assessment and effort estimate. You are now ready to embark on the next step in your migration journey!

Step 2: Development

Now that you have assessed your Teradata workloads in the discovery step, the next step is the actual migration of historical data and associated workloads to the Databricks Lakehouse Platform. This section will walk you through the development activities to achieve that.

2.1 Data Warehouse Extraction

To get started with data migration, you can use the Teradata Parallel Transporter (TPT), a client utility that provides scalable, high-speed, parallel data extraction and loading. Using TPT, you can extract the data from all your tables in Teradata at scale and then push the data files into cloud object stores such as AWS S3, Azure Data Lake Storage, or Google Cloud Storage. Utilizing TPT to unload data offers several critical benefits:

  • Define field delimiters, date formats, and encoding types
  • Control resource allocation for data unloading to achieve better performance
  • Define the number of generated output files and their file type
  • Checkpoint and resume operations in case of failures and interruptions

Alternatively, you can use an ETL tool of your choice, like Arcion or others, to perform the above operation. You can push the extracted files to cloud storage using cloud-native CLIs or managed services, or use any open source/third-party ingestion tool.

Once the extracted load-ready files, in CSV or text format, have landed in cloud storage, you can use Databricks Auto Loader for automatic incremental ingestion; it will take care of the historical data ingestion.

Diagram illustrating the transfer of data from Teradata to AWS S3
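Once the TPT-extracted files have landed in cloud storage, the historical load into Delta can be as simple as the following minimal Auto Loader sketch. The bucket paths, schema/checkpoint locations, and target table name are placeholders.

```python
# Minimal sketch: incrementally ingest extracted CSV files from cloud storage
# into a Bronze Delta table with Auto Loader. All paths/names are placeholders.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/sales")
    .option("header", "true")
    .load("s3://my-bucket/teradata-extracts/sales/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/sales")
    .trigger(availableNow=True)          # process existing files, then stop
    .toTable("bronze.sales"))
```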

From an incremental-load standpoint, you must ensure that the ingestion process pushes the source data to the cloud storage location for all the tables in scope. We recommend ingesting directly from the original source systems to decouple the ingestion pipeline and remove the dependency on Teradata as a source system. Usually this is a CDC source, which is handled by ingestion tools like Fivetran (HVR), Airbyte, Debezium, Azure Data Factory, AWS DMS, Arcion or others, depending on your choice of ingestion tooling and source system(s). Existing logic that uses MultiLoad (MLoad), TPT or FastLoad scripts to perform incremental loads into Teradata today can be handled with MERGE INTO functionality in Databricks, as sketched below.
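A minimal sketch of the MERGE INTO pattern for applying incremental/CDC records follows; the source/target table names and the op change-flag column are illustrative assumptions, not part of any original pipeline.

```python
# Minimal sketch: apply a batch of CDC/incremental records to a Delta table
# with MERGE INTO. Table and column names are illustrative placeholders.
spark.sql("""
    MERGE INTO silver.customers AS t
    USING bronze.customers_updates AS s
      ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```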

 

2.2 Code conversion and pipeline development

When converting code from your Teradata Warehouse environment to Databricks, the primary goal is to leverage automated methods as much as possible. The conversion of the Teradata logic and functionality using one of our migration tooling ISV partners or a BrickBuilder solution simplifies and accelerates the modernization effort to a large extent. As a best practice for migration, we recommend that you group related code belonging to a data application end-to-end or subject area together and trace it backwards from the reporting layer to the base tables.

Migrating code that has accumulated over the years may seem intimidating and daunting. Let's break it down into the five major categories listed below and explore each area in more detail to approach the code migration systematically:

  • Data Type Conversion
  • Table DDLs
  • Table DMLs
  • BTEQ Scripts
  • Stored Procedures

Teradata has its own dialect of SQL but conforms closely to the ANSI SQL that Databricks adheres to. Below are indicative guidelines for code conversion between Teradata and Databricks:

1. Data Type Conversion

The conversion of SQL data types from Teradata to Databricks is straightforward, due to the ANSI compliance of Databricks SQL. DDL statements and scripts in Teradata can be ported over to Databricks seamlessly, with most source data types retained.

On certain occasions, type promotion will occur: casting a type into another type of the same type family that contains all possible values of the original type. To illustrate with an example, TINYINT has a range from -128 to 127, and all its possible values can be safely promoted to INTEGER. For a full list of supported SQL data types in Databricks and their type precedence during the conversion process, refer to the link here and our release notes.

A graphical representation of the type precedence hierarchy
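To see type promotion in action, you can inspect result types with the built-in typeof() function. The following is a small illustrative snippet; the expected output is noted as an assumption in the comment.

```python
# Minimal sketch: observe type promotion in Databricks SQL with typeof().
# Adding a TINYINT to an INT promotes the result to INT.
spark.sql("""
    SELECT typeof(CAST(1 AS TINYINT))                  AS tinyint_type,
           typeof(CAST(1 AS TINYINT) + CAST(1 AS INT)) AS promoted_type
""").show()
# Expected output (assumption): tinyint_type = 'tinyint', promoted_type = 'int'
```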

2. Table DDLs using Identity Columns

Identity columns are GA (Generally Available) in Databricks Runtime 10.4 and above. With identity columns, data warehousing workloads that depend on auto-generated surrogate keys can take full advantage of the lakehouse architecture.

An example of a Teradata DDL statement converted to its Databricks equivalent
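As a rough approximation of the kind of conversion shown above, the sketch below creates a Delta table with an identity column in place of a Teradata identity column; the table and column names are hypothetical.

```python
# Minimal sketch: a Delta table with an identity column as the surrogate key.
# Table, schema, and column names are placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders (
        order_id   BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
        order_date DATE,
        amount     DECIMAL(18, 2)
    )
    USING DELTA
""")
```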

3. Table DMLs and Function Substitutions

Databricks SQL (DB SQL) supports many standard SQL functions; hence, the most commonly used SQL functions in Teradata are interoperable with DB SQL code without refactoring. Any Teradata functions not supported natively in DB SQL can be handled using User-Defined Functions (UDFs). This link contains an alphabetically ordered list of built-in functions and operators in Databricks.
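As an illustration of the UDF approach, the sketch below emulates Teradata's INDEX(string, substring) with a SQL UDF built on the Databricks instr() function; the function name td_index is a hypothetical helper, not a built-in.

```python
# Minimal sketch: substitute a Teradata-specific function with a SQL UDF.
# td_index is a hypothetical helper wrapping the built-in instr() function.
spark.sql("""
    CREATE OR REPLACE FUNCTION td_index(s STRING, sub STRING)
    RETURNS INT
    RETURN instr(s, sub)
""")

spark.sql("SELECT td_index('databricks', 'bricks') AS pos").show()  # pos = 5
```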

4. BTEQ Scripts

If you have BTEQ scripts, you must convert them into SQL-based logic wrapped in Python and import them into your Databricks workspace as notebooks. A quick summary of the five most common BTEQ commands and their Databricks equivalents is shown below, followed by a short conversion sketch:

| # | Teradata BTEQ Command | Databricks Equivalent |
|---|---|---|
| 1 | IMPORT | COPY INTO |
| 2 | EXPORT | INSERT OVERWRITE DIRECTORY |
| 3 | RUN | dbutils.notebook.run |
| 4 | IF THEN | Python if block |
| 5 | IF, ELSEIF, ELSE, ENDIF | Python if…elif…else block |
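The sketch below shows what a simple BTEQ-style conditional load might look like after conversion to a Python notebook; the table names, file path, and alert notebook are placeholders.

```python
# Minimal sketch: a BTEQ-style conditional load rewritten as Python + SQL in a
# Databricks notebook. Table and path names are illustrative placeholders.
row_count = spark.table("bronze.daily_sales").count()

if row_count > 0:                      # BTEQ IF ... THEN equivalent
    spark.sql("""
        COPY INTO silver.daily_sales   -- BTEQ IMPORT equivalent
        FROM 's3://my-bucket/teradata-extracts/daily_sales/'
        FILEFORMAT = CSV
        FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    """)
else:
    dbutils.notebook.run("/Shared/alerts/no_data", 300)   # BTEQ RUN equivalent
```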

5. Stored Procedures

Stored procedures in Teradata can be converted into Python User-Defined Functions (UDFs) in Databricks. The recommended approach is to leverage the automated code conversion tools mentioned above to accelerate this process. Following is a high-level summary of how most auto-conversion tools handle Teradata stored procedures and their equivalent functionality in Databricks:

  • CREATE Stored Procedure statements from the input code are converted to Databricks notebooks that use Python and SQL in the output
  • Each stored procedure maps to an equivalent Databricks notebook
  • CALL Stored Procedure statements are converted to equivalent dbutils.notebook.run calls with appropriate parameter serialization and return value deserialization

An example of a Teradata Stored Procedure converted to Databricks
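The caller/callee pattern produced by most conversion tools looks roughly like the sketch below; the notebook path, parameter, and return payload are hypothetical.

```python
# Minimal sketch of the caller/callee pattern for a converted stored procedure.
import json

# Caller: the Teradata CALL statement becomes dbutils.notebook.run, with
# parameters serialized as strings and the result deserialized from JSON.
result = dbutils.notebook.run(
    "/Repos/edw/procs/sp_load_daily_sales",   # converted stored procedure
    3600,                                     # timeout in seconds
    {"run_date": "2023-01-01"},
)
rows_loaded = json.loads(result)["rows_loaded"]

# Callee (the converted stored procedure): IN parameters become widgets and
# the OUT value is returned via dbutils.notebook.exit, e.g.:
#   run_date = dbutils.widgets.get("run_date")
#   dbutils.notebook.exit(json.dumps({"rows_loaded": count}))
```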

The tabular matrix below summarizes specific Stored Procedure functionality in Teradata and how to migrate its features into Databricks:

| # | Teradata Stored Procedure Construct | Migration Process / Equivalent Component in Databricks |
|---|---|---|
| 1 | SQL statements | Stored procedures contain SQL statements that undergo the same conversion rules to Databricks as described in this blog's SQL conversion section |
| 2 | Parameters | Parameters are converted to Python notebook parameters through Databricks' widgets functionality; data type conversion from Teradata SQL types to Python types is taken care of in the conversion process |
| 3 | Variable declarations | Converted to Python variables with appropriate data type conversions |
| 4 | IF THEN | Converted to a Python if block |
| 5 | IF, ELSEIF, ELSE, ENDIF | Converted to a Python if…elif…else block |
| 6 | CASE | Converted to a Python if…elif…else block |
| 7 | CALL statements | Converted to dbutils.notebook.run calls with appropriate parameter serialization and return value deserialization |

You can also share context between different tasks using task values if you want to break a large piece of code up logically and leverage Databricks Workflows effectively, as in the sketch below.
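A minimal sketch of sharing context between Workflow tasks with task values follows; the task key and value key are hypothetical.

```python
# Minimal sketch: pass a small value between Workflow tasks with task values.
# Task and key names are hypothetical.

# In an upstream task (e.g., task key "extract"):
dbutils.jobs.taskValues.set(key="rows_extracted", value=42)

# In a downstream task, read the value set by the "extract" task.
rows = dbutils.jobs.taskValues.get(
    taskKey="extract", key="rows_extracted", default=0, debugValue=0
)
print(rows)
```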

2.3 Data modeling

Apart from the code, if you are worried about migrating your custom data model from Teradata, Databricks supports all data modeling paradigms, so you can use your model as-is on the lakehouse. Data modelers and architects can quickly re-engineer or reconstruct databases and their underlying tables or views on Databricks. You can leverage tools like erwin Data Modeler with the Databricks Lakehouse Platform to serve these needs and reverse engineer the existing model to fast-track the migration to Databricks. We recommend that you follow our blogs for data modeling best practices.

Step 3: Deployment

Now that you have converted your core Teradata logic into Databricks equivalent, you are ready for deployment. There are several best practices of which you should be aware.

3.1 Workspace setup

When designing your workspace, there are various options to set up logical boundaries based on your existing data model, governance model and enterprise architectural design decisions:

  1. Separate workspaces per environment - e.g., dev, stage, and prod. We also expect this to align with the logical separation of the corresponding schemas for the data.
  2. Isolation by line of business (LOB), with LOB-based dev, stage, and prod workspaces within each LOB. You could also have sub-LOBs within each LOB and, within those, different value streams or projects to isolate ownership. This could be aligned with the cloud account setup as well. This strategy works well for enterprises bound by privacy and regulatory requirements.
  3. Separate workspaces for each team that owns the data. This, in turn, lets each team control the data it produces and helps make data ownership evident. It works well for enterprises that want to implement a data mesh architecture.

For more information, we recommend you follow these best practices on workspace setup.

Once the workspaces are set up, you can test your migrated workloads and deploy them into production. For CI/CD, you can use Databricks Repos and the best practices around it.

We typically help you perform a Total Cost of Ownership (TCO) analysis and consumption planning for the required Databricks Units (DBUs) to support these workloads from a budgeting standpoint. Please get in touch with your Databricks representative for this exercise.

3.2 Orchestration using Workflows

Traditionally, Teradata workloads are orchestrated using schedulers like Control-M, Autosys or similar tools with Unix-based wrapper scripts. Enterprises also embed the ELT logic as part of the enterprise ETL components, which push down the SQL logic on execution.

With Databricks, you can use Workflows out of the box to orchestrate anything, anywhere. Workflows come at no additional cost, and you can orchestrate any combination of notebooks, SQL, Spark, ML models, etc., as a Jobs workflow, including calls to other systems. These Workflows can be scheduled using the Databricks scheduler.

As part of the migration, you can modernize your Teradata workloads on Databricks and, optionally, eliminate licensed scheduling tools entirely in favor of the modern data stack. For example, if you have converted your BTEQ scripts into notebooks, you can orchestrate them as tasks in a Workflow, with the required dependencies, to deploy the end-to-end pipeline, as in the sketch below.
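As one option, you can define such a Workflow programmatically with the Jobs API. The sketch below is a minimal example using the REST endpoint; the workspace URL, token, notebook paths, cluster ID, and schedule are placeholders, and you could equally create the same job through the Workflows UI, the Databricks SDK, or the Terraform provider.

```python
# Minimal sketch: create a two-task Workflow over converted notebooks with the
# Jobs API (2.1). Host, token, paths, and cluster id are placeholders.
import requests

payload = {
    "name": "daily-sales-pipeline",
    "tasks": [
        {
            "task_key": "load_sales",
            "existing_cluster_id": "1234-567890-abcde123",
            "notebook_task": {"notebook_path": "/Repos/edw/bteq/load_sales"},
        },
        {
            "task_key": "build_kpis",
            "depends_on": [{"task_key": "load_sales"}],
            "existing_cluster_id": "1234-567890-abcde123",
            "notebook_task": {"notebook_path": "/Repos/edw/bteq/build_kpis"},
        },
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # run daily at 02:00
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
print(resp.json())   # contains the new job_id
```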

3.3 Data validation and user acceptance testing

To deploy workloads successfully into production, you will need to plan for data validation by the end users and business analyst teams. These teams typically compare row counts and summaries of key attributes or metrics for the tables in scope by running their SQL models on both Teradata and Databricks. We recommend keeping an overlapping window during which the two systems run side by side for validation. Once the teams sign off on User Acceptance Testing (UAT), you can plan a cutover for all the related workloads. These validation capabilities are usually a subset of the BrickBuilder solutions or migration tooling from ISV partners and can easily be automated for ease of use and an accelerated journey.
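A very simple, hand-rolled version of such a validation (row-count comparison only) might look like the sketch below; it assumes the Teradata-side counts have been staged into a small reference table named migration.td_row_counts, which is a hypothetical name.

```python
# Minimal sketch: compare row counts for migrated tables against counts
# captured from Teradata and staged in a reference table (hypothetical name).
expected = {
    row["table_name"]: row["td_row_count"]
    for row in spark.table("migration.td_row_counts").collect()
}

for table_name, td_count in expected.items():
    dbx_count = spark.table(table_name).count()
    status = "OK" if dbx_count == td_count else "MISMATCH"
    print(f"{table_name}: Teradata={td_count}, Databricks={dbx_count} -> {status}")
```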

Another critical factor during user acceptance testing is meeting the performance SLAs. Migrating to Databricks gives you best-in-class performance at a much lower TCO, as Databricks uses the Photon engine, which provides high-speed query performance at low cost for all types of workloads directly on top of the lakehouse.

For more details, please visit this blog post on the data warehousing performance record.
To ensure you get the best performance, we recommend that you follow the Delta optimization best practices; a few common maintenance commands are sketched below.
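```python
# Minimal sketch: common Delta maintenance commands on a migrated table.
# Table and column names are placeholders.
spark.sql("OPTIMIZE silver.daily_sales ZORDER BY (sale_date)")     # compact and co-locate data
spark.sql("ANALYZE TABLE silver.daily_sales COMPUTE STATISTICS")   # statistics for the optimizer
spark.sql("VACUUM silver.daily_sales")                             # clean up unreferenced files
```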

3.4 Data governance strategy

Databricks brings fine-grained governance and security to lakehouse data with Unity Catalog. Unity Catalog allows organizations to manage fine-grained data permissions using standard ANSI SQL or a simple UI, enabling them to unlock their lakehouse for consumption safely. It works uniformly across clouds and data types.
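For example, granting a group read access to a migrated table takes just a few ANSI SQL statements; the catalog, schema, table, and group names below are placeholders.

```python
# Minimal sketch: fine-grained permissions with standard ANSI SQL in Unity
# Catalog. Catalog, schema, table, and group names are placeholders.
spark.sql("GRANT USE CATALOG ON CATALOG edw TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA edw.silver TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE edw.silver.daily_sales TO `data_analysts`")
```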

Unity Catalog moves beyond managing tables to other data assets, such as machine learning models and files. As a result, enterprises can simplify how they govern all their data and AI assets. It is a critical architectural tenet for enterprises and one of the key reasons customers migrate to Databricks instead of using a traditional data warehousing platform.

As part of the migration, you can easily move your Teradata-based access controls over to Databricks using Unity Catalog.

3.5 Repointing BI workloads

One of the critical requirements of a successful Teradata to Databricks migration is ensuring business continuity, enabling adoption and alleviating any downstream impacts. Databricks has validated integrations with your favorite BI tools, including Databricks Dashboards, Power BI, Tableau, Redash, Preset, AWS Quicksight, Looker and others, allowing you to work with data through Databricks SQL warehouses. The general norm for a given set of reports for a given KPI is to ensure all the upstream tables and views are migrated, along with their associated workloads and dependencies.

Assuming the metadata is migrated to Unity Catalog, you could use the following approach for seamlessly repointing reports, as applicable. Let's assume the new tables or views under UAT carry the suffix _delta. Once the tables/views in scope have completed UAT and the associated pipelines have been migrated, rename the existing Teradata tables/views with a suffix (e.g., _td) and rename the new tables/views (e.g., _delta) to the current table or view names, as sketched below. This approach ensures that end users do not have to refactor table or view names within their SQL models or reports, and existing reports can be migrated using your automated solution with minimal syntax changes. Note: you could also follow an approach with separate database/schema names maintained for the lakehouse, as dictated by your data strategy team.
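On the Databricks side, the cutover rename itself is a single statement per object, as in the minimal sketch below; the three-level name is a placeholder.

```python
# Minimal sketch (Databricks side only) of the rename-based cutover described
# above. The three-level table name is an illustrative placeholder.
spark.sql("""
    ALTER TABLE edw.silver.daily_sales_delta
    RENAME TO edw.silver.daily_sales
""")
```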

Once you have migrated the first set of KPI dashboards or reports, you can iterate through the remainder of the reporting layer and its migration.

Summary

A seamless migration is an important step to ensure the success of your business outcomes. In the above blog sections, we walked you through the important aspects of completing your migration journey.

Next steps

Many enterprises today are running a hybrid architecture — data warehouses for business analytics and data lakes for machine learning. But with the advent of the data lakehouse, you can now unify both on a single modern platform. The Databricks Lakehouse Platform overcomes traditional MPP data warehouse limitations because it is designed to manage all types of data - structured, semi-structured, and unstructured - and supports traditional BI workloads and Machine Learning / AI natively. It adds all this functionality to your data lake, creating a unified, single, and multicloud platform.

Migrating your Teradata environment to Databricks delivers significant business benefits, including

  • Reduction of operational costs
  • Increased productivity of your data teams
  • Unlocking advanced analytics use cases while retaining full data warehouse capabilities

Please see this EDW Migration page for more information about migrating to Databricks, and schedule a customized Teradata migration assessment by reaching out to your Databricks representative.

Try Databricks free for 14 days.

