Many large enterprises have used Teradata data warehouses for years, but the storage and processing costs of on-premises infrastructure severely restricted who could use the resource and for what purposes. On top of that, an upgrade is a long process, and Teradata has to ship customers the hardware and install it in the customer's data center in the event of an outage.
Migrating your legacy Teradata data warehouse to the Databricks Lakehouse Platform can accelerate your data modernization journey. However, it can seem complex and daunting, requiring a well-orchestrated and deliberate effort. During the initial scoping phase, you may discover that your organization has accumulated hundreds of jobs and thousands of SQL scripts over the years.
But fear not! Enterprise customers like Walgreens, Sam's Club and many others have successfully migrated their legacy Teradata data warehouses to Databricks, enabling them to save millions in infrastructure costs while at the same time accelerating innovation and productivity. This blog post presents a systematic approach to accelerate your journey toward adopting the lakehouse, encapsulated in a three-phase framework, the 3Ds: Discovery, Development, and Deployment.
If you are wondering how Databricks is different from Teradata, the summary matrix below illustrates how the Databricks Lakehouse Platform compares to a traditional data warehouse like Teradata:
Capabilities comparison
| Capability | Databricks | On-Prem Teradata |
|---|---|---|
| Data format | Delta (open source) | Proprietary |
| Data types | Structured, Semi-structured, Unstructured | Structured, Semi-structured |
| Languages supported | SQL, Python, Scala, R | SQL only |
| Use cases | BI, SQL, Machine Learning / Data Science, Real-Time Streaming | BI, SQL |
| Reliability | High-quality, reliable data with ACID transactions | High-quality, reliable data with ACID transactions |
| Governance | Fine-grained access control for tables, rows, columns with Unity Catalog | Fine-grained access control for tables, rows, columns |
| Architectural paradigm | Lakehouse Platform | Traditional Data Warehouse |
| Licensing model | Consumption based | Annual subscription + additional feature(s) + support/maintenance + upgrade costs |
Step 1: Discovery
Profile the Teradata Environment
The first step in the migration process is to comprehensively understand your Teradata environment to determine the overall scope and effort required for the initiative. Below are two key questions you will want to answer about your Teradata deployment:
Question #1: What does my Teradata topology look like in terms of hardware specifications? (e.g., storage utilization, system utilization, warehouse object information, query types)
To answer this question, extracting and analyzing syslogs from your Teradata warehouse environment is a good place to start. To accelerate this process, the Databricks migrations team has developed a Teradata Profiler tool. The profiler uses Teradata's system tables and a service called PDCR (Performance Data Collection and Reporting), a data utility that provides an understanding of system performance, workload utilization, and management. This migration assessment utility helps you automatically discover and profile the existing Teradata warehouse environment. In addition, the tool also helps with DBU (Databricks Unit) forecasting. The screenshot below illustrates the key insights generated by the Teradata Profiler dashboard (for more information on running the profiler, please get in touch with your Databricks representative):

Question #2: How many workloads do I need to migrate to Databricks? How simple or complex are my environment's jobs and code patterns?
To answer this second question, you need to review the actual code, scripts, data warehouse objects, and jobs accumulated in your Teradata warehouse environment and create a summary inventory of these artifacts. To accelerate this analysis, Databricks highly recommends using a code profiler or analyzer (e.g., BladeBridge, LeapLogic, DataMetica, etc.) or solution accelerators built by one of our BrickBuilder Migration Solution partners. These tools typically provide detailed reports of what is inside your environment and break down data pipelines and jobs into various buckets based on complexity. This allows you to scope out the effort required for the migration initiative and any code refactoring that may be necessary along the way.
In addition to analyzing job complexity, these analyzer tools produce several useful summaries, including an itemized inventory of assets and artifacts in the Teradata environment (a do-it-yourself inventory sketch follows this list); examples include:
- SQL scripts
- Dynamic SQL
- Functions
- Referenced Objects
- Program-Object Cross Reference
- Total Lines of Code
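If you want a quick first-pass inventory before engaging one of these tools, the minimal sketch below queries Teradata's DBC.TablesV dictionary view through the open source teradatasql Python driver; the host, credentials, and database filter are placeholders.

```python
# Hedged sketch: count Teradata objects by kind as a rough inventory.
# Assumes the open source `teradatasql` driver (pip install teradatasql) and
# read access to the DBC.TablesV dictionary view; connection details are placeholders.
import teradatasql

with teradatasql.connect(host="<td-host>", user="<user>", password="<password>") as con:
    with con.cursor() as cur:
        cur.execute(
            """
            SELECT TableKind, COUNT(*) AS object_count
            FROM DBC.TablesV
            WHERE DatabaseName = ?
            GROUP BY TableKind
            """,
            ["SALES_DW"],  # placeholder database to profile
        )
        # Common TableKind codes include T = table, V = view, P = stored procedure, M = macro
        for kind, count in cur.fetchall():
            print(f"{kind}: {count}")
```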

The analyzer provides an excellent understanding of your Teradata warehouse environment by auto-analyzing the code and scripts, so you can perform a detailed migration assessment and effort estimation. You are now ready to embark on the next step of your migration journey!
Step 2: Development
Now that you have assessed your Teradata workloads in the discovery step, the next step is the actual migration of historical data and associated workloads to the Databricks Lakehouse Platform. This section will walk you through the development activities to achieve that.
2.1 Data Warehouse Extraction
To get started with data migration, the Teradata Parallel Transporter (TPT) is a client utility that provides scalable, high-speed, parallel data extraction and loading. Using TPT, you can extract the data from all your tables in Teradata at scale and then push the data files into cloud object stores such as AWS S3, Azure Data Lake Storage, or Google Cloud Storage. Using TPT to unload data offers several important benefits, listed below:
- Ability to define field delimiters, date formats, and encoding type
- Control over resource allocation for data unloading to achieve better performance
- Ability to define the number of generated output files and their corresponding file type
- Support for checkpointing and resume operations in case of failures and interruptions
Alternatively, you can use an ETL tool of your choice to perform the above operation as well. You can push the extracted files to cloud storage using cloud-native CLIs or managed services, or use any open source/third-party ingestion tools.
Once the extracted load-ready files in CSV or text formats have landed in cloud storage, you can use Databricks Auto Loader for automatic incremental ingestion. It will handle the historical data ingestion; a minimal sketch is shown below.
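For illustration, a minimal Auto Loader stream over the landed extract files might look like the sketch below; the bucket paths, schema/checkpoint locations, and target table are placeholders, not prescriptive names.

```python
# Hedged sketch: incrementally ingest TPT-extracted CSV files with Auto Loader
# into a bronze Delta table (all paths and names below are placeholders).
from pyspark.sql.functions import current_timestamp

(spark.readStream
    .format("cloudFiles")                                                   # Auto Loader source
    .option("cloudFiles.format", "csv")                                     # extracts landed as CSV/text
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/orders")  # schema tracking location
    .option("header", "true")
    .load("s3://my-bucket/teradata-extracts/orders/")                       # landing path for this table
    .withColumn("_ingested_at", current_timestamp())                        # simple audit column
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/orders")
    .trigger(availableNow=True)                                             # drain what has landed, then stop
    .toTable("bronze.orders"))                                              # target Delta table
```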

From an incremental load standpoint, you need to ensure that the ingestion process pushes the files to the cloud storage location for all the tables in scope. Typically this is a CDC feed, which is taken care of by ingestion tools like Fivetran (HVR), Airbyte, Debezium, Azure Data Factory, AWS DMS or others, depending on your choice of ingestion tooling and source system(s). In the case of existing logic using MLoad, TPT or FastLoad scripts, where you are performing incremental loads into Teradata directly, that can be handled with MERGE INTO functionality in Databricks.
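As a hedged example, an incremental load that previously ran as an MLoad/FastLoad script against Teradata can usually be expressed as a MERGE into the corresponding Delta table; the table names, key column, and operation-flag column below are illustrative.

```python
# Hedged sketch: apply a staged incremental/CDC batch to a Delta table with MERGE INTO
# (table names, key, and the `op` flag column are placeholders for your own CDC conventions).
spark.sql("""
  MERGE INTO silver.customers AS t
  USING bronze.customers_updates AS s
  ON t.customer_id = s.customer_id
  WHEN MATCHED AND s.op = 'D' THEN DELETE
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```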
2.2 Code conversion and pipeline development
When converting code from your Teradata warehouse environment to Databricks, the primary goal is to leverage automated methods as much as possible. Converting the Teradata logic and functionality using one of our migration tooling ISV partners or a BrickBuilder solution simplifies and accelerates the modernization effort to a large extent. As a best practice for migration, we recommend that you group related code belonging to an end-to-end data application or subject area together and trace it backwards from the reporting layer to the base tables.
Migrating code that has accumulated over time may seem like an intimidating and daunting task. Let's break it down into the five major categories listed below and explore each area in more detail to approach the code migration systematically:
- Data Type Conversion
- Table DDLs
- Table DMLs
- BTEQ Scripts
- Stored Procedures
Teradata has its own dialect of the SQL language but conforms closely to the ANSI SQL that Databricks adheres to. Below are indicative guidelines for code conversion between Teradata and Databricks:
1. Data Type conversion
The conversion of SQL data types from Teradata to Databricks is straightforward, thanks to the ANSI compliance of Databricks SQL. DDL statements and scripts in Teradata can be ported over to Databricks seamlessly, with most source data types being retained.
On certain occasions, type promotion will take place, which is the process of casting a type into another type of the same type family that contains all possible values of the original type. To illustrate with an example, TINYINT has a range from -128 to 127, and all its possible values can be safely promoted to INTEGER. For a full list of supported SQL data types in Databricks and their type precedence during the conversion process, kindly refer to the link here and our release notes.

2. Table DDLs using Identity Columns
Identity Columns are now GA (Generally Available) in Databricks Runtime 10.4 and beyond. With identity columns, you can now enable all your data warehousing workloads to have the full benefits of a lakehouse architecture.
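A hedged example of a converted DDL that replaces a Teradata GENERATED ... AS IDENTITY column is shown below; the schema, table, and column names are illustrative.

```python
# Hedged sketch: Delta table with an identity column for surrogate keys (DBR 10.4+);
# all names are placeholders.
spark.sql("""
  CREATE TABLE IF NOT EXISTS silver.dim_customer (
    customer_sk   BIGINT GENERATED ALWAYS AS IDENTITY,  -- surrogate key, auto-assigned
    customer_id   STRING,
    customer_name STRING,
    load_ts       TIMESTAMP
  ) USING DELTA
""")
```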

3. Table DMLs and Function Substitutions
Databricks SQL (DB SQL) supports many standard SQL functions; hence, the most commonly used SQL functions in Teradata are also interoperable with DB SQL code without requiring refactoring. Any Teradata functions not supported in native DB SQL can be handled using User-Defined Functions (UDFs). This link contains an alphabetically ordered list of built-in functions and operators in Databricks.
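For example, if a Teradata function such as NULLIFZERO has no equivalent built-in on your runtime, a small SQL UDF can act as a drop-in substitute so migrated expressions keep the original function name; this is a sketch under that assumption, not a required step.

```python
# Hedged sketch: SQL UDF emulating Teradata's NULLIFZERO (only needed if your
# runtime does not already provide an equivalent built-in function).
spark.sql("""
  CREATE OR REPLACE FUNCTION nullifzero(x DOUBLE)
  RETURNS DOUBLE
  RETURN nullif(x, 0)
""")

# Example use in migrated DML (table/column names are placeholders):
spark.sql("SELECT sales_amount / nullifzero(quantity) AS unit_price FROM silver.orders")
```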
4. BTEQ Scripts
If you have BTEQ scripts, you will need to convert them into SQL-based logic wrapped in Python and import them into your Databricks workspace environment as notebooks. A quick summary of the top five most common BTEQ commands and their equivalent converted form in Databricks is shown below:
| # | Teradata BTEQ Command | Databricks Equivalent |
|---|---|---|
| 1 | IMPORT | COPY INTO |
| 2 | EXPORT | INSERT OVERWRITE DIRECTORY |
| 3 | RUN | dbutils.notebook.run |
| 4 | IF THEN | Python if block |
| 5 | IF, ELSEIF, ELSE, ENDIF | Python if…elif…else block |
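As a hedged illustration of rows 3 and 4 above, a BTEQ script that checks an activity count and conditionally runs another script might convert to a notebook cell like the one below; the table name, notebook path, and parameters are placeholders.

```python
# Hedged sketch: BTEQ ".IF ACTIVITYCOUNT > 0 THEN .RUN FILE=..." converted to Python + SQL.
row_count = spark.sql("SELECT COUNT(*) AS cnt FROM bronze.daily_loads").first()["cnt"]

if row_count > 0:
    # BTEQ RUN  ->  dbutils.notebook.run(<path>, <timeout_seconds>, <parameters>)
    dbutils.notebook.run("/Repos/migration/load_summary", 3600, {"run_date": "2023-01-01"})
else:
    print("No new rows found; skipping downstream load")
```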
5. Stored Procedures
Stored procedures in Teradata can be converted into Python User-Defined Functions (UDFs) in Databricks. The recommended approach is to leverage the automated code conversion tools mentioned above to accelerate this process. The following is a high-level summary of how most auto-conversion tools handle Teradata Stored Procedures and their equivalent functionality in Databricks:
- CREATE Stored Procedure statements from the input code are converted to Databricks notebooks using Python and SQL in the output
- Each Stored Procedure maps to an equivalent Databricks notebook
- CALL Stored Procedure statements are converted to equivalent dbutils.notebook.run calls with appropriate parameter serialization and return value deserialization

The tabular matrix below summarizes specific Stored Procedure functionality in Teradata and how to migrate those features to Databricks:
| # | Teradata Stored Procedure Construct | Migration Process / Equivalent Component in Databricks |
|---|---|---|
| 1 | SQL Statements | Stored Procedures contain SQL statements that undergo the same conversion rules to Databricks as described in this blog's SQL conversion section |
| 2 | Parameters | Parameters are converted to Python notebook parameters through Databricks' widgets functionality. Data type conversion from Teradata SQL types to Python types is taken care of in the conversion process |
| 3 | Variable declarations | Converted to Python variables with appropriate data type conversions |
| 4 | IF THEN | Converted to a Python if block |
| 5 | IF, ELSEIF, ELSE, and ENDIF | Converted to a Python if…elif…else block |
| 6 | CASE | Converted to a Python if…elif…else block |
| 7 | CALL statements | Stored Procedure CALL statements are converted to dbutils.notebook.run calls with appropriate parameter serialization and return value deserialization. You can also share context between different tasks using task values if you want to break a large piece of code up logically and leverage Databricks Workflows effectively |
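To make rows 2 and 7 concrete, the hedged skeleton below shows what a converted procedure notebook and its caller might look like; the notebook path, widget name, and table are placeholders and not the output of any particular conversion tool.

```python
# --- converted "procedure" notebook (hedged sketch; all names are placeholders) ---
# Teradata IN parameters become notebook widgets with defaults.
dbutils.widgets.text("region", "EMEA")
region = dbutils.widgets.get("region")

# The procedure body becomes SQL cells or spark.sql calls.
spark.sql(f"UPDATE silver.accounts SET status = 'ACTIVE' WHERE region = '{region}'")

# Optionally publish values to downstream Workflow tasks (effective when run as a job task).
dbutils.jobs.taskValues.set(key="processed_region", value=region)

# The procedure's return value becomes the notebook exit value (a string).
dbutils.notebook.exit("SUCCESS")

# --- caller notebook, replacing: CALL activate_accounts('EMEA') ---
# result = dbutils.notebook.run("/Repos/migration/activate_accounts", 1800, {"region": "EMEA"})
```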
2.3 Data modeling
Apart from the code, if you are concerned about migrating your custom data model on Teradata, Databricks supports all data modeling paradigms, and you can use your model as-is on the lakehouse. Data modelers and designers can quickly re-engineer or reconstruct databases and their underlying tables or views on Databricks. You can leverage tools like erwin Data Modeler with the Databricks Lakehouse Platform to serve these needs and reverse engineer the existing model to fast-track the migration to Databricks. We recommend that you follow our blogs for data modeling best practices.
Step 3: Deployment
Now that you have converted your core Teradata logic into its Databricks equivalent, you are ready for deployment. There are several best practices you should be aware of.
3.1 Workspace setup
When designing your workspace, there are various options for setting up logical boundaries based on your existing data model, governance model and enterprise architectural design decisions:
- A three-workspace approach based on environments, e.g., dev, stage and prod. In addition, we expect this to align with the logical separation of the corresponding schemas for the data.
- Isolation by line of business (LOB), where each LOB has its own dev, stage and prod workspaces. You can also have sub-LOBs within each LOB and, within those, different value streams or initiatives to isolate ownership. This can be aligned with the cloud account setup as well. This strategy works well for enterprises bound by privacy and regulatory requirements.
- Separate workspaces for each team that owns the data. This, in turn, allows each team to control the data it produces and helps ensure data ownership is clear. It works well for enterprises that want to implement a data mesh architecture.
For more information, we recommend you follow these best practices on workspace setup.
Once the workspaces are set up, you can test your migrated workloads and deploy them into production. For CI/CD, you can use Databricks Repos and the best practices around it.
We typically help you perform a Total Cost of Ownership (TCO) analysis and consumption planning for the Databricks Units (DBUs) required to support these workloads from a budgeting standpoint. Please get in touch with your Databricks representative for this exercise.
3.2 Orchestration using Workflows
Traditionally, Teradata workloads are orchestrated using schedulers like Control-M, Autosys or similar tools with Unix-based wrapper scripts. Enterprises also embed the ELT logic as part of enterprise ETL components, which push down the SQL logic on execution.
With Databricks, you can use Workflows out of the box and orchestrate anything anywhere. Workflows come at no additional cost, and you can orchestrate any combination of notebooks, SQL, Spark, ML models, etc., as a Jobs workflow, including calls to other systems. These Workflows can be scheduled using the Databricks scheduler.
As part of the migration effort, you can modernize your Teradata workloads to Databricks and, optionally, eliminate licensed scheduling tools in favor of the modern data stack. For example, if you have converted your BTEQ scripts into notebooks, you can now orchestrate them as Tasks using Workflows with the required dependencies to deploy the end-to-end pipeline.
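As a hedged sketch, the snippet below defines a two-task job through the Jobs 2.1 REST API; the workspace URL, token, notebook paths, cluster id, and schedule are all placeholders, and the same workflow can equally be built in the Workflows UI, the Databricks CLI/SDK, or Terraform.

```python
# Hedged sketch: create a two-task Databricks Workflow via the Jobs 2.1 REST API.
# All identifiers below are placeholders.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

job_spec = {
    "name": "teradata-migration-daily-pipeline",
    "tasks": [
        {
            "task_key": "ingest_orders",
            "notebook_task": {"notebook_path": "/Repos/migration/ingest_orders"},
            "existing_cluster_id": "<cluster-id>",
        },
        {
            "task_key": "merge_orders",
            "depends_on": [{"task_key": "ingest_orders"}],  # run after ingestion succeeds
            "notebook_task": {"notebook_path": "/Repos/migration/merge_orders"},
            "existing_cluster_id": "<cluster-id>",
        },
    ],
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
print(resp.json())  # should contain the new job_id on success
```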
3.3 Data validation and user acceptance testing
To deploy workloads successfully into production, you will need to plan for data validation by the end users/business analyst teams. The business analyst teams take row counts and summaries of key attributes or metrics of the tables in scope and compare them by running their SQL models on Teradata and Databricks. We recommend that you maintain an overlapping window during which both systems run, so validations can happen side by side. Once the teams sign off on the completion of User Acceptance Testing (UAT), you can plan a cutover for all the related workloads. Usually, these capabilities are a subset of the BrickBuilder solutions or migration tooling ISV partners' offerings and can be easily automated for ease of use and an accelerated journey.
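A hedged sketch of a basic row-count comparison during the overlap window is shown below; it assumes the teradatasql driver is installed on the cluster, that the in-scope tables keep the same logical names on both systems, and that credentials and table names are placeholders.

```python
# Hedged sketch: side-by-side row-count validation between Teradata and Databricks.
import teradatasql

tables = ["sales.orders", "sales.customers"]  # illustrative in-scope tables

with teradatasql.connect(host="<td-host>", user="<user>", password="<password>") as con:
    with con.cursor() as cur:
        for t in tables:
            cur.execute(f"SELECT COUNT(*) FROM {t}")
            td_count = cur.fetchone()[0]
            dbx_count = spark.table(t).count()  # assumes the same logical name on Databricks
            status = "MATCH" if td_count == dbx_count else "MISMATCH"
            print(f"{t}: teradata={td_count}, databricks={dbx_count} -> {status}")
```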
Another important factor during user acceptance testing is meeting the performance SLAs. You will get best-in-class performance by migrating to Databricks, with a much lower TCO, as it uses the Photon engine, which provides high-speed query performance at a low cost for all types of workloads directly on top of the lakehouse.
For more details, please see this blog post on the data warehousing performance record.
To ensure you get the best performance, we recommend you follow the Delta optimization best practices.
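As a minimal illustration (table and column names are placeholders), routine Delta maintenance on a large migrated fact table might include:

```python
# Hedged sketch: compact files and co-locate data on frequently filtered columns,
# then refresh statistics for the optimizer (names are placeholders).
spark.sql("OPTIMIZE silver.sales_fact ZORDER BY (order_date, customer_id)")
spark.sql("ANALYZE TABLE silver.sales_fact COMPUTE STATISTICS")
```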
3.4 Data governance strategy
Databricks brings fine-grained governance and security to lakehouse data with Unity Catalog. Unity Catalog allows organizations to manage fine-grained data permissions using standard ANSI SQL or a simple UI, enabling them to safely open their lakehouse for broad consumption. It works uniformly across clouds and data types.
Unity Catalog goes beyond managing tables to cover other data assets, such as machine learning models and files. As a result, enterprises can simplify how they govern all their data and AI assets. It is an important architectural tenet for enterprises and one of the key reasons customers migrate to Databricks instead of using a traditional data warehousing platform.
In this case, you can easily migrate your Teradata-based access controls to Databricks using Unity Catalog.
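For instance, Teradata role-style grants often translate into a handful of Unity Catalog GRANT statements; the catalog, schema, table, and group names below are placeholders.

```python
# Hedged sketch: granting a business-analyst group read access with Unity Catalog.
spark.sql("GRANT USE CATALOG ON CATALOG sales TO `analysts`")
spark.sql("GRANT USE SCHEMA  ON SCHEMA  sales.silver TO `analysts`")
spark.sql("GRANT SELECT      ON TABLE   sales.silver.orders TO `analysts`")
```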
3.5 Repointing BI workloads
One of the essential requirements of a successful Teradata to Databricks migration is ensuring business continuity, enabling adoption and alleviating any downstream impacts. Databricks has validated integrations with your favorite BI tools, including Databricks Dashboards, Power BI, Tableau, Redash, Preset, AWS QuickSight, Looker and others, allowing you to work with data through Databricks SQL warehouses. The general norm for a given set of reports for a given KPI is to ensure all the upstream tables and views are migrated, along with their associated workloads and dependencies.
Assuming the metadata is migrated to Unity Catalog, we recommend the following approach for seamless repointing of reports, as applicable. Let us assume that the new tables or views under UAT have the suffix _delta. Once the tables/views in scope for UAT are completed and the associated pipelines are migrated, you should rename the existing Teradata tables/views with a suffix (e.g., _td) and rename the new tables/views (e.g., _delta) to the existing table or view names. This approach ensures that end users do not have to refactor the table or view names within their SQL models or reports, and existing reports can be migrated using your automated solution with minimal syntax changes. Note: You could also follow an approach with separate database/schema names maintained for the lakehouse, as dictated by your data strategy team.
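A minimal sketch of the rename-based cutover described above is shown below; it assumes both the legacy-named objects and the new _delta objects live in the same metastore, and all names are illustrative.

```python
# Hedged sketch: swap names at cutover so reports keep resolving the original object name.
spark.sql("ALTER TABLE sales.orders RENAME TO sales.orders_td")      # park the legacy-named object
spark.sql("ALTER TABLE sales.orders_delta RENAME TO sales.orders")   # promote the migrated Delta table
```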
Once you have migrated the primary set of KPI dashboards or reports, you can iterate through the remainder of the reporting layer and migrate it the same way.
Summary
A seamless migration is an important step in ensuring the success of your business outcomes. In the sections above, we walked you through the important aspects of completing your migration journey.
Next steps
Many enterprises today are running a hybrid architecture: data warehouses for business analytics and data lakes for machine learning. But with the advent of the data lakehouse, you can now unify both on a single modern platform. The Databricks Lakehouse Platform overcomes traditional MPP data warehouse limitations because it is designed to manage all types of data (structured, semi-structured, and unstructured) and supports traditional BI workloads and Machine Learning/AI natively. It adds all this functionality to your data lake, creating a unified, single, multicloud platform.
Migrating your Teradata environment to Databricks delivers significant business benefits, including:
- Reduction of operational costs
- Increased productivity of your data teams
- Unlocking advanced analytics use cases while retaining full data warehouse capabilities
Please see the EDW Migration page for more information on migrating to Databricks, and schedule a customized Teradata Migration assessment by reaching out to your Databricks representative.