Migrate a big information warehouse from Greenplum to Amazon Redshift utilizing AWS SCT

On this third submit of a multi-part collection, we discover among the edge circumstances in migrating a big information warehouse from Greenplum to Amazon Redshift utilizing AWS Schema Conversion Device (AWS SCT) and tips on how to deal with these challenges. Challenges embody how finest to make use of digital partitioning, edge circumstances for numeric and character fields, and arrays.

You may try the first submit of this collection for steerage on planning, working, and validating the migration. It’s also possible to try the second submit for finest practices for selecting the optimum Amazon Redshift cluster, information structure, changing saved procedures, appropriate capabilities and queries broadly used for SQL conversions, and proposals for optimizing the size of knowledge varieties for desk columns.

Unbounded character information sort

Greenplum helps creating columns as textual content and varchar with out specifying the size of the sector. This works with out a difficulty in Greenplum however doesn’t work properly in migrating to Amazon Redshift. Amazon Redshift shops information in columnar format and will get higher compression when utilizing shorter column lengths. Due to this fact, the Amazon Redshift finest apply is to make use of the smallest character size potential.

AWS SCT will convert these unbounded fields as giant objects (LOBs) as an alternative of treating the columns as character fields with a specified size. LOBs are applied otherwise in every database product available on the market, however usually, a LOB is just not saved with the remainder of the desk information. As an alternative, there’s a pointer to the placement of the information. When the LOB is queried, the database reconstitutes the information robotically for you, however this sometimes requires extra assets.

Amazon Redshift doesn’t help LOBs, so AWS SCT resolves this by loading the information into Amazon Easy Storage Service (Amazon S3) and within the column, it shops the S3 URL. When you have to retrieve this information, you need to question the desk, get the S3 URL, after which fetch the information from Amazon S3. This isn’t ideally suited as a result of more often than not, the precise most size of the sector doesn’t require it to be handled as a LOB, and storing the information remotely means it should take for much longer to fetch the information for queries.

The present decision is to calculate the utmost size of those columns and replace the Greenplum tables earlier than changing to Amazon Redshift with AWS SCT.

Word that in a future launch of AWS SCT, the gathering of statistics will embody calculating the utmost size for every column, and the conversion of unbounded varchar and textual content will set the size in Amazon Redshift robotically.

The next code is an instance of an unbounded character information sort:

CREATE TABLE public.mytable 
(description1 textual content NOT NULL PRIMARY KEY, description2 varchar);
CREATE INDEX ON public.mytable (description2);

This desk makes use of a main key column on an unbounded textual content column. This must be transformed to varchar(n), the place n is the utmost size discovered on this column.

Drop distinctive constraints on affected columns:

ALTER TABLE public.mytable DROP CONSTRAINT mytable_pkey;

Drop indexes on affected columns:

DROP INDEX public.mytable_description2_idx;

Calculate most size of affected columns:

choose coalesce(max(size(description1)), 10),
coalesce(max(size(description2)), 10) 
from public.mytable;

Word that on this instance, the description1 and description2 columns solely comprise NULL values, or the desk doesn’t have any information in it, or the calculated size of the columns is 10.

Alter the size of the affected columns:

ALTER TABLE public.mytable ALTER COLUMN description1 TYPE varchar(10);
ALTER TABLE public.mytable ALTER COLUMN description2 TYPE varchar(10);

Now you can proceed with utilizing AWS SCT to transform the Greenplum schema to Amazon Redshift and avoiding utilizing LOBs to retailer the column values.

GitHub assist

If in case you have many tables to replace and need an automatic resolution, you should utilize the add_varchar_lengths.sh script discovered within the GitHub repo to repair the entire unbounded varchar and textual content columns in a given schema in Greenplum. The script calculates the suitable most size after which alters the Greenplum tables so the varchar information sort is bounded by a size.

Please notice that the script additionally will drop any constraints or indexes on the affected columns.

Empty character information

Greenplum and Amazon Redshift help an empty string worth in a subject that’s completely different from NULL. The conduct is identical between the 2 databases. Nevertheless, AWS SCT defaults to transform empty strings to NULL. This merely must be disabled to keep away from issues.

In AWS SCT, open your mission, select Settings, Mission settings, and Knowledge migration.
Scroll to the underside and discover Use empty as null worth.
Deselect this in order that AWS SCT doesn’t convert empty strings to NULL.

NaN and Infinity numeric information sort

Greenplum helps NaN and Infinity in a numeric subject to signify an undefined calculation outcome and infinity. NaN could be very unusual as a result of when utilizing combination capabilities on a column with a NaN row, the outcome may even be NaN. Infinity can be unusual and never helpful when aggregating information. Nevertheless, you might encounter these values in a Greenplum database.

Amazon Redshift doesn’t help NaN and Infinity, and AWS SCT doesn’t test for this in your information. In case you do encounter this when utilizing AWS SCT, the duty will fail with a numeric conversion error.

To resolve this, it’s advised to make use of NULL as an alternative of NaN and Infinity. This lets you combination information and get outcomes aside from NaN and, importantly, help you convert the Greenplum information to Amazon Redshift.

The next code is an instance NaN numeric worth:

CREATE TABLE public.mytable2 (id int NOT NULL, quantity numeric NOT NULL);
INSERT INTO public.mytable2 VALUES (1, 10), (2, 'NaN'), (3, 20);

Drop the NOT NULL constraint:

ALTER TABLE public.mytable2 ALTER COLUMN quantity DROP NOT NULL;

Replace the desk:

UPDATE public.mytable2 SET quantity = NULL the place quantity="NaN";

Now you can proceed with utilizing AWS SCT emigrate the Greenplum information to Amazon Redshift.

Word that in a future launch of AWS SCT, there can be an choice to convert NaN and Infinity to NULL so that you simply gained’t need to replace your Greenplum information emigrate to Amazon Redshift.

Digital partitioning on GP_SEGMENT_ID

For big tables, it’s really helpful to make use of digital partitioning to extract information from Greenplum. With out digital partitioning, AWS SCT will run a single question to unload information from Greenplum. For instance:

SELECT * FROM store_sales;

If this desk could be very giant, it should take a very long time to extract the information as a result of this can be a single course of querying the information. With digital partitioning, a number of queries are run in parallel in order that the extraction of knowledge is accomplished quicker. It additionally makes it simpler to get better if there is a matter with the duty.

Digital partitioning could be very versatile, however a easy approach to do that in Amazon Redshift is to make the most of the Greenplum hidden column gp_segment_id. This column identifies which phase in Greenplum has the information, and every phase ought to have an equal variety of rows. Due to this fact, creating partitions for every gp_segment_id is a simple technique to implement digital partitioning.

In case you’re not accustomed to the time period phase, it’s just like an Amazon Redshift slice.

For instance:

SELECT * FROM store_sales WHERE gp_segment_id = 0;
SELECT * FROM store_sales WHERE gp_segment_id = 1;
SELECT * FROM store_sales WHERE gp_segment_id = 2;
...

First, decide the variety of segments in Greenplum:

SELECT rely(*) 
FROM gp_segment_configuration 
WHERE content material >= 0 AND preferred_role="p";

Now you’ll be able to configure AWS SCT.

In AWS SCT, go to Knowledge Migration view (different) and select (right-click) a big desk.
Scroll all the way down to Add digital partitioning.
For the partition sort, select Auto Cut up and alter the column title to GP_SEGMENT_ID.
Use 0 for Begin worth, the variety of segments present in Step 1 as Finish worth, and Interval of 1.

Whenever you create an area activity to load this desk, the duty can have a sub-task for every gp_segment_id worth.

Word that in a future launch of AWS SCT, there can be an choice to robotically nearly partition tables primarily based on GP_SEGMENT_ID. This selection may even retrieve the variety of segments robotically.

Arrays

Greenplum helps arrays resembling bigint[] which might be unbounded. Sometimes, arrays are stored comparatively small in Greenplum as a result of arrays devour extra reminiscence in Greenplum than utilizing an alternate technique. Nevertheless, it’s potential to have a really giant array in Greenplum that isn’t supported by Amazon Redshift.

AWS SCT converts a Greenplum array to varchar(65535), but when the transformed array is longer than 65,535 characters, then the load will fail.

The next code is an instance of a giant array:

CREATE TABLE public.gross sales 
(sales_id int NOT NULL,
customer_id int NOT NULL,
sales_item_ids bigint[]) 
DISTRIBUTED BY (sales_id);

INSERT INTO public.gross sales values (1, 100, '{1,2,3}'), (2, 100, '{1,2,3}');

On this instance, the gross sales objects are saved in an array for every sales_id. In case you encounter an error whereas loading that the size is just too lengthy to load this information into Amazon Redshift with AWS SCT, then that is the answer. It’s additionally a extra environment friendly sample to retailer information in each Greenplum and Amazon Redshift!

Create a brand new gross sales desk that has all columns from the prevailing gross sales desk, however exclude the array column:
```
CREATE TABLE public.sales_new 
(sales_id int NOT NULL,
customer_id int NOT NULL) 
DISTRIBUTED BY (sales_id);
```

Populate the brand new gross sales desk with the prevailing information aside from the array column:

INSERT INTO public.sales_new (sales_id, customer_id) 
SELECT sales_id, customer_id FROM public.gross sales;

We create a brand new desk that may be a cross-reference of gross sales IDs with the gross sales objects. As an alternative of getting a single row for this affiliation, now there can be a row for every relationship.

Create a brand new gross sales merchandise desk:

CREATE TABLE public.sales_items 
(sales_id int NOT NULL, 
sales_item_id bigint NOT NULL) 
DISTRIBUTED BY (sales_id);

To unnest the array, create a row for every array aspect:

INSERT INTO public.sales_items 
(sales_id, sales_item_id) 
SELECT sales_id, unnest(sales_item_ids) 
FROM public.gross sales;

Rename the gross sales tables:

ALTER TABLE public.gross sales RENAME TO sales_old;
ALTER TABLE public.sales_new RENAME TO gross sales;

In AWS SCT, refresh the tables and migrate the revised gross sales and the brand new sales_items desk.

The next are some instance queries earlier than and after.

Earlier than:

SELECT s.sales_id, unnest(s.sales_item_ids) 
FROM public.gross sales s 
WHERE s.sales_id = 1;

After:

SELECT s.sales_id, i.sales_item_id 
FROM public.gross sales s 
JOIN public.sales_items i ON s.sales_id = i.sales_id 
WHERE s.sales_id = 1;

Earlier than:

SELECT s.sales_id 
FROM public.gross sales s 
WHERE s.customer_id = 100
AND 10 = ANY(s.sales_item_ids);

After:

SELECT s.sales_id
FROM public.gross sales s 
JOIN public.sales_items i ON s.sales_id = i.sales_id 
WHERE s.customer_id = 100
AND i.sales_item_id = 10;

VACUUM ANALYZE

Greenplum, like Amazon Redshift, helps the VACUUM command, which reclaims cupboard space after UPDATE and DELETE instructions are run on a desk. Greenplum additionally means that you can add the ANALYZE choice to run each statements with a single command.

The next code is the Greenplum command:

This isn’t quite common, however you’ll see this every so often. In case you’re simply inserting information right into a desk, there is no such thing as a must run VACUUM, however for ease of use, typically builders will use VACUUM ANALYZE.

The next are the Amazon Redshift instructions:

VACUUM desk;
ANALYZE desk;

Amazon Redshift doesn’t help including ANALYZE to the VACUUM command, so as an alternative, this must be two completely different statements. Additionally notice that Amazon Redshift performs VACUUM and ANALYZE robotically for you so typically, you’ll be able to take away these instructions out of your scripts fully.

DISTINCT ON question

Greenplum helps an uncommon shortcut for eliminating duplicates in a desk. This characteristic retains the primary row for every set of rows primarily based on the order of the information being fetched. It’s best to know by an instance:

CREATE TABLE buyer 
(customer_name varchar(100) not null, 
customer_address varchar(1000) not null, 
lastupdate timestamp not null);

INSERT INTO buyer VALUES
('ACME', '123 Predominant St', '2022-01-01'), 
('ACME', '456 Market St', '2022-05-01'), 
('ACME', '789 Broadway', '2022-08-01');

SELECT DISTINCT ON (customer_name) customer_name, customer_address 
FROM buyer 
ORDER BY customer_name, lastupdate DESC;

We get the next outcomes:

customer_name | customer_address 
---------------+------------------
 ACME          | 789 Broadway

The answer for working this in Amazon Redshift is to make use of the ANSI normal row_number() analytical perform, as proven within the following code:

SELECT sub.customer_name, sub.customer_address 
FROM (SELECT customer_name, customer_address, row_number() over (partition by customer_name ORDER BY lastupdate DESC) AS row_number FROM buyer) AS sub 
WHERE sub.row_number = 1;

Clear up

The examples on this submit create tables in Greenplum. To take away these instance tables, run the next instructions:

DROP TABLE IF EXISTS public.mytable;
DROP TABLE IF EXISTS public.mytable2;
DROP TABLE IF EXISTS public.gross sales;
DROP TABLE IF EXISTS public.sales_new;
DROP TABLE IF EXISTS public.sales_items;
DROP TABLE IF EXISTS public.buyer;

Conclusion

On this submit, we coated among the edge circumstances when migrating Greenplum to Amazon Redshift and tips on how to deal with these challenges, together with simple digital partitioning, edge circumstances for numeric and character fields, and arrays. This isn’t an exhaustive record of migrating Greenplum to Amazon Redshift, however this collection ought to allow you to navigate modernizing your information platform by transferring to Amazon Redshift.

For extra particulars, see the Amazon Redshift Getting Began Information and the AWS SCT Consumer Information.

In regards to the Authors

Jon Roberts is a Sr. Analytics Specialist primarily based out of Nashville, specializing in Amazon Redshift. He has over 27 years of expertise working in relational databases. In his spare time, he runs.

Nelly Susanto is a Senior Database Migration Specialist of AWS Database Migration Accelerator. She has over 10 years of technical expertise specializing in migrating and replicating databases together with information warehouse workloads. She is keen about serving to prospects of their cloud journey.

Suresh Patnam is a Principal BDM – GTM AI/ML Chief at AWS. He works with prospects to construct IT technique, making digital transformation via the cloud extra accessible by leveraging Knowledge & AI/ML. In his spare time, Suresh enjoys taking part in tennis and spending time together with his household.