In legacy relational database management systems, data is stored in a number of complex data types, such as XML, JSON, BLOB, or CLOB. This data might contain valuable information that is often difficult to transform into insights, so you might be looking for ways to load and use this data in a modern cloud data warehouse such as Amazon Redshift. One such example is migrating data from a legacy Oracle database with XML BLOB fields to Amazon Redshift, by performing preprocessing and conversion of XML using Amazon EMR. In this post, we describe a solution architecture for this use case, and show you how to implement the code to handle the XML conversion.
Solution overview
The first step in any data migration project is to capture and ingest the data from the source database. For this task, we use AWS Database Migration Service (AWS DMS), a service that helps you migrate databases to AWS quickly and securely. In this example, we use AWS DMS to extract data from an Oracle database with XML BLOB fields and stage that data in Amazon Simple Storage Service (Amazon S3) in Apache Parquet format. Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance, and is the storage of choice for building data lakes on AWS.
After the data is ingested into an S3 staging bucket, we use Amazon EMR to run a Spark job that converts the XML fields to semi-structured fields, and the results are loaded into a curated S3 bucket. The Amazon EMR runtime for Apache Spark can be over three times faster than clusters without the EMR runtime, and has 100% API compatibility with standard Apache Spark. This improved performance means your workloads run faster and you save on compute costs, without making any changes to your application.
Finally, the transformed and curated data is loaded into Amazon Redshift tables using the COPY command. The Amazon Redshift table structure should match the number of columns and the column data types in the source file. Because we stored the data as a Parquet file, we specify the SERIALIZETOJSON option in the COPY command. This allows us to load complex types, such as structure and array, into a column defined as SUPER data type in the table.
The following architecture diagram shows the end-to-end workflow.
In detail, AWS DMS migrates data from the source database tables into Amazon S3, in Parquet format. Apache Spark on Amazon EMR reads the raw data, transforms the XML data type into a struct data type, and saves the data to the curated S3 bucket. In our code, we use an open-source library, called spark-xml, to parse and query the XML data.
In the rest of this post, we assume that the AWS DMS tasks have already run and created the source Parquet files in the S3 staging bucket. If you want to set up AWS DMS to read from an Oracle database with LOB fields, refer to Effectively migrating LOB data to Amazon S3 from Amazon RDS for Oracle with AWS DMS or watch the video Migrate Oracle to S3 Data lake via AWS DMS.
Prerequisites
If you want to follow along with the examples in this post using your AWS account, we provide an AWS CloudFormation template you can launch by choosing Launch Stack:
Provide a stack name and leave the default settings for everything else. Wait for the stack to display Create Complete (this should only take a few minutes) before moving on to the other sections.
The template creates the following resources:
- A virtual private cloud (VPC) with two private subnets that have routes to an Amazon S3 VPC endpoint
- The S3 bucket {stackname}-s3bucket-{xxx}, which contains the following folders:
  - libs – Contains the JAR file to add to the notebook
  - notebooks – Contains the notebook to interactively test the code
  - data – Contains the sample data
- An Amazon Redshift cluster, in one of the two private subnets, with a database named rs_xml_db and a schema named rs_xml
- A secret (rs_xml_db) in AWS Secrets Manager
- An EMR cluster
The CloudFormation template shared in this post is for demonstration purposes only. Please conduct your own security review and incorporate best practices prior to any production deployment using artifacts from the post.
Finally, some basic knowledge of Python and Spark DataFrames can help you review the transformation code, but isn't mandatory to complete the example.
Understanding the sample data
In this post, we use college students' course and subjects sample data that we created. In the source system, the data consists of flat structure fields, like course_id and course_name, and an XML field that includes all the course material and subjects involved in the respective course. The following screenshot is an example of the source data, which is staged in an S3 bucket as a prerequisite step.
We can observe that the column study_material_info is an XML type field and contains nested XML tags in it. Let's see how to convert this nested XML field to a generic struct field in the next steps.
Run a Spark job in Amazon EMR to transform the XML fields in the raw data to Parquet
In this step, we use an Amazon EMR notebook, which is a managed environment for creating and opening Jupyter Notebook and JupyterLab interfaces. It enables you to interactively analyze and visualize data, collaborate with peers, and build applications using Apache Spark on EMR clusters. To open the notebook, follow these steps:
- On the Amazon S3 console, navigate to the bucket you created as a prerequisite step.
- Download the file in the notebooks folder.
- On the Amazon EMR console, choose Notebooks in the navigation pane.
- Choose Create notebook.
- For Notebook name, enter a name.
- For Cluster, select Choose an existing cluster.
- Select the cluster you created as a prerequisite.
- For Security Groups, choose BDB1909-EMR-LIVY-SG and BDB1909-EMR-Notebook-SG.
- For AWS Service Role, choose the role bdb1909-emrNotebookRole-{xxx}.
- For Notebook location, specify the S3 path in the notebooks folder (s3://{stackname}-s3bucket-{xxx}/notebooks/).
- Choose Create notebook.
- When the notebook is created, choose Open in JupyterLab.
- Upload the file you downloaded earlier.
- Open the new notebook.
The notebook should look as shown in the following screenshot, and it contains a script written in Scala.
- Run the first two cells to configure Apache Spark with the open-source spark-xml library and import the needed modules. The spark-xml package allows reading XML files in local or distributed file systems as Spark DataFrames. Although it is primarily used to convert (portions of) large XML documents into a DataFrame, spark-xml can also parse XML in a string-valued column of an existing DataFrame with the from_xml function, adding it as a new column with the parsed results as a struct.
- To do so, in the third cell, we load the data from the Parquet file generated by AWS DMS into a DataFrame, then we extract the attribute that contains the XML code (STUDY_MATERIAL_INFO), map it to a string, and infer its schema into a variable named payloadSchema.
- We can now use payloadSchema in the from_xml function to convert the field STUDY_MATERIAL_INFO into a struct data type and add it as a column named course_material in a new DataFrame, parsed.
- Finally, we can drop the original field and write the parsed DataFrame to our curated zone in Amazon S3 (see the sketch after this list).
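The following Scala sketch summarizes these steps. It is a minimal sketch, not the notebook's exact code: the S3 paths and the assumption that AWS DMS wrote the XML payload as a string column are for illustration only, so adapt them to the bucket created by the CloudFormation stack.

```scala
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._

// Load the raw Parquet files staged by AWS DMS (path is illustrative)
val rawDf = spark.read.parquet("s3://{stackname}-s3bucket-{xxx}/data/")

// Infer the schema of the XML payload from the string column
// (assumes STUDY_MATERIAL_INFO was written by AWS DMS as a string)
val payloadSchema = schema_of_xml(rawDf.select("STUDY_MATERIAL_INFO").as[String])

// Parse the XML string into a struct column and drop the original field
val parsed = rawDf
  .withColumn("course_material", from_xml($"STUDY_MATERIAL_INFO", payloadSchema))
  .drop("STUDY_MATERIAL_INFO")

// Write the curated result back to Amazon S3 in Parquet format (path is illustrative)
parsed.write.mode("overwrite").parquet("s3://{stackname}-s3bucket-{xxx}/curated/")
```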
Because of structure differences between DataFrame and XML, there are some conversion rules from XML data to DataFrame and from DataFrame to XML data. More details and documentation are available in XML Data Source for Apache Spark.
When we convert from XML to DataFrame, attributes are converted as fields with the heading prefix attributePrefix (underscore (_) is the default). For example, see the following code:
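The snippet below is a minimal sketch, assuming a small placeholder XML document with a single attribute on its root element (the element and attribute names are made up for illustration), and reusing the imports from the sketch above:

```scala
// Placeholder XML: a root element with one attribute and one child element
val xmlDf = Seq("""<book id="bk101"><title>Spark Basics</title></book>""").toDF("payload")

// Infer the schema and parse the string column into a struct
val xmlSchema = schema_of_xml(xmlDf.select("payload").as[String])
xmlDf.withColumn("parsed", from_xml($"payload", xmlSchema))
  .select("parsed.*")
  .printSchema()
```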
It produces the following schema:
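With the placeholder XML above, printSchema reports the attribute as a field prefixed with an underscore, roughly as follows:

```
root
 |-- _id: string (nullable = true)
 |-- title: string (nullable = true)
```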
Subsequent, we’ve a price in a component that has no little one parts however attributes. The worth is put in a separate area, valueTag
. See the next code:
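Continuing the placeholder example (names are still illustrative), the title element now carries both a text value and a lang attribute:

```scala
// Placeholder XML: the title element has an attribute and a text value
val xmlDf2 = Seq("""<book id="bk101"><title lang="en">Spark Basics</title></book>""").toDF("payload")

val xmlSchema2 = schema_of_xml(xmlDf2.select("payload").as[String])
xmlDf2.withColumn("parsed", from_xml($"payload", xmlSchema2))
  .select("parsed.*")
  .printSchema()
```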
It produces the following schema, and the tag lang is converted into the _lang field inside the DataFrame:
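With that input, the element's text is placed in the default valueTag field (_VALUE) and the attribute becomes _lang; the exact field order may differ, but the schema looks roughly like this:

```
root
 |-- _id: string (nullable = true)
 |-- title: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _lang: string (nullable = true)
```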
Copy curated data into Amazon Redshift and query tables seamlessly
Because our semi-structured nested dataset is already written in the S3 bucket as Apache Parquet formatted files, we can use the COPY command with the SERIALIZETOJSON option to ingest data into Amazon Redshift. The Amazon Redshift table structure should match the metadata of the Parquet files. Amazon Redshift can replace any Parquet columns, including structure and array types, with SUPER data columns.
The following code demonstrates a CREATE TABLE example to create a staging table.
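A minimal sketch, assuming the curated file carries course_id, course_name, and the parsed course_material struct; the table name and column sizes are assumptions:

```sql
-- Staging table in the rs_xml schema; table name and sizes are illustrative
CREATE TABLE rs_xml.stg_course_material (
    course_id       BIGINT,
    course_name     VARCHAR(256),
    course_material SUPER
);
```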
The following code shows a COPY example that loads the data from Parquet format:
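This is a sketch of the COPY command, assuming the curated Parquet files sit under the curated/ prefix and that an IAM role attached to the cluster can read the bucket; replace the placeholders with your values:

```sql
-- Load the curated Parquet files; SERIALIZETOJSON maps struct columns to SUPER
COPY rs_xml.stg_course_material
FROM 's3://{stackname}-s3bucket-{xxx}/curated/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
FORMAT AS PARQUET SERIALIZETOJSON;
```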
By using semistructured data support in Amazon Redshift, you can ingest and store semistructured data in your Amazon Redshift data warehouses. With the SUPER data type and PartiQL language, Amazon Redshift expands the data warehouse capability to integrate with both SQL and NoSQL data sources. The SUPER data type only supports up to 1 MB of data for an individual SUPER field or object. Note that an individual value within a SUPER object is limited to the maximum length of the corresponding Amazon Redshift type. For example, a single string value loaded to SUPER is limited to the maximum VARCHAR length of 65535 bytes. See Limitations for more details.
The following example shows how nested structures in a SUPER column can be easily accessed using SELECT statements:
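A sketch of this PartiQL-style navigation, assuming the parsed XML exposes attributes such as subject and study_material under course_material; these nested names are placeholders that depend on the XML tags in your data:

```sql
-- Dot notation navigates into the SUPER struct; nested names are placeholders
SELECT c.course_id,
       c.course_name,
       c.course_material.subject,
       c.course_material.study_material
FROM rs_xml.stg_course_material AS c;
```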
The following screenshot shows our results.
Clean up
To avoid incurring future charges, first delete the notebook and the related files in the Amazon S3 bucket as explained in this EMR documentation page, then delete the CloudFormation stack.
Conclusion
This post demonstrated how to use AWS services like AWS DMS, Amazon S3, Amazon EMR, and Amazon Redshift to seamlessly work with complex data types like XML and perform historical migrations when building a cloud data lake house on AWS. We encourage you to try this solution and take advantage of all the benefits of these purpose-built services.
If you have questions or suggestions, please leave a comment.
About the authors
Abhilash Nagilla is a Sr. Specialist Solutions Architect at AWS, helping public sector customers on their cloud journey with a focus on AWS analytics services. Outside of work, Abhilash enjoys learning new technologies, watching movies, and visiting new places.
Avinash Makey is a Specialist Solutions Architect at AWS. He helps customers with data and analytics solutions in AWS. Outside of work, he plays cricket, tennis, and volleyball in his free time.
Fabrizio Napolitano is a Senior Specialist SA for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite unexpectedly become a Hockey Dad after moving to Canada.