Introducing MongoDB Atlas metadata assortment with AWS Glue crawlers



For data lake customers who need to discover petabytes of data, AWS Glue crawlers are a popular way to discover and catalog data in the background. This allows users to search and find relevant data from multiple data sources. Many customers also have data in managed operational databases such as MongoDB Atlas and need to combine it with data from Amazon Simple Storage Service (Amazon S3) data lakes to derive insights. AWS Glue crawlers now support MongoDB Atlas, making it simpler for you to understand how MongoDB collections evolve and to extract meaningful insights.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.

MongoDB Atlas is a developer data service from AWS technology partner MongoDB, Inc. The service combines transactional processing, relevance-based search, real-time analytics, and mobile-to-cloud data synchronization in an integrated architecture.

With today's launch, you can create and schedule an AWS Glue crawler to crawl MongoDB Atlas. In the crawler setup, you can select MongoDB as a data source. You can then create an AWS Glue connection with MongoDB Atlas and provide the MongoDB Atlas cluster name and credentials. We walk you through this process in this post.

Solution overview

The following architecture illustrates how you can scan a MongoDB Atlas database and collections using AWS Glue.

With each run, the crawler inspects the specified collections and catalogs information, such as updates or deletes to MongoDB Atlas collections, views, and materialized views, in the AWS Glue Data Catalog. In AWS Glue Studio, you can then use the AWS Glue Data Catalog as a source to pull data from MongoDB Atlas and populate an Amazon S3 target. Finally, this job can run and read data from MongoDB Atlas and write the results to Amazon S3, opening up possibilities to integrate with AWS services such as Amazon SageMaker, Amazon QuickSight, and more.

In the following sections, we describe how to create an AWS Glue crawler with MongoDB Atlas as a data source. We then create an AWS Glue connection and provide the MongoDB Atlas cluster information and credentials. Finally, we specify the MongoDB Atlas database and collections to crawl.

Prerequisites

To follow along with this post, you must have access to MongoDB Atlas and the AWS Management Console. We also assume you have access to a VPC with subnets preconfigured via Amazon Virtual Private Cloud (Amazon VPC). The crawler that we configure later in the post runs in the VPC and connects to MongoDB Atlas via an AWS PrivateLink endpoint.

Set up MongoDB Atlas

To configure MongoDB Atlas, complete the following steps:

  1. Configure a MongoDB cluster on AWS. For instructions, refer to How to Set Up a MongoDB Cluster.
  2. Configure PrivateLink by following the steps described in Connecting Applications Securely to a MongoDB Atlas Data Plane with AWS PrivateLink.

This allows us to simplify our networking architecture and make sure the traffic stays on the AWS network.
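If you prefer to script the AWS side of the PrivateLink setup, the following boto3 sketch creates the interface VPC endpoint for the endpoint service name that Atlas displays during its PrivateLink configuration. The Region, service name, VPC, subnet, and security group IDs shown here are placeholders for illustration.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Placeholder values; Atlas displays the endpoint service name
    # during its PrivateLink setup flow.
    response = ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",
        ServiceName="com.amazonaws.vpce.us-east-1.vpce-svc-01234567890abcdef",
        SubnetIds=["subnet-0123456789abcdef0"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
    )
    print(response["VpcEndpoint"]["VpcEndpointId"])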

Next, we obtain the MongoDB cluster connection string from the Connect UI on the MongoDB Atlas console.

  1. On the MongoDB Atlas console, choose Connect, Private Endpoint, and Connection Method.
  2. Copy the SRV connection string.

We use this SRV connection string in the subsequent steps.
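A private-endpoint SRV connection string generally looks like the following; the cluster host shown here is a made-up example:

    mongodb+srv://cluster0-pl-0.ab1cd.mongodb.net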

The following screenshot shows that we have loaded a sample collection in MongoDB Atlas, which we crawl in the subsequent steps. Note that the records in this collection include several arrays as well as nested data.
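For reference, an abridged document from MongoDB's sample_airbnb.listingsAndReviews sample dataset looks roughly like the following (fields trimmed for illustration), with the amenities array and the nested address subdocument that the crawler needs to model:

    {
      "_id": "10006546",
      "name": "Ribeira Charming Duplex",
      "property_type": "House",
      "amenities": ["TV", "Wifi", "Kitchen"],
      "address": {
        "street": "Porto, Porto, Portugal",
        "country": "Portugal",
        "location": { "type": "Point", "coordinates": [-8.61308, 41.1413] }
      },
      "reviews": [
        { "reviewer_name": "Cátia", "comments": "A casa da Ana e do Gonçalo..." }
      ]
    }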

Set up the MongoDB Atlas connection with AWS Glue

Before we can configure the AWS Glue crawler, we need to create the MongoDB Atlas connection in AWS Glue.

  1. On the AWS Glue Studio console, choose Connectors in the navigation pane.
  2. Choose Create connection.
  3. When filling out the connection details, use the SRV connection string we obtained earlier from MongoDB Atlas.
  4. In the Network options section, the VPC and subnets must correspond to the PrivateLink settings you configured earlier.
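The same connection can also be created programmatically. The following boto3 sketch is a minimal illustration: the connection name, URL, credentials, and network values are placeholders, and in practice you would store the password in AWS Secrets Manager rather than inline.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_connection(
        ConnectionInput={
            "Name": "mongodb-atlas-connection",  # placeholder name
            "ConnectionType": "MONGODB",
            "ConnectionProperties": {
                # SRV string copied from the Atlas Connect UI
                "CONNECTION_URL": "mongodb+srv://cluster0-pl-0.ab1cd.mongodb.net",
                "USERNAME": "atlas_user",
                "PASSWORD": "atlas_password",
            },
            # Must match the VPC/subnet used for the PrivateLink endpoint
            "PhysicalConnectionRequirements": {
                "SubnetId": "subnet-0123456789abcdef0",
                "SecurityGroupIdList": ["sg-0123456789abcdef0"],
                "AvailabilityZone": "us-east-1a",
            },
        }
    )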

Create a MongoDB crawler

After we create the connection, we can create an AWS Glue crawler.

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. For Name, enter a name.
  4. For the data source, choose the MongoDB Atlas data source we configured earlier and supply the path that corresponds to the MongoDB Atlas database and collection.
  5. Configure your security settings, output, and scheduling.
  6. On the Crawlers page, choose Run crawler.

After the crawler finishes crawling the MongoDB collections, its status shows as Completed.
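If you want to automate this step, the following boto3 sketch creates and runs an equivalent crawler. The crawler name, IAM role, and catalog database are illustrative placeholders; the path follows the database/collection form used above.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_crawler(
        Name="mongodb-atlas-crawler",  # placeholder name
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
        DatabaseName="mongodb_atlas_db",  # Data Catalog database to populate
        Targets={
            "MongoDBTargets": [
                {
                    "ConnectionName": "mongodb-atlas-connection",
                    # Path format: <database>/<collection>
                    "Path": "sample_airbnb/listingsAndReviews",
                    "ScanAll": True,
                }
            ]
        },
    )
    glue.start_crawler(Name="mongodb-atlas-crawler")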

Review the MongoDB AWS Glue database and table

We can navigate to the AWS Glue Data Catalog to examine the tables that were created by the crawler.

Choose the table to view the schema and other metadata.

Note that the crawler captured nested data as a STRUCT and correctly listed the ARRAY fields.
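As an illustration, for the sample document shown earlier the inferred columns typically look something like the following; the exact names and types depend on what the crawler infers from your data:

    name        string
    amenities   array<string>
    address     struct<street:string,country:string,
                       location:struct<type:string,coordinates:array<double>>>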

Import MongoDB Atlas data to Amazon S3

Now we use the MongoDB Atlas-based AWS Glue Data Catalog table to perform a data import without writing code. We use AWS Glue Studio to build boilerplate code quickly. Alternatively, you can build the script in the script editor.

  1. On the AWS Glue Studio console, choose Jobs in the navigation pane.
  2. Choose Create job.
  3. Select Visual with a source and target.
  4. Choose the Data Catalog table as the source and Amazon S3 as the target.
  5. In the AWS Glue Studio UI, supply additional parameters such as the S3 bucket name, and choose the database and table from the drop-down menus.

  6. Next, review the generated script built by AWS Glue Studio. We now need to add the database and collection in the script as follows:

    additional_options = {"database": "sample_airbnb", "collection": "listingsAndReviews"},
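In context, the relevant portion of the generated script looks roughly like the following sketch. The catalog database, table, and bucket names are placeholders; the additional_options argument identifying the MongoDB database and collection is the only edit this post calls for.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read from the Data Catalog table the crawler created; the
    # additional_options identify the MongoDB database and collection.
    source = glueContext.create_dynamic_frame.from_catalog(
        database="mongodb_atlas_db",      # placeholder catalog database
        table_name="listingsandreviews",  # placeholder catalog table
        additional_options={"database": "sample_airbnb",
                            "collection": "listingsAndReviews"},
    )

    # Write the results to Amazon S3 as JSON.
    glueContext.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://amzn-s3-demo-bucket/mongodb-export/"},
        format="json",
    )
    job.commit()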

When the ETL job is complete, the extracted data is available in Amazon S3.

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose the bucket and folder containing the extracted data.
  3. Choose a file, and on the Actions menu, choose Query with S3 Select to view the contents of the file.
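The same spot check can be scripted with S3 Select via boto3. This minimal sketch assumes the job wrote JSON Lines output, and uses a placeholder bucket and object key:

    import boto3

    s3 = boto3.client("s3")

    # Placeholder bucket and key for one exported output file.
    resp = s3.select_object_content(
        Bucket="amzn-s3-demo-bucket",
        Key="mongodb-export/run-00000-part-r-00000",
        ExpressionType="SQL",
        Expression="SELECT * FROM S3Object s LIMIT 5",
        InputSerialization={"JSON": {"Type": "LINES"}},
        OutputSerialization={"JSON": {}},
    )
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode("utf-8"))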

Clean up

To avoid incurring charges for the services used in this walkthrough, complete the following steps to delete your resources:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Select your crawler, and on the Action menu, choose Delete crawler.
  3. On the AWS Glue Studio console, choose View jobs.
  4. Select the job you created, and on the Actions menu, choose Delete job(s).
  5. Return to the AWS Glue console and choose Tables in the navigation pane.
  6. Select your table and choose Delete.
  7. Choose Databases in the navigation pane.
  8. Select your database and choose Delete.
  9. On the Amazon VPC console, choose Endpoints in the navigation pane.
  10. Select the PrivateLink endpoint you created, and on the Actions menu, choose Delete VPC endpoints.
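The cleanup can also be scripted. The following boto3 sketch uses the placeholder names from the earlier examples, plus a hypothetical job name and endpoint ID:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")
    ec2 = boto3.client("ec2", region_name="us-east-1")

    glue.delete_crawler(Name="mongodb-atlas-crawler")
    glue.delete_job(JobName="mongodb-atlas-to-s3")  # placeholder job name
    glue.delete_table(DatabaseName="mongodb_atlas_db",
                      Name="listingsandreviews")
    glue.delete_database(Name="mongodb_atlas_db")
    ec2.delete_vpc_endpoints(VpcEndpointIds=["vpce-0123456789abcdef0"])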

Conclusion

In this post, we showed how to set up an AWS Glue crawler to crawl a MongoDB Atlas collection, gather metadata, and create table records in the AWS Glue Data Catalog. With the Data Catalog table, we created an ETL process using the AWS Glue Studio UI to extract data from the MongoDB Atlas collection to an S3 bucket without writing a single line of code.

You can try this yourself by configuring an AWS Glue crawler, creating an AWS Glue ETL job with AWS Glue Studio, and launching MongoDB Atlas from a Quick Start or from MongoDB Atlas on AWS Marketplace.

Special thanks to everyone who contributed to this crawler feature launch: Julio Montes de Oca, Mita Gavade, and Alex Prazma.


About the authors

Igor Alekseev is a Senior Partner Solutions Architect at AWS in the Data and Analytics domain. In his role, Igor works with strategic partners to help them build complex, AWS-optimized architectures. Prior to joining AWS, he implemented many projects in the big data domain as a Data/Solutions Architect, including several data lakes in the Hadoop ecosystem. As a Data Engineer, he was involved in applying AI/ML to fraud detection and office automation.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.
