A data lake is a centralized, curated, and secured repository that stores all of your data, both in its original form and prepared for analysis. Setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks. AWS Glue and AWS Lake Formation make it easy to build, secure, and manage data lakes. As data from existing data stores is moved into the data lake, there is a need to catalog the data to prepare it for analytics from services such as Amazon Athena.
AWS Glue crawlers are a popular way to populate the AWS Glue Data Catalog. AWS Glue crawlers are a key component that allows you to connect to data sources or targets, use different classifiers to determine the logical schema for the data, and create metadata in the Data Catalog. You can run crawlers on a schedule, on demand, or triggered based on an Amazon Simple Storage Service (Amazon S3) event to make sure that the Data Catalog is up to date. Using S3 event notifications can reduce the cost and time a crawler needs to update large and frequently changing tables.
The AWS Glue crawlers UI has been redesigned to offer a better user experience, and new functionalities have been added. This new UI provides easier setup of crawlers across multiple sources, including Amazon S3, Amazon DynamoDB, Amazon Redshift, Amazon Aurora, Amazon DocumentDB (with MongoDB compatibility), Delta Lake, MariaDB, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, and MongoDB. A new AWS Glue crawler history feature has also been launched, which provides a convenient way to view crawler runs, their schedules, data sources, and tags. For each crawl, the crawler history offers a summary of data modifications such as changes in the database schema or Amazon S3 partition changes. Crawler history also provides DPU hours, which can reduce the time to analyze and debug crawler operations and costs.
This post shows how to create an AWS Glue crawler that supports S3 event notifications using the new UI. We also show how to navigate through the new crawler history section and gain valuable insights.
Overview of solution
To demonstrate how to create an AWS Glue crawler using the new UI, we use the Toronto parking tickets dataset, specifically the data about parking tickets issued in the city of Toronto between 2017–2018. The goal is to create a crawler based on S3 events, run it, and explore the information shown in the UI about the run of this crawler.
As mentioned before, instead of crawling all the subfolders on Amazon S3, we use an S3 event-based approach. This helps improve the crawl time by using S3 events to identify the changes between two crawls, listing only the files from the subfolder that triggered the event instead of listing the full Amazon S3 target. For this post, we create an S3 event, an Amazon Simple Notification Service (Amazon SNS) topic, and an Amazon Simple Queue Service (Amazon SQS) queue.
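The CloudFormation stack described in the next section creates and wires up these resources for you. For illustration only, a manual equivalent using the AWS CLI might look like the following sketch; the bucket, topic, and queue names here are placeholders, not the resources the stack creates:

```
# Send S3 object-created events from the bucket to an SNS topic (placeholder names)
aws s3api put-bucket-notification-configuration \
  --bucket my-crawler-bucket \
  --notification-configuration '{
    "TopicConfigurations": [{
      "TopicArn": "arn:aws:sns:us-east-1:111122223333:my-glue-sns-topic",
      "Events": ["s3:ObjectCreated:*"]
    }]
  }'

# Subscribe an SQS queue to the topic; the crawler later consumes messages from this queue
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:111122223333:my-glue-sns-topic \
  --protocol sqs \
  --notification-endpoint arn:aws:sqs:us-east-1:111122223333:my-glue-event-queue
```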
The following diagram illustrates our solution architecture.
Prerequisites
For this walkthrough, you should have the following prerequisites:
If the AWS account you use to follow this post uses Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user with access to create databases and tables. For more information, refer to Implicit Lake Formation permissions.
Launch your CloudFormation stack
To create your resources for this use case, complete the following steps:
- Launch your CloudFormation stack in us-east-1.
- Under Parameters, enter a name for your S3 bucket (include your account number).
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Choose Create stack.
- Wait until the creation of the stack is complete, as shown on the AWS CloudFormation console.
- On the stack's Outputs tab, take note of the SQS queue ARN; we use it during the crawler creation process.
Launching this stack creates AWS resources. You need the following resources from the Outputs tab for the next steps:
- GlueCrawlerRole – The IAM role to run AWS Glue jobs
- BucketName – The name of the S3 bucket to store solution-related files
- GlueSNSTopic – The SNS topic, which we use as the target for the S3 event
- SQSArn – The SQS queue ARN; this queue is going to be consumed by the AWS Glue crawler
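If you prefer the command line, you can read the same values from the Outputs tab with the AWS CLI; the stack name below is whatever you entered when launching the stack:

```
# List the stack outputs, including SQSArn, BucketName, GlueSNSTopic, and GlueCrawlerRole
aws cloudformation describe-stacks \
  --stack-name <your-stack-name> \
  --query "Stacks[0].Outputs" \
  --output table
```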
Create an AWS Glue crawler
Let's first create the dataset that is going to be used as the source of the AWS Glue crawler:
- Open AWS CloudShell.
- Run the following command:
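The exact command depends on where you stage the Toronto parking tickets files. A minimal sketch, assuming the 2017 CSV file is available locally in CloudShell and using the bucket name you chose in the CloudFormation template (the file name and year=2017 prefix are assumptions):

```
# Upload the 2017 data under a year=2017 prefix so the crawler registers it as a partition
aws s3 cp Parking_Tags_Data_2017.csv \
  s3://glue-crawler-blog-YOUR_ACCOUNT_NUMBER/torontotickets/year=2017/Parking_Tags_Data_2017.csv
```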
This action triggers an S3 event that sends a message to the SNS topic that you created using the CloudFormation template. This message is consumed by an SQS queue that will be the input for the AWS Glue crawler. Now, let's create the AWS Glue crawler.
- On the AWS Glue console, choose Crawlers in the navigation pane.
- Choose Create crawler.
- For Name, enter a name (for example, BlogPostCrawler).
- Choose Next.
- For Is your data already mapped to Glue tables, select Not yet.
- In the Data sources section, choose Add a data source.
For this post, you use an S3 dataset as a source.
- For Data source, choose S3.
- For Location of S3 data, select In this account.
- For S3 path, enter the path to the S3 bucket you created with the CloudFormation template (s3://glue-crawler-blog-YOUR ACCOUNT NUMBER/torontotickets/).
- For Subsequent crawler runs, select Crawl based on events.
- Enter the SQS queue ARN you created earlier.
- Choose Add an S3 data source.
- Choose Next.
- For Existing IAM role, choose the role you created (GlueCrawlerBlogRole).
- Choose Next.
Now let's create an AWS Glue database.
- Under Target database, choose Add database.
- For Name, enter blogdb.
- For Location, choose the S3 bucket created by the CloudFormation template.
- Choose Create database.
- On the Set output and scheduling page, for Target database, choose the database you just created (blogdb).
- For Table name prefix, enter blog.
- For Maximum table threshold, you can optionally set a limit for the number of tables that this crawler can scan. For this post, we leave this option blank.
- For Frequency, choose On demand.
- Choose Next.
- Review the configuration and choose Create crawler.
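If you would rather script this step instead of using the console, a roughly equivalent AWS CLI call could look like the following sketch; the account number and queue name are placeholders, and the console steps above remain the reference:

```
# Create an event-based crawler that reads change notifications from the SQS queue;
# omitting --schedule leaves the crawler on demand
aws glue create-crawler \
  --name BlogPostCrawler \
  --role GlueCrawlerBlogRole \
  --database-name blogdb \
  --table-prefix blog \
  --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
  --targets '{
    "S3Targets": [{
      "Path": "s3://glue-crawler-blog-YOUR_ACCOUNT_NUMBER/torontotickets/",
      "EventQueueArn": "arn:aws:sqs:us-east-1:YOUR_ACCOUNT_NUMBER:your-queue-name"
    }]
  }'
```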
Run the AWS Glue crawler
To run the crawler, navigate to the crawler on the AWS Glue console.
Choose Run crawler.
On the Crawler runs tab, you can see the current run of the crawler.
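You can also start the crawler and check its status from the AWS CLI:

```
# Start the crawler, then poll its state until it returns to READY
aws glue start-crawler --name BlogPostCrawler
aws glue get-crawler --name BlogPostCrawler --query "Crawler.State"
```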
Explore the crawler run history data
When the crawler is complete, you can see the following details:
- Duration – The exact duration of the crawler run
- DPU hours – The number of DPU hours spent during the crawler run; this is very useful for calculating costs
- Table changes – The changes applied to the table, like new columns or partitions
Choose Table changes to see the crawler run summary.
You can see that the table blogtorontotickets was created, and also a 2017 partition.
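To confirm the same result outside the console, you can inspect the new table and its partitions with a couple of read-only CLI calls, using the names created in this walkthrough:

```
# Show the table the crawler created and the partition values it registered
aws glue get-table --database-name blogdb --name blogtorontotickets
aws glue get-partitions --database-name blogdb --table-name blogtorontotickets \
  --query "Partitions[].Values"
```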
Let's add more data to the S3 bucket and run the crawler one more time to see how it processes this change.
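The commands for this step aren't reproduced above; a sketch consistent with the earlier upload, assuming a local 2018 CSV file and the same bucket layout, is:

```
# Upload the 2018 data under a new year=2018 prefix, then run the crawler again
aws s3 cp Parking_Tags_Data_2018.csv \
  s3://glue-crawler-blog-YOUR_ACCOUNT_NUMBER/torontotickets/year=2018/Parking_Tags_Data_2018.csv
aws glue start-crawler --name BlogPostCrawler
```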
You can see the second run of the crawler listed.
Note that the DPU hours were reduced by more than half; this is because only one partition was scanned and added. Having an event-based crawler helps reduce runtime and cost.
You can choose the Table changes information of the second run to see more details.
Note that under Partitions added, the 2018 partition was created.
Additional notes
Keep in mind the following considerations:
- Crawler history is supported for crawls that have occurred since the launch date of the crawler history feature, and only retains up to 12 months of crawls. Older crawls will not be returned.
- To set up a crawler using AWS CloudFormation, you can use the following template.
- You can get all the crawls of a specified crawler by using the ListCrawls API; see the CLI sketch after this list.
- You can update existing crawlers with a single Amazon S3 target to use this new feature. You can do this either via the AWS Glue console or by calling the update_crawler API.
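For reference, here is what those two calls might look like from the AWS CLI, using the crawler created in this post; the queue ARN is a placeholder:

```
# List recent crawls of the crawler, including status, duration, and DPU hours
aws glue list-crawls --crawler-name BlogPostCrawler --max-results 10

# Switch an existing single-S3-target crawler to event mode
aws glue update-crawler \
  --name BlogPostCrawler \
  --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
  --targets '{
    "S3Targets": [{
      "Path": "s3://glue-crawler-blog-YOUR_ACCOUNT_NUMBER/torontotickets/",
      "EventQueueArn": "arn:aws:sqs:us-east-1:YOUR_ACCOUNT_NUMBER:your-queue-name"
    }]
  }'
```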
Clean up
To avoid incurring future charges, and to clean up unused roles and policies, delete the resources you created: the CloudFormation stack, S3 bucket, AWS Glue crawler, AWS Glue database, and AWS Glue table.
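If you used the CLI for the earlier steps, you can clean up the same way; note that the S3 bucket must be emptied before the stack that created it can be deleted (stack name is whatever you chose at launch):

```
# Delete the crawler and database created in the console
aws glue delete-crawler --name BlogPostCrawler
aws glue delete-database --name blogdb

# Empty the bucket so CloudFormation can delete it, then delete the stack
aws s3 rm s3://glue-crawler-blog-YOUR_ACCOUNT_NUMBER --recursive
aws cloudformation delete-stack --stack-name <your-stack-name>
```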
Conclusion
You can use AWS Glue crawlers to discover datasets, extract schema information, and populate the AWS Glue Data Catalog. AWS Glue crawlers now provide an easier-to-use UI workflow to set up crawlers and also provide metrics associated with past crawler runs to simplify monitoring and auditing. In this post, we provided a CloudFormation template to set up AWS Glue crawlers to use S3 event notifications, which reduces the time and cost needed to incrementally process table data updates in the AWS Glue Data Catalog. We also showed you how to monitor and understand the cost of crawlers.
Special thanks to everyone who contributed to the crawler history launch: Theo Xu, Jessica Cheng, and Joseph Barlan.
Happy crawling!
About the authors
Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he has over a decade of experience in data management, helping customers around the globe address their business and technical needs. Connect with him on LinkedIn.
Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.