Crawl Delta Lake tables using AWS Glue crawlers



In the recent evolution of data lake technologies, it has become popular to bring ACID (atomicity, consistency, isolation, and durability) transactions to Amazon Simple Storage Service (Amazon S3). You can achieve that by introducing open-source data lake formats such as Apache Hudi, Apache Iceberg, and Delta Lake. Delta Lake is one of the common open-source data lake formats.

Delta Lake is available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. To use Delta Lake from Glue Spark jobs, follow this blog series. To use Delta Lake from Athena and Redshift Spectrum, you need specific table definitions in the AWS Glue Data Catalog, and there is an extra step to make it queryable from Athena and Redshift Spectrum.

One of the key capabilities of Delta Lake and other data lake formats is reading consistent snapshots using ACID transactions. Even when there are many concurrent writes, you are guaranteed a consistent version of the tables at a specific point in time, without retrieving intermediate or incomplete state. This is a highly demanded capability, especially in complex data pipelines.

AWS Glue crawlers are designed to populate table definitions in the Data Catalog dynamically, based on the data itself. This year, AWS Glue crawlers started supporting Delta Lake. This simplifies those use cases by creating table definitions of Delta tables dynamically, populating the metadata from the Delta Lake transaction logs, and creating the manifest files in Amazon S3 for Athena and Redshift Spectrum to consume. With the Delta Lake crawler, you can easily read consistent snapshots from Athena and Redshift Spectrum. The AWS Glue crawler integration with Delta Lake also supports AWS Lake Formation access control. You can grant Lake Formation permissions on the Delta tables created by the crawler to AWS principals that then query through Athena and Redshift Spectrum to access data in Delta tables.

This post demonstrates how AWS Glue crawlers work with Delta tables, and describes typical use cases for querying Delta tables.

How AWS Glue crawlers work with Delta Lake

Delta Lake provides an abstraction known as a Delta table that encapsulates all metadata, data files, and partitions under a transaction log. Delta Lake stores the metadata and schema within the distributed file system rather than in a central data catalog.

To access data using the Delta Lake protocol, Redshift Spectrum and Athena need a manifest file that lists all files that are associated with a particular Delta table, along with the table metadata populated in the AWS Glue Data Catalog. Traditionally, creating this manifest file required running a GENERATE symlink_format_manifest query on Apache Spark.
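
For reference, the following is a minimal sketch of that traditional approach, run from a Spark environment (such as an EMR cluster) with the Delta Lake package available. The delta-core version shown is an assumption; match it to your Spark version:

$ spark-sql \
    --packages io.delta:delta-core_2.12:2.1.0 \
    --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
    --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
    -e "GENERATE symlink_format_manifest FOR TABLE delta.\`s3://your_s3_bucket/data/sample_delta_table/\`"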

The AWS Glue crawler populates the metadata from the Delta Lake transaction log into the Data Catalog, and creates the manifest files in Amazon S3 for different query engines to consume. To simplify access to Delta tables, the crawler provides an option to select a Delta Lake data store, which encapsulates all the parameters that are required for crawling. For each Delta Lake data store, the crawler scans the Delta table's transaction log to detect metadata. It populates the _symlink_format_manifest folder with the manifest files, partitioned by the partition keys, based on configuration parameters that you choose.

Crawl Delta Lake tables using an AWS Glue crawler

In this tutorial, let's go through how to crawl Delta tables using an AWS Glue crawler.

Prerequisites

Complete the following prerequisite steps for this tutorial:

  1. Install and configure the AWS Command Line Interface (AWS CLI).
  2. Create your S3 bucket if you don't have one.
  3. Create an AWS Identity and Access Management (IAM) role for your AWS Glue crawler if you don't have one. For instructions, refer to Create an IAM role for AWS Glue.
  4. Run the following command to copy the sample Delta table into your S3 bucket (replace your_s3_bucket with your S3 bucket name):
$ aws s3 sync s3://aws-bigdata-blog/artifacts/delta-lake-crawler/sample_delta_table/ s3://your_s3_bucket/data/sample_delta_table/

Create a Delta Lake crawler

You can create a Delta Lake crawler via the AWS Glue console, the AWS Glue SDK, or the AWS CLI. In the SDK or CLI, specify a DeltaTarget with the following configurations (a CLI sketch follows the list):

  • DeltaTables – A list of Amazon S3 DeltaPath values where the Delta tables are located. (Note that each path must be the parent of a _delta_log folder.)
  • WriteManifest – A Boolean value indicating whether or not the crawler should write the manifest files for each DeltaPath.
  • ConnectionName – An optional connection name stored in the Data Catalog that the crawler should use to access Delta tables backed by a VPC.

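For example, here is a minimal sketch of creating the same crawler with the AWS CLI instead of the console. The role ARN is an assumption; substitute the IAM role you created in the prerequisites, and note that the sketch assumes the delta_lake database already exists in the Data Catalog:

$ aws glue create-crawler \
    --name delta-lake-crawler \
    --role arn:aws:iam::123456789101:role/AWSGlueServiceRole-Default \
    --database-name delta_lake \
    --targets '{"DeltaTargets": [{"DeltaTables": ["s3://your_s3_bucket/data/sample_delta_table/"], "WriteManifest": true}]}'
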
To create your crawler on the AWS Glue console, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. For Name, enter delta-lake-crawler, and choose Next.
  4. For Data source configuration, select Not yet.
  5. For Data source, choose Add a data source.
  6. For Data source, select Delta Lake.
  7. For Include delta lake table paths, enter s3://your_s3_bucket/data/sample_delta_table/.
  8. Select Enable write manifest, then choose Add a Delta Lake data source. Choose Next.
  9. For IAM role, under Existing IAM role, choose your IAM role, then choose Next.
  10. For Target database, choose Add database; the Create a database page opens.
    For Name, enter delta_lake, then choose Create database. Then come back to the previous page. For Target database, click the reload button, and select the delta_lake database.
  11. For Frequency under Crawler schedule, choose On demand, then choose Next.
  12. Review your configuration, and choose Create crawler. You can trigger the crawler to run manually via the AWS Glue console, or through the SDK or AWS CLI using the StartCrawler API (see the CLI sketch after this list). You could also schedule a trigger via the AWS Glue console. For this post, we run the crawler via the AWS Glue console.
  13. Select delta-lake-crawler, and choose Run.
  14. Wait for the crawler to complete.

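If you prefer the AWS CLI for these last two steps, the following sketch starts the crawler; repeat the second command until it prints READY, which indicates the run has finished:

$ aws glue start-crawler --name delta-lake-crawler
$ aws glue get-crawler --name delta-lake-crawler --query 'Crawler.State' --output text
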
After the crawler runs, it writes a single manifest table in the Data Catalog for each DeltaPath under its configuration that has a valid Delta table. The manifest table uses the format SymlinkTextInputFormat and the manifest location s3://your_s3_bucket/data/sample_delta_table/_symlink_format_manifest/.
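
You can verify the generated manifest files with a quick listing of that location (assuming the sample path used earlier):

$ aws s3 ls --recursive s3://your_s3_bucket/data/sample_delta_table/_symlink_format_manifest/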

You can see the Delta table definition on the AWS Glue console. The table location points to the preceding manifest location.

The table definition also includes an AdditionalLocations field, which is an array that points to the location s3://your_s3_bucket/data/sample_delta_table/. You can access this additional field via the following AWS CLI command:

$ aws glue get-table --database-name delta_lake --name sample_delta_table
{
    "Table": {
        "Name": "sample_delta_table",
        "DatabaseName": "delta_lake",
        "Owner": "owner",
        "CreateTime": "2022-07-07T17:49:43+09:00",
        "UpdateTime": "2022-07-07T20:33:09+09:00",
        "LastAccessTime": "2022-07-07T17:49:43+09:00",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "product_id",
                    "Type": "string"
                },
                {
                    "Name": "product_name",
                    "Type": "string"
                },
                {
                    "Name": "price",
                    "Type": "bigint"
                },
                {
                    "Name": "currency",
                    "Type": "string"
                },
                {
                    "Name": "category",
                    "Type": "string"
                },
                {
                    "Name": "updated_at",
                    "Type": "double"
                }
            ],
            "Location": "s3://your_s3_bucket/data/sample_delta_table/_symlink_format_manifest/",
            "AdditionalLocations": [
                "s3://your_s3_bucket/data/sample_delta_table/"
            ],
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "Compressed": false,
            "NumberOfBuckets": -1,
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
                "Parameters": {}
            },
            "BucketColumns": [],
            "SortColumns": [],
            "Parameters": {
                "CrawlerSchemaDeserializerVersion": "1.0",
                "CrawlerSchemaSerializerVersion": "1.0",
                "UPDATED_BY_CRAWLER": "delta-lake-crawler",
                "classification": "parquet"
            },
            "StoredAsSubDirectories": false
        },
        "PartitionKeys": [],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "CrawlerSchemaDeserializerVersion": "1.0",
            "CrawlerSchemaSerializerVersion": "1.0",
            "UPDATED_BY_CRAWLER": "delta-lake-crawler",
            "classification": "parquet",
            "last_modified_by": "hadoop",
            "last_modified_time": "1657193589",
            "transient_lastDdlTime": "1657193589"
        },
        "CreatedBy": "arn:aws:sts::123456789101:assumed-role/AWSGlueServiceRole-Default/AWS-Crawler",
        "IsRegisteredWithLakeFormation": false,
        "CatalogId": "123456789101",
        "IsRowFilteringEnabled": false,
        "VersionId": "1",
        "DatabaseId": "7cb01f36ebbc460eb2d4dcce2c91ed2b"
    }
}

Query Delta tables

After you create the manifest table, AWS query engines such as Athena and Redshift Spectrum are able to query the data by reading the manifest file locations to filter which data files to query in the Delta table.

Query from Athena

Athena users need to point their catalog to the AWS Glue Data Catalog. Open the Athena console in the same Region as where your table is registered in the Data Catalog, and confirm that the data source is set to AwsDataCatalog.

Now you're ready to run queries on Athena. To access your Delta table, run the following query:

SELECT * FROM "delta_lake"."sample_delta_table" limit 10;

The following screenshot shows our output.
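
If you would rather query from the AWS CLI than the console, the following sketch runs the same query. The results output location is an assumption; any S3 path you can write to works:

$ QUERY_ID=$(aws athena start-query-execution \
      --query-string 'SELECT * FROM "delta_lake"."sample_delta_table" limit 10' \
      --query-execution-context Database=delta_lake \
      --result-configuration OutputLocation=s3://your_s3_bucket/athena-results/ \
      --query 'QueryExecutionId' --output text)

# Fetch the rows once the query finishes.
$ aws athena get-query-results --query-execution-id "$QUERY_ID"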

Query from Redshift Spectrum

Redshift Spectrum requires an external schema pointing to the database in which the Delta table was created.

To query with Redshift Spectrum, complete the following steps:

  1. Create an IAM role for an Amazon Redshift cluster with the following configuration:
    1. For permissions, use arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess, or your custom policy for reading your S3 bucket.
    2. Use the following trust relationship:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "redshift.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }

    To learn more, refer to IAM policies for Amazon Redshift Spectrum.

  2. Launch a new Redshift cluster with the following configuration:
    1. Choose dc2.large, 1 node.
    2. Configure the IAM role you created in step 1.
  3. Connect with the Amazon Redshift query editor v2. For instructions, see Querying a database using the query editor v2.
  4. Create an external schema for the delta_lake database to use in Redshift Spectrum (replace <your IAM role ARN> with your IAM role ARN):
    create external schema spectrum from data catalog
    database 'delta_lake'
    iam_role '<your IAM role ARN>'
    create external database if not exists;

  5. Run the following SQL against spectrum.sample_delta_table:
    SELECT * FROM "dev"."spectrum"."sample_delta_table" LIMIT 10

The following screenshot shows our output.

Limitations of Delta crawlers and manifest tables

When the data or schema in a Delta table is updated, the manifest tables in the AWS Glue Data Catalog may become out of sync. This means you can still query the manifest table and get a consistent result, but the result reflects the previous point in time. To get the latest result, you must update the manifest tables using the crawler, or manually update the manifest table via the AWS Glue SDK or AWS CLI. When you want to keep the manifest table up to date, you can run Delta Lake crawlers on a schedule (for example, once an hour), as in the sketch below.
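
For example, a one-line sketch that switches the crawler from on-demand to an hourly schedule (the cron expression is an example):

$ aws glue update-crawler --name delta-lake-crawler --schedule "cron(0 * * * ? *)"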

When the Delta table data is updated, the manifest files under the _symlink_format_manifest folder of your Delta tables may also become out of sync, in which case you need to rerun a crawler with writeManifest enabled.

Use of Delta tables in EMR and Glue Spark

The Delta Lake crawler is designed for use with Athena, Redshift Spectrum, and other engines compatible with Parquet-based manifest tables. For EMR Spark or Glue Spark jobs, you don't need to create a manifest table by running the Delta Lake crawler; instead, you can read from and write to Delta tables directly using the Delta Lake library. You can follow this blog series to learn how to process Delta tables in Glue Spark jobs.
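
As an illustration of that direct access, a sketch of reading the same table from spark-sql with no manifest involved (same assumed Delta package and version as in the earlier sketch):

$ spark-sql \
    --packages io.delta:delta-core_2.12:2.1.0 \
    --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
    --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
    -e "SELECT * FROM delta.\`s3://your_s3_bucket/data/sample_delta_table/\` LIMIT 10"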

Secure Delta tables using Lake Formation permissions

Manifest tables created by the Delta Lake crawler support Lake Formation access control, including cell-level security. This allows data lake administrators to filter specific rows and columns for certain users of their manifest tables. Through the CreateDataCellsFilter and GrantPermissions APIs, you can grant row and column filters on the Delta manifest table. You can then query the Delta manifest table from Athena and Redshift Spectrum with those filters applied.
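
For illustration, here is a sketch of a hypothetical row-and-column filter on the crawled table, granted to a hypothetical analyst role. The filter name, role name, and filter expression are all assumptions; the column names match the sample table's schema shown earlier:

$ aws lakeformation create-data-cells-filter \
    --table-data '{
      "TableCatalogId": "123456789101",
      "DatabaseName": "delta_lake",
      "TableName": "sample_delta_table",
      "Name": "jpy_products_only",
      "RowFilter": {"FilterExpression": "currency = '\''JPY'\''"},
      "ColumnNames": ["product_id", "product_name", "price", "currency"]
    }'

$ aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789101:role/analyst \
    --permissions SELECT \
    --resource '{"DataCellsFilter": {"TableCatalogId": "123456789101", "DatabaseName": "delta_lake", "TableName": "sample_delta_table", "Name": "jpy_products_only"}}'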

To learn more about Lake Formation cell-level security, refer to the related posts on the AWS Big Data Blog.

Clean up

Now for the final step, cleaning up the resources:

  • Delete the Amazon Redshift cluster.
  • Delete your data under your S3 path: s3://your_s3_bucket/data/sample_delta_table/.
  • Delete the AWS Glue crawler delta-lake-crawler.
  • Delete the AWS Glue database delta_lake.

Conclusion

This post demonstrated how to crawl Delta tables using an AWS Glue crawler, and how to query the crawled tables from Athena and Redshift Spectrum. With AWS Glue crawlers, the manifest files are created automatically, so you can easily integrate Delta tables with Athena and Redshift Spectrum without the manual effort of maintaining manifest files. It also enables you to manage cell-level security on the Delta tables using Lake Formation permissions.

Let's start using AWS Glue crawlers for your own Delta tables. If you have comments or feedback, please feel free to leave them in the comments.


About the authors

Kyle Duong is a Software Development Engineer on the AWS Glue and AWS Lake Formation team. He is passionate about building big data technologies and distributed systems. In his free time, he enjoys cycling or playing basketball.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. This summer, he enjoyed goldfish scooping with his children.
