The Amazon EMR runtime for Apache Spark is a performance-optimized runtime for Apache Spark that's 100% API compatible with open-source Apache Spark. With Amazon EMR release 6.9.0, the EMR runtime for Apache Spark supports Spark version 3.3.0.
With Amazon EMR 6.9.0, you can now run your Apache Spark 3.x applications faster and at lower cost without requiring any changes to your applications. In our performance benchmark tests, derived from TPC-DS performance tests at 3 TB scale, we found the EMR runtime for Apache Spark 3.3.0 provides a 3.5 times (using total runtime) performance improvement on average over open-source Apache Spark 3.3.0.
In this post, we analyze the results from our benchmark tests running a TPC-DS application on open-source Apache Spark and then on Amazon EMR 6.9, which comes with an optimized Spark runtime that's compatible with open-source Spark. We walk through a detailed cost analysis and finally provide step-by-step instructions to run the benchmark.
Results observed
To evaluate the performance improvements, we used an open-source Spark performance test utility that's derived from the TPC-DS performance test toolkit. We ran the tests on a seven-node (six core nodes and one primary node) c5d.9xlarge EMR cluster with the EMR runtime for Apache Spark, and a second seven-node self-managed cluster on Amazon Elastic Compute Cloud (Amazon EC2) with the equivalent open-source version of Spark. We ran both tests with data in Amazon Simple Storage Service (Amazon S3).
Dynamic Resource Allocation (DRA) is a great feature to use for varying workloads. However, for a benchmarking exercise where we compare two platforms purely on performance, and test data volumes don't change (3 TB in our case), we believe it's best to avoid variability in order to run an apples-to-apples comparison. In our tests in both open-source Spark and Amazon EMR, we disabled DRA while running the benchmarking application.
The following table shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between Amazon EMR version 6.9.0 and open-source Spark version 3.3.0. We observed that our TPC-DS tests had a total job runtime on Amazon EMR on Amazon EC2 that was 3.5 times faster than that using an open-source Spark cluster of the same configuration.

The per-query speedup on Amazon EMR 6.9 with and without the EMR runtime for Apache Spark is illustrated in the following chart. The horizontal axis shows each query in the 3 TB benchmark. The vertical axis shows the speedup of each query due to the EMR runtime. Notable performance gains are over 10 times faster for TPC-DS queries 24b, 72, 95, and 96.
Cost analysis
The performance improvements of the EMR runtime for Apache Spark directly translate to lower costs. We were able to realize a 67% cost savings running the benchmark application on Amazon EMR, in comparison with the cost incurred to run the same application on open-source Spark on Amazon EC2 with the same cluster sizing, due to reduced hours of Amazon EMR and Amazon EC2 usage. Amazon EMR pricing is for EMR applications running on EMR clusters with EC2 instances. The Amazon EMR price is added to the underlying compute and storage prices such as the EC2 instance price and Amazon Elastic Block Store (Amazon EBS) cost (if attaching EBS volumes). Overall, the estimated benchmark cost in the US East (N. Virginia) Region is $27.01 per run for open-source Spark on Amazon EC2 and $8.82 per run for Amazon EMR.
| Benchmark Job | Runtime (Hours) | Estimated Cost | Total EC2 Instances | Total vCPU | Total Memory (GiB) | Root Device (Amazon EBS) |
| --- | --- | --- | --- | --- | --- | --- |
| Open-source Spark on Amazon EC2 (1 primary and 6 core nodes) | 2.23 | $27.01 | 7 | 252 | 504 | 20 GiB gp2 |
| Amazon EMR on Amazon EC2 (1 primary and 6 core nodes) | 0.63 | $8.82 | 7 | 252 | 504 | 20 GiB gp2 |
Cost breakdown
The following is the cost breakdown for the open-source Spark on Amazon EC2 job ($27.01):
- Total Amazon EC2 cost – (7 * $1.728 * 2.23) = (number of instances * c5d.9xlarge hourly rate * job runtime in hours) = $26.97
- Amazon EBS cost – ($0.1/730 * 20 * 7 * 2.23) = (Amazon EBS per GB-hour rate * root EBS size * number of instances * job runtime in hours) = $0.042
The following is the cost breakdown for the Amazon EMR on Amazon EC2 job ($8.82):
- Total Amazon EMR cost – (7 * $0.27 * 0.63) = ((number of core nodes + number of primary nodes) * c5d.9xlarge Amazon EMR price * job runtime in hours) = $1.19
- Total Amazon EC2 cost – (7 * $1.728 * 0.63) = ((number of core nodes + number of primary nodes) * c5d.9xlarge instance price * job runtime in hours) = $7.62
- Amazon EBS cost – ($0.1/730 * 20 GiB * 7 * 0.63) = (Amazon EBS per GB-hour rate * EBS size * number of instances * job runtime in hours) = $0.012
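As a quick check, this arithmetic can be reproduced with a short shell snippet. The rates are the US East (N. Virginia) on-demand prices quoted above; the `cost` helper is ours, not part of the benchmark tooling:

```shell
# Reproduce the cost arithmetic from the breakdowns above.
ec2_rate=1.728   # c5d.9xlarge on-demand price per hour (US East, N. Virginia)
emr_rate=0.27    # Amazon EMR surcharge for c5d.9xlarge per hour

# cost <hourly-rate> <instance-count> <runtime-hours>
cost() {
  awk -v r="$1" -v n="$2" -v h="$3" 'BEGIN { printf "%.2f", r * n * h }'
}

echo "OSS Spark EC2 cost: \$$(cost "$ec2_rate" 7 2.23)"   # $26.97
echo "EMR EC2 cost:       \$$(cost "$ec2_rate" 7 0.63)"   # $7.62
echo "EMR surcharge:      \$$(cost "$emr_rate" 7 0.63)"   # $1.19
```

The EBS line items follow the same shape with the per GB-hour rate ($0.1/730) in place of the instance rate.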
Set up OSS Spark benchmarking
In the following sections, we provide a brief outline of the steps involved in setting up the benchmarking. For detailed instructions with examples, refer to the GitHub repo.
For our OSS Spark benchmarking, we use the open-source tool Flintrock to launch our Amazon EC2-based Apache Spark cluster. Flintrock provides a quick way to launch an Apache Spark cluster on Amazon EC2 using the command line.
Prerequisites
Complete the following prerequisite steps:
- Have Python 3.7.x or above.
- Have Pip3 22.2.2 or above.
- Add the Python bin directory to your environment path. The Flintrock binary will be installed in this path.
- Run `aws configure` to configure your AWS Command Line Interface (AWS CLI) shell to point to the benchmarking account. Refer to Quick configuration with aws configure for instructions.
- Have a key pair with restrictive file permissions to access the OSS Spark primary node.
- Create a new S3 bucket in your test account if needed.
- Copy the TPC-DS source data as input to your S3 bucket.
- Build the benchmark application following the steps provided in Steps to build spark-benchmark-assembly application. Alternatively, you can download a pre-built spark-benchmark-assembly-3.3.0.jar if you want a Spark 3.3.0-based application.
Deploy the Spark cluster and run the benchmark job
Complete the following steps:
- Install the Flintrock tool via pip as shown in Steps to setup OSS Spark Benchmarking.
- Run the command `flintrock configure`, which pops up a default configuration file.
- Modify the default `config.yaml` file based on your needs. Alternatively, copy and paste the config.yaml file content into the default configuration file, then save the file where it was.
- Finally, launch the 7-node Spark cluster on Amazon EC2 via Flintrock.
This should create a Spark cluster with one primary node and six worker nodes. If you see any error messages, double-check the config file values, especially the Spark and Hadoop versions and the attributes of download-source and the AMI.
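The launch flow above can be sketched as follows. The cluster name is hypothetical, and the commands are echoed rather than executed (unset DRY_RUN to run them for real):

```shell
# Sketch of the Flintrock launch flow; the cluster name is hypothetical.
run() { echo "+ $*"; [ -n "${DRY_RUN:-}" ] || "$@"; }
DRY_RUN=1   # print the commands instead of executing them

run pip3 install flintrock        # installs the flintrock binary into Python's bin dir
run flintrock configure           # opens the default config.yaml for editing
# ...edit config.yaml: Spark/Hadoop versions, AMI, key pair, number of worker nodes...
run flintrock launch tpcds-oss-cluster
```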
The OSS Spark cluster doesn't come with the YARN resource manager. To enable it, we need to configure the cluster.
- Download the yarn-site.xml and enable-yarn.sh files from the GitHub repo.
- Replace <private ip of primary node> with the IP address of the primary node in your Flintrock cluster. You can retrieve the IP address from the Amazon EC2 console.
- Upload the files to all the nodes of the Spark cluster.
- Run the enable-yarn script.
- Enable Snappy support in Hadoop (the benchmark job reads Snappy compressed data).
- Download the benchmark application JAR file spark-benchmark-assembly-3.3.0.jar to your local machine.
- Copy this file to the cluster.
- Log in to the primary node and start YARN.
- Submit the benchmark job on the open-source Spark cluster as shown in Submit the benchmark job.
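Using Flintrock's file-copy and remote-command helpers, the YARN setup steps might look like the following sketch. The primary-node IP and cluster name are placeholders, and the commands are echoed rather than executed (unset DRY_RUN to run them):

```shell
run() { echo "+ $*"; [ -n "${DRY_RUN:-}" ] || "$@"; }
DRY_RUN=1                    # print the commands instead of executing them
CLUSTER=tpcds-oss-cluster    # hypothetical Flintrock cluster name
PRIMARY_IP=10.0.0.1          # replace with your primary node's private IP

# Point yarn-site.xml at the primary node, then push both files to every node.
run sed -i "s/<private ip of primary node>/${PRIMARY_IP}/" yarn-site.xml
run flintrock copy-file "$CLUSTER" yarn-site.xml /tmp/yarn-site.xml
run flintrock copy-file "$CLUSTER" enable-yarn.sh /tmp/enable-yarn.sh
run flintrock run-command "$CLUSTER" 'bash /tmp/enable-yarn.sh'

# Ship the benchmark JAR to the cluster, then log in to start YARN and submit.
run flintrock copy-file "$CLUSTER" spark-benchmark-assembly-3.3.0.jar /tmp/spark-benchmark-assembly-3.3.0.jar
run flintrock login "$CLUSTER"
```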
Summarize the results
Download the test result file from the output S3 bucket s3://$YOUR_S3_BUCKET/EC2_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv. (Replace $YOUR_S3_BUCKET with your S3 bucket name.) You can use the Amazon S3 console to navigate to the output S3 location, or use the AWS CLI.
The Spark benchmark application creates a timestamp folder and writes a summary file inside a summary.csv prefix. Your timestamp and file name will be different from the one shown in the preceding example.
The output CSV files have four columns without header names. They are:
- Query name
- Median time
- Minimum time
- Maximum time
The following screenshot shows a sample output. We have manually added column names. The way we calculate the geomean and the total job runtime is based on arithmetic means. We first take the mean of the med, min, and max values using the formula AVERAGE(B2:D2). Then we take a geometric mean of the Avg column using the formula GEOMEAN(E2:E105).
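The same aggregation can be done outside a spreadsheet, for example with awk over the headerless summary CSV. The two sample rows below are illustrative, not real benchmark numbers:

```shell
# Per-query average of median/min/max, then the geometric mean of those averages.
cat > summary.csv <<'EOF'
q1,10.0,8.0,12.0
q2,40.0,30.0,50.0
EOF

awk -F, '{
  avg = ($2 + $3 + $4) / 3                  # arithmetic mean, like AVERAGE(B2:D2)
  sumlog += log(avg); n++
  printf "%s avg=%.2f\n", $1, avg
} END {
  printf "geomean=%.2f\n", exp(sumlog / n)  # like GEOMEAN over the Avg column
}' summary.csv
```

For the sample rows, the averages are 10.00 and 40.00, giving a geometric mean of 20.00.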

Set up Amazon EMR benchmarking
For detailed instructions, see Steps to setup EMR Benchmarking.
Prerequisites
Complete the following prerequisite steps:
- Run `aws configure` to configure your AWS CLI shell to point to the benchmarking account. Refer to Quick configuration with aws configure for instructions.
- Upload the benchmark application to Amazon S3.
Deploy the EMR cluster and run the benchmark job
Complete the following steps:
- Spin up Amazon EMR in your AWS CLI shell using the command line as shown in Deploy EMR Cluster and run benchmark job.
- Configure Amazon EMR with one primary (c5d.9xlarge) and six core (c5d.9xlarge) nodes. Refer to create-cluster for a detailed description of the AWS CLI options.
- Store the cluster ID from the response. You need this in the next step.
- Submit the benchmark job in Amazon EMR using add-steps in the AWS CLI.
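These steps might look like the sketch below. The cluster name, cluster ID, and step arguments are placeholders (the full argument list is in the GitHub repo), and the commands are echoed rather than executed (unset DRY_RUN to run them):

```shell
run() { echo "+ $*"; [ -n "${DRY_RUN:-}" ] || "$@"; }
DRY_RUN=1   # print the commands instead of executing them

# Launch EMR 6.9.0 with 1 primary + 6 core c5d.9xlarge nodes (7 instances total).
run aws emr create-cluster \
  --name tpcds-emr-benchmark \
  --release-label emr-6.9.0 \
  --applications Name=Spark \
  --instance-type c5d.9xlarge \
  --instance-count 7 \
  --use-default-roles

# Submit the benchmark as a Spark step, using the cluster ID from the response.
run aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=TPCDS-Benchmark,ActionOnFailure=CONTINUE,Args=[...]'
```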
Summarize the results
Summarize the results from the output bucket s3://$YOUR_S3_BUCKET/blog/EMRONEC2_TPCDS-TEST-3T-RESULT in the same manner as we did for the OSS results, and compare.
Clean up
To avoid incurring future charges, delete the resources you created using the instructions in the Cleanup section of the GitHub repo.
- Stop the EMR and OSS Spark clusters. You may also delete them if you don't want to retain the content. You can delete these resources by running the script cleanup-benchmark-env.sh from a terminal in your benchmark environment.
- If you used AWS Cloud9 as your IDE for building the benchmark application JAR file using Steps to build spark-benchmark-assembly application, you may want to delete the environment as well.
Conclusion
You can run your Apache Spark workloads 3.5 times (based on total runtime) faster and at lower cost without making any changes to your applications by using Amazon EMR 6.9.0.
To keep up to date, subscribe to the Big Data Blog's RSS feed to learn more about the EMR runtime for Apache Spark, configuration best practices, and tuning advice.
For past benchmark tests, see Run Apache Spark 3.0 workloads 1.7 times faster with Amazon EMR runtime for Apache Spark. Note that the past benchmark result of 1.7 times performance was based on geometric mean. Based on geometric mean, the performance in Amazon EMR 6.9 was two times faster.
About the authors
Sekar Srinivasan is a Sr. Specialist Solutions Architect at AWS focused on Big Data and Analytics. Sekar has over 20 years of experience working with data. He is passionate about helping customers build scalable solutions, modernizing their architecture, and generating insights from their data. In his spare time he likes to work on non-profit projects, especially those focused on underprivileged children's education.
Prabu Ravichandran is a Senior Data Architect with Amazon Web Services, focused on Analytics, data lake architecture, and implementation. He helps customers architect and build scalable and robust solutions using AWS services. In his free time, Prabu enjoys traveling and spending time with family.
