Run fault-tolerant and cost-optimized Spark clusters using Amazon EMR on EKS and Amazon EC2 Spot Instances

Amazon EMR on EKS is a deployment option in Amazon EMR that allows you to run Spark jobs on Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances save you up to 90% over On-Demand Instances, and are a great way to cost optimize the Spark workloads running on Amazon EMR on EKS. Because Spot is an interruptible service, if we can move or reuse the intermediate shuffle files, it improves the overall stability and SLA of the job. The latest versions of Amazon EMR on EKS have integrated Spark features to enable this capability.

In this post, we discuss these features, node decommissioning and Persistent Volume Claim (PVC) reuse, and their impact on increasing the fault tolerance of Spark jobs on Amazon EMR on EKS when cost optimizing with EC2 Spot Instances.

Amazon EMR on EKS and Spot

EC2 Spot Instances are spare EC2 capacity offered at a steep discount of up to 90% over On-Demand prices. Spot Instances are a great choice for stateless and flexible workloads. The caveat with this discount and spare capacity is that Amazon EC2 can interrupt an instance with a proactive or reactive (2-minute) warning when it needs the capacity back. You can provision compute capacity in an EKS cluster using Spot Instances with a managed or self-managed node group, providing cost optimization for your workloads.

Amazon EMR on EKS uses Amazon EKS to run jobs with the EMR runtime for Apache Spark, which can be cost optimized by running the Spark executors on Spot. It provides up to 61% lower costs and up to 68% performance improvement for Spark workloads on Amazon EKS. The Spark application launches a driver and executors to run the computation. Spark is a semi-fault-tolerant framework that is resilient to executor loss due to an interruption and therefore can run on EC2 Spot. However, when the driver is interrupted, the job fails. Hence, we recommend running drivers on On-Demand Instances. Some of the best practices for running Spark on Amazon EKS are applicable to Amazon EMR on EKS.
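One common way to keep the driver on On-Demand capacity while executors run on Spot is to use Kubernetes node selectors in the Spark configuration. The following sketch assumes EKS managed node groups, which label nodes with `eks.amazonaws.com/capacityType`; the property names are standard Spark on Kubernetes settings, but verify them against your Spark version (driver- and executor-specific node selectors require Spark 3.3+).

```python
# Illustrative sketch: pin the Spark driver to On-Demand nodes and let
# executors run on Spot, using Kubernetes node-selector properties.
# Assumes EKS managed node groups, which label nodes with
# eks.amazonaws.com/capacityType = ON_DEMAND | SPOT.

def placement_conf():
    """Build Spark conf entries for driver/executor node placement."""
    return {
        # The driver must survive the whole job, so keep it on On-Demand.
        "spark.kubernetes.driver.node.selector.eks.amazonaws.com/capacityType": "ON_DEMAND",
        # Executors are resilient to loss, so Spot is acceptable.
        "spark.kubernetes.executor.node.selector.eks.amazonaws.com/capacityType": "SPOT",
    }

if __name__ == "__main__":
    for key, value in placement_conf().items():
        print(f"--conf {key}={value}")
```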

EC2 Spot Instances also help with cost optimization by improving the overall throughput of the job. This can be achieved by auto scaling the cluster using Cluster Autoscaler (for managed node groups) or Karpenter.

Though Spark executors are resilient to Spot interruptions, the shuffle files and RDD data are lost when the executor gets killed. The lost shuffle files need to be recomputed, which increases the overall runtime of the job. Apache Spark has released two features (in versions 3.1 and 3.2) that address this issue. Amazon EMR on EKS released features such as node decommissioning (version 6.3) and PVC reuse (version 6.8) to simplify recovery and reuse of shuffle files, which increases the overall resiliency of your application.

Node decommissioning

The node decommissioning feature works by preventing the scheduling of new jobs on the nodes that are to be decommissioned. It also moves any shuffle files or cache present on those nodes to other executors (peers). If there are no other available executors, the shuffle files and cache are moved to a remote fallback storage.

Fig 1: Node Decommissioning

Let's look at the decommission steps in more detail.

If one of the nodes that is running executors is interrupted, the executor starts the process of decommissioning and sends the message to the driver:

21/05/05 17:41:41 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 7 decommissioned message
21/05/05 17:41:41 DEBUG TaskSetManager: Valid locality levels for TaskSet 2.0: NO_PREF, ANY
21/05/05 17:41:41 INFO KubernetesClusterSchedulerBackend: Decommission executors: 7
21/05/05 17:41:41 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_2.0, runningTasks: 10
21/05/05 17:41:41 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(7, 192.168.82.107, 39007, None)) as being decommissioning.

21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
21/05/05 20:22:17 INFO BlockManager: Starting block manager decommissioning process...
21/05/05 20:22:17 DEBUG FileSystem: Looking for FS supporting s3a

The executor looks for RDD or shuffle files and tries to replicate or migrate those files. It first tries to find a peer executor. If successful, it moves the files to the peer executor:

22/06/07 20:41:38 INFO ShuffleStatus: Updating map output for 46 to BlockManagerId(4, 192.168.13.235, 34737, None)
22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle data block update for 0 46, ignore.
22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 46, updating.

However, if it's not able to find a peer executor, it tries to move the files to a fallback storage, if available.

Fig 2: Fallback Storage

The executor is then decommissioned. When a new executor comes up, the shuffle files are reused:

22/06/07 20:42:50 INFO BasicExecutorFeatureStep: Adding decommission script to lifecycle
22/06/07 20:42:50 DEBUG ExecutorPodsAllocator: Requested executor with id 19 from Kubernetes.
22/06/07 20:42:50 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-bfd0a5813fd1b80f-exec-19, action ADDED
22/06/07 20:42:50 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 52, updating.
22/06/07 20:42:50 INFO ShuffleStatus: Recover 52 BlockManagerId(fallback, remote, 7337, None)

The key advantage of this process is that it migrates blocks and shuffle data, thereby reducing recomputation, which adds to the overall resiliency of the system and reduces runtime. This process can be triggered by a Spot interruption signal (SIGTERM) or by node draining. Node draining may happen due to high-priority task scheduling or independently.
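The migration decision described above can be summarized with a short sketch. This is illustrative only, not Spark's actual implementation: blocks on a decommissioning executor go to a peer if one exists, otherwise to the configured fallback storage, and are recomputed later only if neither destination is available.

```python
# Illustrative model of the decommission migration decision
# (not Spark's real code; block and peer names are placeholders).

def migrate_blocks(blocks, peers, fallback_path=None):
    """Return a mapping of each block to its migration destination."""
    destinations = {}
    for block in blocks:
        if peers:
            # Prefer a peer executor; round-robin keeps the sketch simple.
            destinations[block] = peers[len(destinations) % len(peers)]
        elif fallback_path:
            # No peers left: push to remote fallback storage (e.g., S3).
            destinations[block] = fallback_path
        else:
            # Nowhere to migrate; the block will have to be recomputed.
            destinations[block] = "recompute"
    return destinations

print(migrate_blocks(["shuffle_0_46"], [], "s3://my-bucket/fallback"))
```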

When you use Amazon EMR on EKS with managed node groups/Karpenter, the Spot interruption handling is automated: Amazon EKS gracefully drains and rebalances the Spot nodes to minimize application disruption when a Spot node is at elevated risk of interruption. If you're using managed node groups/Karpenter, the decommission gets triggered when the nodes are being drained, and because it's proactive, it gives you more time (at least 2 minutes) to move the files. In the case of self-managed node groups, we recommend installing the AWS Node Termination Handler to handle the interruption; the decommission is triggered when the reactive (2-minute) notification is received. We recommend using Karpenter with Spot Instances because it has faster node scheduling with early pod binding and bin packing to optimize resource utilization.

The following code enables this configuration; more details are available on GitHub:

"spark.decommission.enabled": "true"
"spark.storage.decommission.rddBlocks.enabled": "true"
"spark.storage.decommission.shuffleBlocks.enabled" : "true"
"spark.storage.decommission.enabled": "true"
"spark.storage.decommission.fallbackStorage.path": "s3://<<bucket>>"
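In practice, these properties are passed to the job through `sparkSubmitParameters` when starting an EMR on EKS job run. The following boto3 sketch only builds the request and does not call AWS; the virtual cluster ID, role ARN, bucket, and entry point are placeholders.

```python
# Hypothetical sketch: assemble an EMR on EKS StartJobRun request that
# enables node decommissioning. All identifiers below are placeholders.
decommission_conf = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.fallbackStorage.path": "s3://my-bucket/fallback/",
}

# Flatten the properties into spark-submit --conf flags.
submit_params = " ".join(f"--conf {k}={v}" for k, v in decommission_conf.items())

request = {
    "virtualClusterId": "<virtual-cluster-id>",
    "executionRoleArn": "<execution-role-arn>",
    "releaseLabel": "emr-6.9.0-latest",
    "jobDriver": {
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/scripts/word_count.py",
            "sparkSubmitParameters": submit_params,
        }
    },
}

print(submit_params)
# To actually submit, something like:
#   boto3.client("emr-containers").start_job_run(**request)
```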

PVC reuse

Apache Spark enabled dynamic PVC in version 3.1, which is useful with dynamic allocation because we don't have to pre-create the claims or volumes for the executors and delete them after completion. PVC enables true decoupling of data and processing when we're running Spark jobs on Kubernetes, because we can use it as local storage to spill in-process files too. The latest version of Amazon EMR, 6.8, has integrated the PVC reuse feature of Spark, whereby if an executor is terminated due to an EC2 Spot interruption or any other reason (JVM), the PVC is not deleted but persisted and reattached to another executor. If there are shuffle files in that volume, they are reused.

As with node decommissioning, this reduces the overall runtime because we don't have to recompute the shuffle files. We also save the time required to request a new volume for an executor, and shuffle files can be reused without moving the files around.

The next diagram illustrates this workflow.

Fig 3: PVC Reuse

Let's look at the steps in more detail.

If one or more of the nodes that are running executors is interrupted, the underlying pods get terminated and the driver gets the update. Note that the driver is the owner of the PVCs of the executors, and they are not terminated. See the following code:

22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action DELETED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action MODIFIED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action DELETED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action MODIFIED

The ExecutorPodsAllocator tries to allocate new executor pods to replace the ones terminated due to the interruption. During the allocation, it figures out how many of the existing PVCs have files and can be reused:

22/06/15 23:25:23 INFO ExecutorPodsAllocator: Found 2 reusable PVCs from 10 PVCs

The ExecutorPodsAllocator requests a pod, and when it launches it, the PVC is reused. In the following example, the PVC from executor 6 is reused for the new executor pod 11:

22/06/15 23:25:23 DEBUG ExecutorPodsAllocator: Requested executor with id 11 from Kubernetes.
22/06/15 23:25:24 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-11, action ADDED
22/06/15 23:25:24 INFO KubernetesClientUtils: Spark configuration files loaded from Some(/usr/lib/spark/conf) : log4j.properties,spark-env.sh,hive-site.xml,metrics.properties
22/06/15 23:25:24 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/06/15 23:25:24 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-11, action MODIFIED
22/06/15 23:25:24 INFO ExecutorPodsAllocator: Reuse PersistentVolumeClaim amazon-reviews-word-count-9ee82b8169a75183-exec-6-pvc-0

The shuffle files, if present in the PVC, are reused.
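The allocator's behavior can be approximated with a short sketch (illustrative, not the actual ExecutorPodsAllocator code): when replacement executors are requested, any orphaned PVC that still holds shuffle files is reattached before fresh volumes are created.

```python
# Illustrative model of PVC reuse during executor replacement
# (not the actual ExecutorPodsAllocator implementation).

def assign_pvcs(new_executors, orphaned_pvcs):
    """Map each replacement executor to a reusable PVC or a fresh one."""
    reusable = [p["name"] for p in orphaned_pvcs if p["has_shuffle_files"]]
    print(f"Found {len(reusable)} reusable PVCs from {len(orphaned_pvcs)} PVCs")
    assignments = {}
    for executor in new_executors:
        if reusable:
            # Reattach an existing claim; its shuffle files are kept.
            assignments[executor] = reusable.pop(0)
        else:
            # No reusable claim left: provision a fresh PVC.
            assignments[executor] = f"{executor}-pvc-0"
    return assignments
```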

The key advantage of this technique is that it allows us to reuse pre-computed shuffle files in their original location, thereby reducing the overall job runtime.

This works for both static and dynamic PVCs. Amazon EKS offers three different storage options, which can be encrypted too: Amazon Elastic Block Store (Amazon EBS), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. We recommend using dynamic PVCs with Amazon EBS because with static PVCs, you would need to create multiple PVCs.

The following code enables this configuration; more details are available on GitHub:

"spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
"spark.kubernetes.driver.reusePersistentVolumeClaim": "true"

For this to work, we need to enable PVC with Amazon EKS and mention the details in the Spark runtime configuration. For instructions, refer to How do I use persistent storage in Amazon EKS? The following code contains the Spark configuration details for using PVC as local storage; other details are available on GitHub:

"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "spark-sc"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "10Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/var/data/spill"
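Because these property names are long and easy to mistype, they can be generated programmatically. A small helper sketch (hypothetical; the volume name, storage class, and mount path are the placeholders used above):

```python
# Hypothetical helper to build the executor PVC properties for one volume.

def pvc_local_storage_conf(volume_name="spark-local-dir-1",
                           storage_class="spark-sc",
                           size="10Gi",
                           mount_path="/var/data/spill"):
    """Return the Spark conf entries that mount a dynamic PVC as local storage."""
    prefix = f"spark.kubernetes.executor.volumes.persistentVolumeClaim.{volume_name}"
    return {
        f"{prefix}.mount.readOnly": "false",
        # "OnDemand" tells Spark to create a claim per executor dynamically.
        f"{prefix}.options.claimName": "OnDemand",
        f"{prefix}.options.storageClass": storage_class,
        f"{prefix}.options.sizeLimit": size,
        f"{prefix}.mount.path": mount_path,
    }

for key, value in pvc_local_storage_conf().items():
    print(f'"{key}": "{value}"')
```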

Conclusion

With Amazon EMR on EKS (6.9) and the features discussed in this post, you can further reduce the overall runtime for Spark jobs when running with Spot Instances. This also improves the overall resiliency and flexibility of the job while cost optimizing the workload on EC2 Spot.

Try out the EMR on EKS workshop for improved performance when running Spark workloads on Kubernetes, and cost optimize using EC2 Spot Instances.


About the Author

Kinnar Kumar Sen is a Sr. Solutions Architect at Amazon Web Services (AWS) focusing on Flexible Compute. As a part of the EC2 Flexible Compute team, he works with customers to guide them to the most elastic and efficient compute options that are suitable for their workload running on AWS. Kinnar has more than 15 years of industry experience working in research, consultancy, engineering, and architecture.
