Minimize Cloud Hosting Costs with BoxOps and EKS Spot Instance Workers
Elastic Kubernetes Service Using Spot Instances
Many organizations migrate to the cloud to reduce their hardware and IT management costs. By design, cloud provider data centers are not fully utilized, because providers add capacity faster than customer demand grows. Providers would still like to earn revenue from these extra compute resources, so they sell the spare capacity at a reduced cost until someone willing to pay the normal rate needs it. On AWS, these are called “Spot Instances”.
Utilizing AWS Spot Instances with your applications can drop your hosting costs by as much as 90%.
This is a great way to reduce your monthly EC2 billing costs without locking yourself into Reserved Instances, especially when you consider the dynamic nature of orchestration platforms such as Elastic Kubernetes Service (EKS). We recently added support to BoxOps to help our managed services customers reduce compute costs by as much as 90%, by taking advantage of EC2 Spot Instances.
What exactly are Spot Instances?
Spot Instances are extra compute resources that Amazon sells at a big discount. According to Amazon:
You can use Spot Instances for various stateless, fault-tolerant, or flexible applications such as big data, containerized workloads, CI/CD, web servers, high-performance computing (HPC), and other test & development workloads.
However, Spot Instances are subject to availability and can be interrupted by AWS at any time with a two-minute warning. So how do you make effective use of them?
Amazon doesn't make it super easy for you to employ Spot for your EKS worker nodes, but fortunately, all of the necessary building blocks are there. Let's assume for this discussion that you are deploying your EKS worker nodes in an Auto Scaling Group, as this is the key starting point for follow-on discussion.
In general, the takeaway is that Spot Instances work great for containerized workloads that are stateless and fault-tolerant. This is a great fit for most applications deployed on Kubernetes.
Enter MixedInstancesPolicy for your ASG
To run EC2 Spot worker nodes, configure a `MixedInstancesPolicy` on the ASG so that it can draw from several instance types and maximize your usage of Spot Instances. The strategy that BoxOps employs is to select similarly sized instances based on CPU. For example, a `large` `MixedInstancesPolicy` would consist of:
- m5.large
- m5a.large
- m4.large
- r5.large
- r5a.large
- r4.large
- i3.large
- i3en.large
All of these instance types have 2 vCPUs and at least 8 GB of RAM. Some of them have more memory, but that won't be a problem when using a Spot pool made up of any combination of those instances. It is worth noting that you could create your Spot pool based on memory requirements rather than CPU, which may make sense for some containerized workloads.
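The selection rule described above (same vCPU count, at least a minimum amount of memory) can be sketched as a small filter. This is only an illustration, not how BoxOps implements it: the `InstanceSpec` type, the `CATALOG` list, and the `spot_pool` helper are hypothetical names, and the two extra entries (`c5.large`, `m5.xlarge`) are included only to show what gets excluded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    name: str
    vcpus: int
    memory_gib: float


# Sizes for the eight types listed above, plus two counter-examples.
CATALOG = [
    InstanceSpec("m5.large", 2, 8),
    InstanceSpec("m5a.large", 2, 8),
    InstanceSpec("m4.large", 2, 8),
    InstanceSpec("r5.large", 2, 16),
    InstanceSpec("r5a.large", 2, 16),
    InstanceSpec("r4.large", 2, 15.25),
    InstanceSpec("i3.large", 2, 15.25),
    InstanceSpec("i3en.large", 2, 16),
    InstanceSpec("c5.large", 2, 4),    # same CPU size, too little memory
    InstanceSpec("m5.xlarge", 4, 16),  # enough memory, different CPU size
]


def spot_pool(catalog, vcpus, min_memory_gib):
    """Return the names of instance types that match the CPU size exactly
    and meet the memory floor (hypothetical helper for illustration)."""
    return [
        spec.name
        for spec in catalog
        if spec.vcpus == vcpus and spec.memory_gib >= min_memory_gib
    ]


LARGE_POOL = spot_pool(CATALOG, vcpus=2, min_memory_gib=8)
```

To build the pool around memory instead, you would match on `memory_gib` and treat `vcpus` as the floor.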
When creating your `MixedInstancesPolicy`, you can configure how instances are distributed between On-Demand and Spot. If you use the example above with 8 acceptable types and set `SpotInstancePools` to 8 (a setting that applies to the `lowest-price` Spot allocation strategy), the ASG will diversify your Spot capacity across the 8 cheapest instance types available from the pool.
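As a rough sketch, the policy described above could be assembled in Python as the dictionary that the Auto Scaling API expects for its `MixedInstancesPolicy` parameter. The helper name `build_mixed_instances_policy`, the launch template name `eks-workers`, and the sanity check on the pool count are assumptions made for illustration; the field names follow the Auto Scaling API.

```python
def build_mixed_instances_policy(
    launch_template_name,
    instance_types,
    spot_pools,
    on_demand_base=0,
    on_demand_percentage=0,
):
    """Build a MixedInstancesPolicy dict (hypothetical helper).

    on_demand_percentage=0 means every instance above the On-Demand base
    capacity is requested as Spot, i.e. a 100% Spot distribution.
    """
    # Our own sanity check: asking for more pools than there are
    # instance types in the overrides list makes no sense.
    if not 1 <= spot_pools <= len(instance_types):
        raise ValueError("spot_pools must be between 1 and len(instance_types)")

    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per acceptable instance type.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_percentage,
            "SpotAllocationStrategy": "lowest-price",
            "SpotInstancePools": spot_pools,
        },
    }


LARGE_TYPES = [
    "m5.large", "m5a.large", "m4.large", "r5.large",
    "r5a.large", "r4.large", "i3.large", "i3en.large",
]

# "eks-workers" is a placeholder launch template name.
policy = build_mixed_instances_policy("eks-workers", LARGE_TYPES, spot_pools=8)
```

With boto3 installed and credentials configured, a dictionary like this can be passed as the `MixedInstancesPolicy` argument of the Auto Scaling client's `create_auto_scaling_group` call, alongside the usual group name, size, and subnet settings.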
So let's assume you have configured your Auto Scaling Group (ASG) using the above set of instances with a distribution of 100% Spot. You're finished, right? Not exactly. What happens when one of your workers is terminated by the EC2 service? Your ASG will detect the instance has been terminated and replace it with another Spot instance, but nothing will be done gracefully without further intervention.
Fortunately, someone has done the heavy lifting for you and created a Helm chart for that: k8s-spot-termination-handler. This application runs as a `DaemonSet` on your cluster and drains a node once EC2 has marked it for termination. The termination handler simply polls the EC2 metadata API, checks whether the instance is scheduled to be terminated, then cordons and drains the node. You can also have the handler notify the ASG to launch a replacement immediately so that a new instance comes online while the reclaimed one is being drained and terminated.
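The poll-then-drain loop can be illustrated with a short Python sketch. This is not the Helm chart's actual code. It assumes the instance metadata service is reachable without a session token (IMDSv1); an instance that enforces IMDSv2 would need a token request first. The function names are hypothetical, and the drain step only builds a `kubectl drain` command rather than running it.

```python
import json
import urllib.error
import urllib.request

# Instance metadata path that returns 404 until an interruption is scheduled.
INSTANCE_ACTION_URL = (
    "http://169.254.169.254/latest/meta-data/spot/instance-action"
)


def fetch_instance_action(url=INSTANCE_ACTION_URL, timeout=2.0):
    """Return the response body, or None when no interruption is pending."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def pending_termination(body):
    """Return the scheduled action time from a notice body, else None."""
    if body is None:
        return None
    notice = json.loads(body)
    if notice.get("action") in ("terminate", "stop", "hibernate"):
        return notice.get("time")
    return None


def drain_command(node_name):
    """Build the drain command; kubectl drain cordons the node first."""
    return ["kubectl", "drain", node_name, "--ignore-daemonsets", "--force"]


def check_once(node_name, fetch=fetch_instance_action):
    """One polling step: return a drain command if the node is marked."""
    when = pending_termination(fetch())
    return drain_command(node_name) if when else None
```

A real handler would call `check_once` every few seconds from each node's pod and execute the returned command (for example with `subprocess.run`), keeping in mind that everything has to finish inside the two-minute warning window.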
For further cluster durability (with or without Spot), BoxOps also uses cluster-autoscaler and incorporates a Lambda function based on amazon-k8s-node-drainer to gracefully drain nodes during rolling updates or scale-in events.
We would love to help you incorporate these features or get you started with BoxOps! Contact BoxBoat today to get started.