In this article, I want to share a little about how we use Spot by NetApp. Spot has many cloud products, but I will look only at Elastigroup, Spot’s product for managing EC2 instances, similar to the AWS AutoScalingGroup (ASG).
NOTE: Whenever I write “Spot” with a capital initial, I mean the company Spot by NetApp; whenever I write “spot” in lowercase, I mean the AWS spot instance type.
At Convenia, we already used spot instances for any service that was not critical and could tolerate interruptions. In the month of Black Friday, we ran exclusively on “On Demand” instances to avoid massive interruptions if spot capacity ran out. This was the result:
The graph above shows a cost spike in December caused by “On Demand” instances. After analyzing the cost, we held a post-mortem meeting and decided to evaluate Spot’s Elastigroup, which promises a fallback to “On Demand” if the spot supply runs out.
To start with the conclusion: we liked the result and moved all our instances to be managed by Elastigroup.
How was the integration?
Whenever we need to test anything at Convenia, we run a POC with an internal service to understand the pros and cons, and that is what we did here.
The first step of the integration is to link the Spot account with your AWS account, following the step-by-step guide in Spot’s documentation. Although it is well explained, this process involves several steps. If you have any doubts about the implementation or about what Spot can offer, I suggest a conversation with the people at RealCloud, who helped us understand Spot’s products and how to make this migration.
After linking the AWS account with Spot, we followed a wizard that imports the AutoScalingGroup settings from AWS into an Elastigroup within Spot. That was all it took to have the instances managed by Spot. After creating the Elastigroup, we could deactivate our AutoScalingGroup (setting the desired, minimum, and maximum capacity to 0) or even remove it.
In our case, migrating the first service took about an hour. We kept monitoring to see how Elastigroup would swap instances to increase savings, and we ran some additional tests, forcing instance refreshes and interruptions, to understand how it handles them. In our experience there were no surprises: it loses nothing to the AWS ASG and has some extra features, which I will talk about later.
Another point worth mentioning about the integration is that RealCloud accompanied us throughout the entire process, and billing only started after all services had been integrated.
What are the advantages of adopting Spot?
The obvious answer to this question would be cost, but I’ll start with things that have nothing to do with it; we’ll analyze cost more deeply in the next topic. So here are the other advantages:
Fallback to “On Demand”: In my opinion, this is one of the main drivers for adoption. It’s difficult to predict spot supply reliably. AWS gives a two-minute warning before an interruption, so you can build automation to deal with it, but I don’t particularly like handling that by hand. What we ended up doing was keeping the most critical instances on “On Demand”. Being able to trust that Elastigroup will fall back correctly lets you run even the most critical instances on spot, and it spares you the madness of moving every instance back to “On Demand” on the eve of a Black Friday, which caused the cost spike shown at the beginning.
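For reference, handling the notice by hand could look roughly like the sketch below. It assumes the EC2 instance metadata service with IMDSv2 and polls the `spot/instance-action` path, which answers 404 until an interruption is scheduled. The `drain` callback and the 5-second polling interval are hypothetical placeholders, not something we run at Convenia.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body):
    """Return (action, time) from the JSON body of spot/instance-action."""
    data = json.loads(body)
    return data["action"], data["time"]


def fetch_instance_action(timeout=2):
    """Return the raw notice body, or None while no interruption is scheduled."""
    # IMDSv2: request a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    action_req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice yet
            return None
        raise


def watch(drain, interval=5):
    """Poll until a notice appears, then call the (hypothetical) drain hook."""
    while True:
        body = fetch_instance_action()
        if body is not None:
            action, when = parse_instance_action(body)
            drain(action, when)
            return
        time.sleep(interval)
```

Even in this small form, every instance needs the watcher running plus a drain routine that finishes within two minutes, which is the kind of upkeep a managed fallback takes off your hands.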
Below are the Elastigroup logs showing spot instances being replaced by “On Demand” ones:
Scheduled scaling: Elastigroup lets us define schedules in cron format to start or stop machines at certain times. At Convenia, we keep at least two instances running for the main services during the day. Since the system is aimed mainly at the Brazilian public, traffic drops a lot during the night, so we can configure a schedule to turn off the redundant instances in the early hours, which saves even more.
Simplicity in the organization of resources: To keep our code review practices and a minimum standardization of services, we use Terraform. Generally, when creating autoscaling within AWS, we end up creating a series of separate resources, such as scaling policies and ASG attachments, among others. When we added the Elastigroup resource to our Terraform code, we noticed a good simplification: the Elastigroup is a single resource that gathers all these functionalities, so in the end we exchanged several Terraform resources for just one.
Below is a real example of an Elastigroup created by Terraform at Convenia:
Charged on savings: Perhaps this is one of the main selling points of Spot’s services. Spot’s billing model is based on your savings, so you never pay for something that brings no return. If your bill doesn’t decrease, you pay nothing.
These were some of the advantages we noticed in this migration, but I confess that Convenia does not use all the power of Spot. There is still the possibility of running stateful instances on spot, among other things. Whatever your case, I find it hard to believe that Spot has no solution that can bring significant savings, and the larger your system (in number of machines), the more this adoption makes sense.
But what about the Cost?
The main motivator for Spot adoption is cost reduction. Convenia has a relatively small infrastructure, and to quantify this saving properly we need to pay attention to some details.
As shown in the billing chart above, which contains only “On Demand” and spot costs, the cost in September 2022 and in April 2023 stays at more or less the same level. That is good, because our infrastructure grew a little over these months; moving the critical instances that previously were not on spot over to spot is what made this result possible.
Spot charges for the saving it delivers: our Spot bill hovers around 130 US dollars. When we add this cost to the monthly amount shown in the AWS billing, the April 2023 bill slightly exceeds the September 2022 bill. This is due to the growth of our infrastructure over these months, but even without that growth, I do not believe the savings would have been significant enough to justify adopting Spot on cost alone. It is important to note that the larger your infrastructure, the stronger the case for adoption. Perhaps Convenia did not see such significant results because of the size of its infrastructure and because it already used spot instances for some services. Of course, the savings could be even greater if we moved our stateful instances to spot.
Our total discount from spot instances was around 65%. Spot’s fee is 28% of that saving, but what nobody mentions is that a Brazilian company using Spot also pays taxes on top of the fee, which can push it close to 40% of that saving.
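To make the arithmetic concrete, here is a small sketch that applies the percentages above to a hypothetical on-demand baseline of 1,000 (in any currency). It assumes the fee is charged on the saving and reads the tax effect as raising the effective fee to roughly 40% of the saving; the function name and the baseline are illustrative only.

```python
def net_saving(on_demand_cost, discount, fee_rate):
    """Split the gross saving into the Spot fee and what the customer keeps."""
    gross = on_demand_cost * discount   # saved by running on spot instances
    fee = gross * fee_rate              # Spot charges a share of that saving
    return gross, fee, gross - fee


baseline = 1000.0  # hypothetical on-demand cost

for label, fee_rate in [("list fee (28%)", 0.28), ("fee with taxes (~40%)", 0.40)]:
    gross, fee, kept = net_saving(baseline, 0.65, fee_rate)
    print(f"{label}: saving {gross:.0f}, fee {fee:.0f}, kept {kept:.0f} "
          f"({kept / baseline:.1%} of the on-demand cost)")
```

Under these assumptions, the 65% discount shrinks to a net of roughly 47% of the on-demand cost at the 28% fee, and to about 39% once taxes bring the fee close to 40%.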
Another important detail that is often overlooked is the cost of labor. Even though the integration is simple, moving ASGs to Elastigroup en masse requires at least some validation, plus follow-up to check that it is working as expected and reaching the predicted savings target. In the end, the team will spend a few days on this integration, and labor is expensive!
In the end, weighing the advantages and disadvantages, I consider using Spot a positive. If I had to judge only the reduction in computing costs, it would not make so much sense, but having an automatic, simple, and reliable fallback to “On Demand” is the biggest advantage: we won’t see cost spikes in Black Friday seasons, which alone would offset the price paid for the Spot service.
This cost conclusion is very particular to Convenia, which has a relatively small infrastructure and was already making good use of spot instances. As a company grows, I believe the savings become increasingly relevant. At the 2022 AWS Summit, Nubank showed a little of their case of using Spot; at their volume, running on spot can bring an absurd amount of savings. For those who want to see this case closely, check it out here on YouTube.
*The content of this article is the author’s responsibility and does not necessarily reflect the opinion of iMasters.