
FinOps: Start from Scratch

October 2, 2025

Back in 2023, I remember my boss, Tjiu, asking me to reduce the cost of our daily infrastructure. In the booming startup era, nobody cared about cost, especially infrastructure cost. We just focused on growth, and sometimes that meant deploying something without fully understanding the resources required. On top of that, people tended to think: as long as my system works fine and latency is good, everything else is not my problem. I would say that's where the problem began.

So, what exactly is FinOps?

FinOps is an operational framework and cultural practice which maximizes the business value of cloud and technology, enables timely data-driven decision making, and creates financial accountability through collaboration between engineering, finance, and business teams – FinOps Foundation

By the way, I only learned the term FinOps after working on cost optimisation for almost six months, lol. So if you're doing cost optimisation, then more or less, you're doing FinOps. To be fair, the company doesn't care about the term; they only care that we get it done, no matter what method we use, as long as the company's goals are achieved. However, if you want to share the experience, it's better to know the term so people understand you better.

Back to my story: when I received the project, I didn't know how to start, but I knew one thing: we had too many microservices, and some of them were not well designed. I won't go into too much detail about microservice design, but I want to stress that if you want to do cost optimisation, one thing you need to do is inventory your services. You should have a sheet that is maintained day by day, month by month, year by year. Some of us may think this is overkill; however, when starting cost optimisation, you must first understand how many resources you use on a daily basis. Once you have inventoried all of the services, you can start asking: do I really need this many services to keep the business running?
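To make this concrete, here is a minimal sketch of what one row of such an inventory could carry. The field names and the two entries are my own assumptions for illustration, not our actual sheet:

from dataclasses import dataclass, asdict
import csv

@dataclass
class ServiceRecord:
    name: str            # service identifier
    owner: str           # owning team, so someone is accountable
    platform: str        # "vm" or "kubernetes"
    spec: str            # machine spec, e.g. "custom-1-2048" (1 core, 2 GB RAM)
    min_replicas: int    # minimum VMs or pods
    max_replicas: int    # maximum VMs or pods
    cpu_threshold: int   # autoscale trigger, in percent
    avg_cpu_util: float  # observed daily CPU utilisation, in percent

# Two hypothetical entries; underutilised services stand out immediately.
inventory = [
    ServiceRecord("order-api", "order-team", "kubernetes", "custom-1-2048", 2, 10, 30, 4.2),
    ServiceRecord("voucher-api", "promo-team", "vm", "custom-1-2048", 2, 6, 30, 1.8),
]

with open("inventory.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(asdict(inventory[0]).keys()))
    writer.writeheader()
    writer.writerows(asdict(r) for r in inventory)

Even a sheet this simple is enough to sort by avg_cpu_util and see which services deserve attention first.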

When I started filling in the sheet, I learned that we had hundreds of services, backend and frontend included. Not only that, some of the services were underutilised. What does that mean? It means a service uses very little CPU and memory while it's running. The first time I heard about this was when one of the VPs in Tiket mentioned:

In Google, most services run in a very efficient way, taking almost 60% of the CPU resource. I don't understand why most of the services in Tiket run at such low utilisation.

At that time, another VP argued that the way things were done at Google and at Tiket was quite different, and that if we ran at the same level, our services might not have enough time to scale, which could lead to failures. Once again, I only overheard them, since I was still a manager at the time. But point taken: if it was already proven at Google, why couldn't we do the same at Tiket? And so the exercise started.

As a first step, I inventoried the spec, the minimum and maximum VMs or pods, and the threshold that triggers the autoscale. Usually the threshold refers to CPU, but on several occasions it might be memory or something else; again, we won't discuss the details here. The information looks more or less like this:

custom-1-2048 2/10/30%

How to read that?

It means the spec of the service is 1 core and 2 GB RAM. The minimum number of VMs or pods is 2, and the maximum is 10. The CPU threshold is 30%, meaning that if CPU utilisation reaches that number, the autoscale is triggered. During the exercise I went step by step: since most of the services ran at very low utilisation (less than 5%), my first milestone was to reach 10%. To do that, I needed to reduce the spec first. You need to understand that if you still deploy on VMs (at least on GCP, which is what we use at Tiket), it's not possible to go below a single core. There are specific machine types that go below that, but their core is shared, and it's not wise to use those in production. So if you want to do cost optimisation and you have multiple services, one way to do it is to move to Kubernetes. Once you're in kube, you can request less than a single core.
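To make the shorthand concrete, here is a small sketch that parses such a line into its parts (the format is just the notation I used in the sheet, not any official machine-type syntax):

import re

def parse_spec(line: str) -> dict:
    # Format: custom-<cores>-<memory MB> <min>/<max>/<CPU threshold>%
    m = re.fullmatch(r"custom-([\d.]+)-(\d+)\s+(\d+)/(\d+)/(\d+)%", line)
    if not m:
        raise ValueError(f"unrecognised spec line: {line!r}")
    cores, mem_mb, min_rep, max_rep, threshold = m.groups()
    return {
        "cores": float(cores),                # CPU cores per VM or pod
        "memory_mb": int(mem_mb),             # RAM per VM or pod, in MB
        "min_replicas": int(min_rep),         # autoscaler floor
        "max_replicas": int(max_rep),         # autoscaler ceiling
        "cpu_threshold_pct": int(threshold),  # autoscale trigger
    }

print(parse_spec("custom-1-2048 2/10/30%"))
# {'cores': 1.0, 'memory_mb': 2048, 'min_replicas': 2, 'max_replicas': 10, 'cpu_threshold_pct': 30}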

At 10% CPU utilisation I observed no latency increase, and the autoscale never triggered during the day. That is to be expected: daily traffic is largely the same day to day, and if there is an anomaly, we should capture it rather than use it as a reason to keep utilisation low. After a month without any issues and with good latency, I went to the next step, which was to reach 20% CPU utilisation. Same observation: no latency increase, and the autoscale never triggered. Next we wanted to reach 30% CPU utilisation; however, the threshold we had set was also 30%, meaning that utilising 30% of the CPU would automatically trigger the autoscale. So we needed to change the threshold as well. It ends up looking like this:

custom-0.1-1024 2/10/60%
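The mechanical part of each step is simple arithmetic: given the observed usage and a target utilisation, compute the new, smaller CPU request. A minimal sketch (the function and the numbers are illustrative, not our actual tooling):

def right_size(current_cores: float, observed_util_pct: float, target_util_pct: float) -> float:
    # Absolute usage stays constant: current_cores * observed_util
    # must equal new_cores * target_util.
    used_cores = current_cores * observed_util_pct / 100
    return used_cores / (target_util_pct / 100)

# A service with 1 core at ~3% utilisation actually uses 0.03 cores;
# to run it at 30% utilisation, request about 0.1 cores.
print(right_size(current_cores=1.0, observed_util_pct=3, target_util_pct=30))  # 0.1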

As per Google's experience, they run at 60% under normal conditions, but I still wasn't confident at that time, so the way we used the 60% figure was as a threshold: trigger the autoscale once CPU utilisation reached 60%. Several things I captured during the CPU utilisation exercise:

  1. I observed that 30% was the sweet spot, as latency stayed more or less the same below 30%. Going beyond that increased latency, but it remained manageable up to 60%. Past that, latency increased beyond the SLA (in Tiket, we have a certain latency SLA to achieve), and the autoscale could not catch up with the traffic: Kubernetes takes 1-2 minutes for a new pod to scale up and be ready to serve traffic.
  2. Please be careful not to use 60% as the standard threshold if the service is latency-sensitive. A downstream service may need to keep latency at a certain level, and an increase of even 5-10 ms may damage the overall flow of the other services that depend on it. Be aware of that. (What the threshold change itself looks like in practice is sketched right after this list.)
  3. A service running in Kubernetes could go as low as 0.001 cores. However, if you see this symptom, you should review whether the service can be merged with other services, since its usage is so low. A barely used service is still another thing that needs to be maintained.
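On the threshold point above, here is a rough sketch of what raising the trigger from 30% to 60% can look like when the autoscaler is a Kubernetes HPA, using the official Python client. The service name and namespace are made up, and your setup may manage HPAs through manifests instead:

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
autoscaling = client.AutoscalingV1Api()

# Raise the CPU trigger for a hypothetical service from 30% to 60%.
patch = {"spec": {"targetCPUUtilizationPercentage": 60}}
autoscaling.patch_namespaced_horizontal_pod_autoscaler(
    name="order-api",        # hypothetical HPA name
    namespace="production",  # hypothetical namespace
    body=patch,
)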

Now, what was the result of this?

So, imagine the daily cost of our system was previously 500 USD per day (of course, I won't say the real number, since it's confidential, lol). After the exercise, we cut almost 50% of that cost without any code changes. We don't need magic; we just need to be diligent about the resources we spend and control them based on what we actually need.

During my FinOps work, I created a 3-point strategy to ensure we could achieve our cost optimisation goals:

  1. Low Effort – High Impact: Focus on utilisation.
  2. Medium Effort – High Impact: Merge underutilised stateless or stateful services.
  3. High Effort – High Impact: Rewrite or refactor services with new technology.

The next article will cover how to manage the lower environments.

Immanuel Bayu
