I remember a conversation I had with our VP of Infra, Ichsan. He asked me how much it cost to maintain the lower environments at Tiket. At that time, I just threw out a random number like, “Oh, it should be around 10 USD per day”. Once again, I won’t share the real number for confidentiality, just use your imagination. He laughed and said, “It’s almost 10x the number you mentioned”. What the hell, how could the lower environments be that expensive? The cost was even higher than the project I maintained in production. 2-3 months later, Ichsan resigned, and the task was assigned to me.
Actually, the whole cost optimisation project was assigned to me, as I shared in my previous article here.
First, I analysed how many environments we had at Tiket. In 2024, we had 4 environments: dev, staging, production and sandbox. Dev is for development, the playground for the dev team to test things before handing them over to QA. In Tiket, we call it Pegasus. You may ask why the name is Pegasus? Because Firman (the previous VP of Infra) liked Saint Seiya, he named the dev environment Pegasus. Nope, just joking, don’t take it seriously LOL. Staging is for QA to test a feature once the dev team is done with their own testing. Production, ya, you know, is for the customer. The sandbox is for partners who want to integrate with the Tiket API. When I say “lower environment”, I mean every environment except production.
Now it was time to inventory every machine we were paying for in each environment. I was amazed that in the dev environment alone, we spent almost 10 USD per day (remember, it’s not the real number, but imagine it’s huge). So I asked the team: did you guys use the dev environment daily? I received very surprising information: nobody used it daily. What a useless environment we had created, maintained and spent money on for nothing. So I removed every single thing in the dev environment and made sure no machine was left running there. It was a very good lesson to always ask whether we really need another environment, because more environments means more things to maintain. If you don’t use it, destroy it.
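If you are also on GCP, one quick way to build that inventory is to list every instance across all zones in a project. Below is a minimal sketch of what such an audit could look like, assuming the google-cloud-compute Python client; the project ID and the “environment” label are placeholders, not our actual setup.

```python
# A minimal inventory sketch using the google-cloud-compute client library.
# The project ID and the "environment" label are placeholders for illustration.
from google.cloud import compute_v1


def inventory_instances(project_id: str) -> None:
    """Print every VM in the project with its zone, machine type, status and label."""
    client = compute_v1.InstancesClient()
    # aggregated_list walks every zone of the project in a single call.
    for zone, scoped_list in client.aggregated_list(project=project_id):
        for instance in scoped_list.instances:
            machine_type = instance.machine_type.rsplit("/", 1)[-1]
            env = instance.labels.get("environment", "unlabelled")
            print(f"{zone}\t{instance.name}\t{machine_type}\t{instance.status}\t{env}")


if __name__ == "__main__":
    inventory_instances("my-dev-project")  # hypothetical project ID
```

Anything that shows up as “unlabelled” is exactly the kind of machine nobody wants to claim ownership of, and usually the first candidate to be destroyed.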
Second step: I started to standardise the machines we use in the lower environments. As we use GCP, we can estimate the cost with the GCP pricing calculator. Avoid custom machine types, as they cost about 5% more than a predefined machine with the same specs. For the SOP, we expect a low-cost machine type like E2. Our goal in the lower environments is only to test features, not performance, so high-spec machines are not expected. The requirements for spawning a new machine are clear (a small sketch of how to check them automatically follows the list):
- For the machine type, always use an E2 machine, with as few cores as possible.
- For disk type, always use the standard disk.
- For disk size, always use the smallest disk possible.
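To make the SOP more than a wiki page people forget, you can add a small gate to the provisioning pipeline. Here is a minimal sketch of such a check in Python; the rules mirror the list above, but the function name and the 50 GB disk cap are illustrative assumptions, not our real tooling or threshold.

```python
# A minimal sketch of an SOP gate for lower-environment machine requests.
# The limits mirror the rules in the list above; the 50 GB cap is an
# illustrative number, not the real threshold we used.
ALLOWED_MACHINE_PREFIX = "e2-"
ALLOWED_DISK_TYPE = "pd-standard"
MAX_DISK_GB = 50  # assumption for illustration


def validate_machine_request(machine_type: str, disk_type: str, disk_size_gb: int) -> list[str]:
    """Return a list of SOP violations; an empty list means the request is compliant."""
    violations = []
    if not machine_type.startswith(ALLOWED_MACHINE_PREFIX):
        violations.append(f"machine type {machine_type!r} is not an E2 machine")
    if "custom" in machine_type:
        violations.append("custom machine types cost ~5% more than predefined ones")
    if disk_type != ALLOWED_DISK_TYPE:
        violations.append(f"disk type {disk_type!r} is not the standard disk")
    if disk_size_gb > MAX_DISK_GB:
        violations.append(f"disk size {disk_size_gb} GB exceeds the {MAX_DISK_GB} GB cap")
    return violations


# Example: this request would be rejected for using an SSD disk.
print(validate_machine_request("e2-small", "pd-ssd", 20))
```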


Now we are equipped with the cheapest machine possible. However, it’s not enough: even with the cheapest machine type, on-demand pricing is still high. We need to use Spot VMs (as shared in the screenshot above). So what is a Spot VM?
Affordable compute instances suitable for batch jobs and fault-tolerant workloads. Spot VMs offer the same machine types, options, and performance as regular compute instances. If your applications are fault tolerant and can withstand possible instance preemptions, then Spot instances can reduce your Compute Engine costs by up to 91%.
Ref: https://cloud.google.com/solutions/spot-vms
Spot VMs are good only for stateless services. If you use them for a stateful service, based on our experience, most of the time the application behind it is unable to perform a graceful shutdown. For example, with a VM that runs MongoDB, typically once the VM was preempted by GCP and replaced, MongoDB failed to come back up and the data became corrupted. If you want to use Spot VMs for a stateful service, please do it at your own risk.
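For reference, requesting a Spot VM is mostly a matter of setting the provisioning model when the instance is created. The sketch below, again assuming the google-cloud-compute Python client, shows the relevant part; the project, zone, names and image are placeholders, not a prescription.

```python
# A minimal sketch of creating a Spot VM with the google-cloud-compute client.
# Project, zone, instance name and image are placeholders for illustration.
from google.cloud import compute_v1


def create_spot_instance(project_id: str, zone: str, name: str) -> None:
    disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_image="projects/debian-cloud/global/images/family/debian-12",
            disk_type=f"zones/{zone}/diskTypes/pd-standard",  # standard disk per SOP
            disk_size_gb=10,
        ),
    )
    instance = compute_v1.Instance(
        name=name,
        machine_type=f"zones/{zone}/machineTypes/e2-small",  # E2 machine per SOP
        disks=[disk],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
        # This is what makes it a Spot VM instead of an on-demand instance.
        scheduling=compute_v1.Scheduling(
            provisioning_model="SPOT",
            instance_termination_action="STOP",
        ),
    )
    operation = compute_v1.InstancesClient().insert(
        project=project_id, zone=zone, instance_resource=instance
    )
    operation.result()  # wait until the create operation finishes


# create_spot_instance("my-dev-project", "asia-southeast2-a", "dev-spot-1")
```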
Afterwards, we need to adjust the service spec we use in the lower environments. Like the previous two steps, we need to align with stakeholders: for the applications running in the lower environments, nobody can guarantee 99% uptime. If a service is down, nobody can blame us; we just fix it. By default, we differentiate services by programming language:
- Java services: Default spec is 0.5 core with 1 GB RAM.
- Golang services: Default spec is 0.1 core with 128 MB RAM.
- NextJS (JavaScript) services: Default spec is 0.1 core with 128 MB RAM.
Some services need a higher spec than the default, which is fine. However, with a standard spec we can predict how many node pools we need to spawn from the start. In Tiket, most services already run in Kubernetes, which makes it easy to do bin packing in a lower environment. Please don’t ask if we do it in production LOL.
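Since the services run on Kubernetes, those defaults translate directly into resource requests on each Deployment. Here is a minimal sketch of how the defaults above could be expressed and applied; it assumes the official Kubernetes Python client, and it assumes the container is named after the service, which may not match your setup.

```python
# A minimal sketch of the default lower-environment resource requests per stack.
# The values mirror the defaults listed above; applying them assumes the official
# Kubernetes Python client and a container named after the service (both assumptions).
from kubernetes import client, config

DEFAULT_RESOURCES = {
    "java":   {"cpu": "500m", "memory": "1Gi"},
    "golang": {"cpu": "100m", "memory": "128Mi"},
    "nextjs": {"cpu": "100m", "memory": "128Mi"},
}


def apply_default_spec(name: str, namespace: str, stack: str) -> None:
    """Patch a Deployment so its container requests the default spec for its stack."""
    requests = DEFAULT_RESOURCES[stack]
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        # Containers are merged by name in a strategic merge patch.
                        {"name": name, "resources": {"requests": requests}}
                    ]
                }
            }
        }
    }
    config.load_kube_config()
    client.AppsV1Api().patch_namespaced_deployment(name=name, namespace=namespace, body=patch)


# apply_default_spec("payment-service", "staging", "golang")  # hypothetical service
```

Keeping the requests uniform per stack is what makes the bin packing predictable: you know roughly how many pods fit on one E2 node before you spawn the node pool.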
The last initiative is what we call the infra shutdown. It means that when the machines are not in use, a script shuts down everything in the lower environments and brings it back up before the engineers need it again. If your team is mature enough, perhaps you can use an ephemeral environment approach instead of creating a dedicated lower environment. So what’s an ephemeral environment?
An ephemeral environment is a short-lived, isolated deployment of an application. They are usually based on a branch or PR/MR. Ephemeral environments are spun up on-demand for running tests, previewing features, and collaborating asynchronously across teams.
Ref: https://ephemeralenvironments.io/
Previously, we wanted to go with this approach; however, due to its complexity, the RFC failed, and we went with shutting down the infra machines every night and weekend. It saves almost 30-40% of the cost.
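To give an idea of what that script looks like, here is a stripped-down sketch using the google-cloud-compute client. It stops every labelled lower-environment VM and starts it again on demand; the project, zone and the “env-class” label are placeholders, the scheduling itself (cron or Cloud Scheduler) is not shown, and the real script also needs to cover Kubernetes node pools.

```python
# A stripped-down sketch of the "infra shutdown" script.
# Project, zone and the "env-class" label are placeholders; the real script
# also handles multiple zones and Kubernetes node pools.
from google.cloud import compute_v1


def toggle_lower_env(project_id: str, zone: str, action: str) -> None:
    """Stop or start every VM labelled as a lower environment in one zone."""
    instances = compute_v1.InstancesClient()
    request = compute_v1.ListInstancesRequest(
        project=project_id,
        zone=zone,
        filter='labels.env-class="lower"',  # hypothetical label
    )
    for instance in instances.list(request=request):
        if action == "stop" and instance.status == "RUNNING":
            instances.stop(project=project_id, zone=zone, instance=instance.name)
        elif action == "start" and instance.status == "TERMINATED":
            instances.start(project=project_id, zone=zone, instance=instance.name)


# Run by a scheduler: "stop" at night and on weekends, "start" before working hours.
# toggle_lower_env("my-lower-env-project", "asia-southeast2-a", "stop")
```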
So, to summarise:
- Always inventory your machines, even in the lower environments. Do it periodically, so if someone creates a machine without accountability, you can take down the machine, and if possible, take down that person LOL.
- Create an SOP for creating a new machine in a lower environment. Make sure everybody follows it.
- Use Spot VMs. Be cautious if you use them for stateful services.
- Ensure stateless services run with the default spec in the lower environments.
- Ephemeral environments are the key. Just kidding; if you can’t do that, create a script that periodically turns your lower-environment machines on and off. We call it the infra shutdown.