To measure or not to know, that is the question


How can we accelerate delivery times in our teams?

By Martín Ramírez, Site Reliability Engineer at etermax

“If you cannot measure it, you cannot improve it”, said Lord Kelvin. This quote, which you have probably heard more than once, illustrates the importance of measuring. And the CI/CD process is no exception.

This post explores how we implement metrics in our pipelines at “Loki 🦊”, the SRE team of etermax’s Cloud and Platform.

The goal was clear: we wanted to improve our automated delivery process. To do that, we needed metrics on the following:

  • the number of jobs running
  • the final status of jobs (success / fail / canceled)
  • the average execution time
  • and last but not least, the tech stacks being used

We currently provide backend and mobile application services (native code, Unity, and Flutter), among others. This calls for a flexible solution that doesn’t add complexity and is easy for our development teams to adopt. We seek to know the stacks each team currently uses, standardize the versions in use, improve execution times, plan the stacks we will adopt in the future, and always be ready to keep moving forward.

So, how can we obtain this data?

For more than a year now, the Cloud and Platform section has been fostering the development of products that make the company’s tools easier to use and give our teams an abstraction layer over the underlying infrastructure.

In this context, eterPIPE, our CI/CD product, was born.

eterPIPE encompasses the different CI/CD workflows at etermax, which were divided into two main categories for easier management.

  • App Pipelines: builds and deploys to the different app stores.
  • Service Pipelines: builds and deploys of the platform’s different services and needs.

From this product vision, we explored different tools to add to eterPIPE and measure our App Pipelines:

  • Default metrics provided by GitLab
  • GitLab Exporter
  • Developing our own solution

Default metrics provided by GitLab

GitLab doesn’t provide default metrics or a dashboard with the information we need. Even though GitLab Runners expose metrics, they don’t provide specific information about the jobs they run; for example, we cannot access the pipeline’s environment variables. For this reason, we decided to explore other alternatives.

GitLab Exporter

Following the documentation provided by GitLab, we analyzed gitlab-ci-pipelines-exporter. It is an excellent tool, but it has some limitations: it only lets us monitor the status of the most recently run jobs, so we could neither build a history of build times nor access the pipeline’s environment variables to obtain the versions in use.

Developing our own solution

We considered using the Go client library to develop a metrics exporter that calls the GitLab API and exposes the results at /metrics. The downside is that the GitLab API does not return the values of the environment variables we need in order to know the stacks.
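To make the idea concrete, here is a minimal sketch of the exporter approach we evaluated, not production code: it polls the GitLab jobs API (GET /projects/:id/jobs) and exposes aggregate gauges at /metrics. The environment variable names, metric names, and polling interval are assumptions made for the example, and the limitation described above still applies: the API response does not include the jobs’ environment variables.

```go
// Sketch of a GitLab jobs exporter: poll the API, expose gauges at /metrics.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Subset of the fields returned by GET /api/v4/projects/:id/jobs.
type job struct {
	Name     string  `json:"name"`
	Status   string  `json:"status"` // success / failed / canceled / running ...
	Duration float64 `json:"duration"`
}

var (
	jobsByStatus = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "gitlab_jobs_by_status",
		Help: "Jobs seen in the last poll, by status.",
	}, []string{"status"})

	avgDuration = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "gitlab_jobs_avg_duration_seconds",
		Help: "Average duration of the jobs seen in the last poll.",
	})
)

// poll fetches the latest jobs of one project and updates the gauges.
func poll(baseURL, projectID, token string) error {
	url := fmt.Sprintf("%s/api/v4/projects/%s/jobs?per_page=100", baseURL, projectID)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("PRIVATE-TOKEN", token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var jobs []job
	if err := json.NewDecoder(resp.Body).Decode(&jobs); err != nil {
		return err
	}

	jobsByStatus.Reset()
	var total float64
	for _, j := range jobs {
		jobsByStatus.WithLabelValues(j.Status).Inc()
		total += j.Duration
	}
	if len(jobs) > 0 {
		avgDuration.Set(total / float64(len(jobs)))
	}
	return nil
}

func main() {
	prometheus.MustRegister(jobsByStatus, avgDuration)

	// Poll GitLab periodically in the background.
	go func() {
		for {
			if err := poll(os.Getenv("GITLAB_URL"), os.Getenv("PROJECT_ID"), os.Getenv("GITLAB_TOKEN")); err != nil {
				log.Printf("poll failed: %v", err)
			}
			time.Sleep(time.Minute)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```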

This led us to analyze our own infrastructure. We run the jobs on a custom runner because we need to connect to the external platform to run the builds, and this runner has access to all of the jobs’ environment variables. If we could configure a step so that the runner pushes all this information to an endpoint that builds and exposes the metrics, the problem would be solved.

We had just begun to investigate and design this endpoint when we came across Prometheus Pushgateway. This tool solved all our problems and spared us from having to develop something new.

With this idea in mind and a lot of hard work, we implemented the solution: each pipeline job pushes its metadata to the Pushgateway, and Prometheus scrapes it from there.
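As a rough illustration of the idea rather than our exact implementation, a final job step could push its metadata with the Prometheus Go client’s push package. The CI_* variables are provided by GitLab; PUSHGATEWAY_URL, JOB_DURATION_SECONDS, STACK, and STACK_VERSION are hypothetical variables chosen for this example.

```go
// Sketch of a job step that pushes pipeline metadata to Prometheus Pushgateway.
package main

import (
	"log"
	"os"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Job duration measured by the pipeline and exposed through an
	// environment variable (hypothetical name for this example).
	seconds, _ := strconv.ParseFloat(os.Getenv("JOB_DURATION_SECONDS"), 64)

	duration := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "eterpipe_job_duration_seconds",
		Help: "Duration of the last CI job.",
	})
	duration.Set(seconds)

	// Grouping labels identify the job, its status, and the stack it uses,
	// so we can slice the metrics per team and per tech stack later on.
	err := push.New(os.Getenv("PUSHGATEWAY_URL"), "eterpipe").
		Collector(duration).
		Grouping("project", os.Getenv("CI_PROJECT_NAME")).
		Grouping("job", os.Getenv("CI_JOB_NAME")).
		Grouping("status", os.Getenv("CI_JOB_STATUS")).
		Grouping("stack", os.Getenv("STACK")).
		Grouping("stack_version", os.Getenv("STACK_VERSION")).
		Push()
	if err != nil {
		log.Fatalf("could not push to Pushgateway: %v", err)
	}
}
```

Because the runner has the jobs’ environment variables, this step can report exactly the data the GitLab API could not give us.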

During this first stage, we managed to get metrics on the number of running jobs, their final status, the average execution time, and the stacks used by each team in our App Pipelines.

As a result, we:

  • gained visibility into the error rate
  • reduced the team’s operational workload
  • simplified failure diagnosis
  • generated alerts about potential problems
  • had the opportunity to iterate on the platform and measure the impact
  • planned integral improvements with a Roadmap visible to the entire company, prioritizing new functionalities in an organized way

In the upcoming months, we will analyze the information gathered and think about how to keep improving our runtimes, stack normalization, and metrics. This will allow us to constantly enhance eterPIPE, improving the speed and quality of delivery for our product teams.

In conclusion, we keep moving forward. #BeAGameChanger

“This is the Way”

