Who doesn’t love fast things?
And what’s better than fast is something even faster. Faster processors, faster cars, faster internet. The same applies when we select technologies for development: it is only natural to want a faster database, framework, or programming language, because we need to stay relevant in such a competitive age. With so much demand for that kind of information, there are benchmarks for almost everything in our field. But speed is not the only metric that matters.
And benchmark results don’t usually paint the right picture.
Benchmarks are not designed for your use case
Let’s consider a fictional benchmark of gRPC vs. a REST API.
Typically, these tests are run with X concurrent connections to simulate a real-world scenario. Let’s say the tests were configured with 50 concurrent users. But why 50? Is it an arbitrary number? Ideally, the concurrency value should be relevant to your domain, i.e. the target concurrency your service needs to support to be successful.
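To make that concrete, here is a minimal sketch in Go of what such a load test boils down to. The endpoint, the request count, and the concurrency value are all hypothetical; the point is simply that the concurrency figure is a constant somebody picked, and nothing forces it to match your traffic.

package main

import (
    "fmt"
    "net/http"
    "sync"
    "time"
)

func main() {
    const (
        concurrency = 50   // why 50? ideally this mirrors the load YOUR service must handle
        requests    = 1000 // total requests spread across all workers
    )
    target := "http://localhost:8080/users" // hypothetical endpoint

    // One token per request; the workers drain the channel.
    jobs := make(chan struct{}, requests)
    for i := 0; i < requests; i++ {
        jobs <- struct{}{}
    }
    close(jobs)

    var wg sync.WaitGroup
    start := time.Now()
    for w := 0; w < concurrency; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for range jobs {
                if resp, err := http.Get(target); err == nil {
                    resp.Body.Close()
                }
            }
        }()
    }
    wg.Wait()
    fmt.Printf("%d requests at concurrency %d took %v\n", requests, concurrency, time.Since(start))
}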
And that is before other considerations you might have. Was the code written in the language you need to use? Does it transfer a payload of a similar size to your service’s? Was compression enabled? How about encryption, or HTTPS?
With so many other parameters at play, no single benchmark can fit your use case exactly. Therefore, any outcome from such a benchmark can only be viewed as a guideline at best, to be validated later on your own.
Benchmark setup might not be ideal
Then there are the tools. In our example of gRPC vs. REST, there are various options for running the tests.
One direct option would be to use tools already available for each communication style: ghz for gRPC and ab for REST. But that adds a new variable to the tests, because the tools themselves are not the same. While ghz is written in Go, ab is written in C. Who knows what overhead difference the tools contribute.
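For illustration, the same nominal test, say 10,000 requests at a concurrency of 50, ends up being expressed through two entirely different programs (the endpoint, proto file, and service name below are hypothetical):

ghz --insecure --proto ./users.proto --call users.UserService.ListUsers -n 10000 -c 50 localhost:50051
ab -n 10000 -c 50 http://localhost:8080/users

Everything that differs between the two tools, from HTTP client implementation to connection handling, quietly becomes part of the measurement.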
Another example would be toggling compression. Intuition says that enabling compression should make the operation faster, because the data being transferred becomes drastically smaller. But that is not necessarily the case. The assumption ignores the fact that compression demands extra computation time, both for the sender and the receiver, so it might actually reduce overall speed and throughput. Not to mention that compression libraries in different languages have different performance characteristics. Depending on the benchmark setup, toggling compression might skew the data.
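A rough way to see the trade-off is a micro-benchmark that compares buffering a payload as-is against gzip-compressing it first. This is only a sketch: the roughly 1 MB payload below is made-up, highly repetitive JSON, so it compresses far better than most real data would.

package payload_test

import (
    "bytes"
    "compress/gzip"
    "strings"
    "testing"
)

// Hypothetical ~1 MB payload of repeated JSON; real payloads compress differently.
var payload = []byte(strings.Repeat(`{"id":123,"name":"example","active":true}`, 25000))

// Baseline: just buffer the raw bytes.
func BenchmarkRaw(b *testing.B) {
    for i := 0; i < b.N; i++ {
        var buf bytes.Buffer
        buf.Write(payload)
    }
}

// Compression shrinks what goes on the wire, but spends CPU here (and again on the receiver).
func BenchmarkGzip(b *testing.B) {
    for i := 0; i < b.N; i++ {
        var buf bytes.Buffer
        zw := gzip.NewWriter(&buf)
        zw.Write(payload)
        zw.Close()
    }
}

Running it with go test -bench=. shows how much CPU the compression step costs on your machine; whether the smaller payload wins that time back depends on your network and your data.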
Benchmarks can be optimized to sell a pitch
Although it might not be intentional, if the author of the benchmark is trying to pitch their own product, the tests might be optimized for that particular use case.
Consider a new data-serialization library competing with a popular JSON library such as Json.Net from Newtonsoft. It might claim to be faster and more memory efficient, and it has benchmark data and graphs to back it up. But that benchmark might only apply to a very specific mix of data types. Will it still show the same memory consumption and performance if the data is heavy with strings, or contains a large collection? How about repeatedly serializing and deserializing payloads of fluctuating size that end up on the heap; would it still help there?
The only way to be certain is to verify it yourself.
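Verifying it can be as small as a micro-benchmark over your own data shapes. As a hedged sketch of the idea, here it is in Go with the standard encoding/json package rather than Json.Net, since the principle is language-agnostic; the field names and sizes are invented purely to show how much the data mix changes the result.

package serialize_test

import (
    "encoding/json"
    "strings"
    "testing"
)

// Two hypothetical shapes: the mix of types is exactly what a vendor benchmark may not match.
type numericHeavy struct {
    A, B, C, D int64
    E, F       float64
}

type stringHeavy struct {
    Title, Body, Tags string
}

var nums = numericHeavy{1, 2, 3, 4, 5.5, 6.6}

var strs = stringHeavy{
    Title: strings.Repeat("benchmark ", 10),
    Body:  strings.Repeat("lorem ipsum ", 500),
    Tags:  "a,b,c",
}

func BenchmarkNumericHeavy(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        json.Marshal(nums)
    }
}

func BenchmarkStringHeavy(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        json.Marshal(strs)
    }
}

Swap in structs that look like your actual payloads, add the library you are evaluating as a third case, and the numbers that matter to you fall out of go test -bench=. directly.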
Benchmarks do not tell the whole story, or they tell an old one
Unless the benchmark is repeated over time, it will eventually become outdated, like many things in tech.
It is one of the reasons I am suspicious of test results posted in a README on GitHub that has not been updated in ages. The results could be outdated for many reasons: maybe the code got slower because of the slew of new features being added, or the competition got faster thanks to a recent optimization exercise. Always check the date it was published.
Ultimately, benchmark results do not and cannot tell the whole story. They are, at best, one piece of a bigger puzzle. When evaluating a technology to adopt, there are many other metrics to consider besides speed and throughput: support, maturity, reliability, adoption, maintainability, and cost, to list just a few of the things that I, personally, would weigh heavily when assessing my options.