Test different databases on exactly the same hardware. In a number of database benchmarks, I’ve seen people benchmark competitors on different hardware. For example, in one benchmark the authors say: “We actually wanted to do the benchmark on the same hardware, an m5.8xlarge, but the only pre-baked configuration we have for m5.8xlarge is actually the m5d.8xlarge … Instead, we run on a c5.9xlarge instance”. Bad news, guys: when you run benchmarks on different hardware, at the very least you can’t then say that something is “106.76%” or “103.13%” of something else. Even when you test on the same bare-metal server, it’s quite difficult to get a coefficient of variation lower than 5%. A 3% difference measured on different servers can most likely be ignored. Given all that, how can one be sure the final conclusion is true?
Test with the full OS cache purged before each test. At DB Benchmarks we specialize in latency testing. We make sure some query against some database takes 117ms today, tomorrow, and in a week. That’s a fundamental thing in the platform; without it nothing else matters. It’s hard. To make it happen, you have to make sure that when you test a query the environment is exactly the same as the previous time. One of the things people always forget about is purging the OS cache. If you don’t, chances are that part of the data your query has to read from disk is already in memory, which makes the result unstable.
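As an illustration of the purge step, here is a minimal sketch that assumes a Linux host and root privileges. It uses the kernel's `/proc/sys/vm/drop_caches` control file; the function name `purge_os_cache` and its `control_file` parameter are made up for this example and are not part of any benchmarking platform.

```python
import os
from pathlib import Path

# Linux-specific control file. Writing "3" asks the kernel to drop the
# page cache as well as dentries and inodes.
DROP_CACHES = Path("/proc/sys/vm/drop_caches")


def purge_os_cache(control_file: Path = DROP_CACHES) -> None:
    """Flush dirty pages to disk, then ask the kernel to drop clean caches."""
    # Dirty pages cannot be dropped, so write them out first.
    os.sync()
    # Requires root when pointed at the real control file.
    control_file.write_text("3\n")
```

Calling `purge_os_cache()` before every measured query is the idea; the `control_file` parameter exists only so the function can be exercised without root.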
Measure cold run separately. Disks, be it NVMe or HDD, are all still significantly slower than RAM. People who do benchmarks often don’t pay enough attention to this, although it matters, especially for analytical queries and analytical databases, where cold queries may happen often. So the principle is: measure cold run time separately. Otherwise, you completely hide how well the database handles I/O.
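One way to keep the cold number apart from the rest could look like the sketch below. The callables `run_query` and `reset_environment` are placeholders for whatever executes the query and whatever resets the environment (purging caches, restarting the database); both are assumptions for the example.

```python
import time
from typing import Callable, List, Tuple


def time_cold_and_hot(
    run_query: Callable[[], object],
    reset_environment: Callable[[], None],
    hot_runs: int = 5,
) -> Tuple[float, List[float]]:
    """Return (cold_seconds, [hot_seconds, ...]) for one query."""
    # Put the system into a known cold state before the first run.
    reset_environment()

    start = time.perf_counter()
    run_query()
    cold = time.perf_counter() - start

    # Later runs may benefit from data already in memory, so they are
    # collected in their own list and never averaged with the cold run.
    hot = []
    for _ in range(hot_runs):
        start = time.perf_counter()
        run_query()
        hot.append(time.perf_counter() - start)
    return cold, hot
```

Reporting the two values side by side shows how much of the latency is I/O, which a single blended average would hide.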
The database being tested should have all its internal caches disabled. This is closely related to the previous point. Otherwise, you’ll just measure cache performance, which might make sense for some tests, but normally it’s not what you want.
Nothing else should be running during testing. Otherwise, your test results may be very unstable, since your database will have to compete with other processes for resources.
You need to restart the database before each query. Otherwise, previous queries can still affect the current query’s response time, even after internal caches have been cleared.
You need to wait until the database warms up completely after it’s started. Otherwise, at the very least you end up competing with the database’s warmup process for I/O, which can severely spoil your test results.
Test at a fixed CPU frequency. Otherwise, if you are using the “ondemand” CPU governor (which is normally the default), it can easily turn your 500ms response time into 1000+ ms.
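A quick pre-flight check for the governor might look like the following sketch. It assumes a Linux host that exposes cpufreq through sysfs. The helper names `read_governors` and `all_pinned` are invented for this example, and treating the `performance` governor as "fixed" is itself an assumption, since turbo boost and similar features can still change the clock.

```python
from pathlib import Path
from typing import Dict

# Standard sysfs location for per-CPU cpufreq settings on Linux.
CPU_SYSFS = Path("/sys/devices/system/cpu")


def read_governors(cpu_root: Path = CPU_SYSFS) -> Dict[str, str]:
    """Map each CPU directory name (e.g. 'cpu0') to its current governor."""
    governors = {}
    for gov_file in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def all_pinned(governors: Dict[str, str], expected: str = "performance") -> bool:
    """True only if at least one CPU was found and all use the expected governor."""
    return bool(governors) and all(g == expected for g in governors.values())
```

A benchmark runner could refuse to start when `all_pinned(read_governors())` is false, rather than silently producing numbers that depend on frequency scaling.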
Test on SSD/NVMe rather than HDD. Otherwise, depending on where your files are located on the HDD, you can get up to 2x lower or higher I/O performance (no joking, 2x), which can make at least your cold query results wrong.
Most important: more repetitions and control over the coefficient of variation (CV). This is probably the most common mistake in latency benchmarking: people run a query 2–3 times, calculate an average, and that’s it. In most cases a few attempts are not enough: the next time you run the same query, you can get a result that differs by 50%.
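To make "control over CV" concrete: the coefficient of variation is the standard deviation of the measured latencies divided by their mean. The sketch below computes it with Python's standard `statistics` module. The `is_stable` helper and its thresholds (`max_cv=0.05`, `min_runs=10`) are illustrative assumptions, not values prescribed here.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean (0.05 means 5%)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("mean is zero, CV is undefined")
    return statistics.stdev(samples) / mean


def is_stable(
    samples: Sequence[float], max_cv: float = 0.05, min_runs: int = 10
) -> bool:
    """Accept a result only after enough runs and with low enough spread."""
    if len(samples) < min_runs:
        return False
    return coefficient_of_variation(samples) <= max_cv
```

For example, two runs of 90 ms and 110 ms average to 100 ms, yet their CV is about 14%, which is far too noisy to support a claim about a 3% difference.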