Software microbenchmarking in the cloud. How bad is it really?

Journal article
Authors C. Laaber
Joel Scheuner
Philipp Leitner
Published in Empirical Software Engineering
Volume 24
Issue 4
Pages 2469-2508
ISSN 1382-3256
Publication year 2019
Published at Department of Computer Science and Engineering, Computing Science (GU)
Language en
Keywords Performance testing, Microbenchmarking, Cloud, Performance-regression detection, Computer Science
Subject categories Computer and Information Science


Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have access, knowledge, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three different well-known public cloud services (AWS, GCE, and Azure) using three different cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (by a coefficient of variation from 0.03% to >100%). However, executing test and control experiments on the same instances (in randomized order) allows us to detect slowdowns of 10% or less with high confidence, using state-of-the-art statistical tests (i.e., Wilcoxon rank-sum and overlapping bootstrapped confidence intervals). Finally, our results indicate that Wilcoxon rank-sum manages to detect smaller slowdowns in cloud environments.
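
As an illustration of two quantities named in the abstract, the following is a minimal sketch of the coefficient of variation and of a slowdown check based on overlapping bootstrapped confidence intervals. It is not the authors' implementation: the function names, the percentile bootstrap of the mean, the 1000 resampling iterations, and the 95% confidence level are all assumptions made for this example.

```python
import random
import statistics


def coefficient_of_variation(samples):
    """Return the sample standard deviation as a percentage of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100.0


def bootstrap_ci(samples, iterations=1000, confidence=0.95, rng=None):
    """Percentile bootstrap confidence interval for the mean (assumed variant)."""
    rng = rng or random.Random(0)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(samples, k=len(samples)))
        for _ in range(iterations)
    )
    lower = int((1.0 - confidence) / 2.0 * iterations)
    upper = iterations - lower - 1
    return means[lower], means[upper]


def intervals_overlap(a, b):
    """Return True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def change_detected(control, test, rng=None):
    """Report a change when the two bootstrapped intervals do not overlap."""
    rng = rng or random.Random(0)
    return not intervals_overlap(
        bootstrap_ci(control, rng=rng), bootstrap_ci(test, rng=rng)
    )
```

Here `control` and `test` would be lists of execution times for the same benchmark before and after a code change, collected on the same instance. The Wilcoxon rank-sum test is not sketched; SciPy provides one as `scipy.stats.ranksums`.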
