Your code coverage is lying
Code coverage is a metric which does NOT represent how much or how well our code is tested.
Many developers (and their managers) mistakenly believe that code coverage measures the percentage of code that has been (thoroughly) tested. Unfortunately, this is not the case.
Code coverage measures the percentage of code that has been executed during testing. This subtle distinction is crucial, as it highlights a significant gap between what developers think they're measuring and what they're actually measuring.
In fact, misusing code coverage can lead to a false sense of security: developers (and their managers) watch a magic number grow and red checkmarks turn green, a greenwashing effect that is easy to put on a chart and show to stakeholders, but that doesn't actually reflect the quality of the codebase.
What is code coverage then?
The part "is executed" is key, the code which is marked as "covered" is the code that was executed during the test run. It doesn't mean that the code was tested, it just means that it was executed.
What's the difference? If it's executed while the tests run, it's tested, right? No.
Consider the following example: I'm testing a firecracker in my backyard. I light it up and, while it's burning, a sudden earthquake hits and the firecracker explodes. I log: "Causes violent city-wide tremors and a tiny explosion, appropriate for all ages."
Code coverage works exactly like that: it writes down what happened while the tests were running, but it cannot tell whether the tests were the direct or indirect cause of the code being executed, nor whether the executed code was actually verified.
Covered is not tested
In this example, we'll write code which is executed but not actually tested; it will still get marked as fully covered:
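The code here is a minimal sketch to follow along with: a Calculator, a Formatter that depends on it, and a test that exercises the Formatter (and, through it, the Calculator) without declaring any coverage target. The exact class shapes, method names and values are illustrative assumptions, not the article's original listing.

```php
<?php
// src/Calculator.php (sketch)

namespace App;

class Calculator
{
    public function add(int $a, int $b): int
    {
        return $a + $b;
    }
}
```

```php
<?php
// src/Formatter.php (sketch)

namespace App;

class Formatter
{
    public function __construct(private Calculator $calculator)
    {
    }

    public function formatSum(int $a, int $b): string
    {
        return sprintf('%d + %d = %d', $a, $b, $this->calculator->add($a, $b));
    }
}
```

```php
<?php
// tests/EagerFormatterTest.php (sketch)

use App\Calculator;
use App\Formatter;
use PHPUnit\Framework\TestCase;

final class EagerFormatterTest extends TestCase
{
    public function testExample(): void
    {
        $formatter = new Formatter(new Calculator());

        // We only assert on the Formatter's output, yet every Calculator line
        // executed along the way gets reported as "covered" as well.
        self::assertSame('2 + 2 = 4', $formatter->formatSum(2, 2));
    }
}
```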
If we run the test suite, we get the output:
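With a coverage driver enabled, the report shows everything at 100%, roughly along these lines (illustrative numbers for the sketch above, not a verbatim PHPUnit transcript):

```
OK (1 test, 1 assertion)

Code Coverage Report Summary:
  Classes: 100.00% (2/2)
  Methods: 100.00% (3/3)
  Lines:   100.00% (4/4)
```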
In this example, the code coverage tool would report a high percentage of coverage, but in reality the Calculator code is basically not tested at all.
If we were to now judge the project state by code coverage alone, we'd be misled into thinking that the code is well-tested and no further work needs to be done.
Only mark as covered what is tested
Let's fix the test to actually declare what it's testing:
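Sticking with the sketch from above, the key change is the CoversClass attribute (PHPUnit 10+), which declares that this test only claims coverage of the Formatter:

```php
<?php
// tests/CoversFormatterTest.php (sketch)

use App\Calculator;
use App\Formatter;
use PHPUnit\Framework\Attributes\CoversClass;
use PHPUnit\Framework\TestCase;

#[CoversClass(Formatter::class)]
final class CoversFormatterTest extends TestCase
{
    public function testExample(): void
    {
        $formatter = new Formatter(new Calculator());

        self::assertSame('2 + 2 = 4', $formatter->formatSum(2, 2));
    }
}
```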
Now, when we run the test suite, we get the output:
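Again illustrative rather than verbatim; the point is that the Calculator no longer counts as covered, because no test claims it:

```
OK (1 test, 1 assertion)

Code Coverage Report Summary:
  Classes: 50.00% (1/2)
  Methods: 66.67% (2/3)
  Lines:   75.00% (3/4)
```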
A keen reader will notice that this decreases the overall code coverage percentage, but the lower number is a more accurate representation of what's actually tested. The decrease is a good thing, as we'll see in the Mutation testing section.
PHPUnit-specific configuration to enforce this
We can enforce this by enabling two separate configuration options, both of which are enabled by default when generating the configuration from scratch (a minimal phpunit.xml sketch follows the list below):
- requireCoverageMetadata forces the test to explicitly declare what code it's covering. Without it, we can get an error like:

  1) EagerFormatterTest::testExample
  This test does not define a code coverage target but is expected to do so

- beStrictAboutCoverageMetadata disallows any code which is not marked as covered or used from being executed during the test:

  1) CoversFormatterTest::testExample
  This test executed code that is not listed as code to be covered or used:
  - App\Calculator

  We need to explicitly mark the code as used:

  #[PHPUnit\Framework\Attributes\UsesClass(Calculator::class)]
  // rest of the CoversFormatterTest as before
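A minimal phpunit.xml sketch with both options enabled (attribute names as in PHPUnit 10+; the rest of the configuration is just an example skeleton):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<phpunit bootstrap="vendor/autoload.php"
         requireCoverageMetadata="true"
         beStrictAboutCoverageMetadata="true">
    <testsuites>
        <testsuite name="default">
            <directory>tests</directory>
        </testsuite>
    </testsuites>
    <source>
        <include>
            <directory>src</directory>
        </include>
    </source>
</phpunit>
```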
Break stuff
But wait! In the test, the Calculator actually does work, so, technically, it's tested too.
Let's test this hypothesis by breaking the Calculator on purpose and rerunning the test suite:
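For example, assuming the add method from the earlier sketch (the original article's change may differ), we flip the operator:

```php
<?php
// src/Calculator.php, broken on purpose (sketch)

namespace App;

class Calculator
{
    public function add(int $a, int $b): int
    {
        return $a * $b; // was: $a + $b
    }
}
```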
We now know this shouldn't work, but when we run the test suite, it still passes (?) and the code coverage is still 100%?! We've managed to break our "fully tested and 100% covered code" without a single test noticing and failing. How is that even possible?
It just so happens that the test for the Formatter has a very simple usage of the Calculator and doesn't exercise all the corner cases. Why should it? It's not supposed to test the Calculator at all.
Mutation testing
This "break code on purpose in a controlled way, rerun the tests and see if they notice" is actually a technique called mutation testing. It's a way to test the robustness of the tests themselves, by introducing small changes in the code and seeing if the tests catch them. We can think of mutation testing as tests for tests, with a nice property that it's automated, and we can run it on every commit as long as we already have the tests.
Glossary:
- A "mutant" is a change to the code
- an "escaped mutant" is a mutant which no test caught
- a "killed mutant" is a mutant which was caught by a test
In our case, we can use the excellent Infection tool to run mutation testing on our codebase; it will show us, among others, the exact mutant we just created by hand:
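A typical run looks something like this, assuming Infection is installed as a Composer dev dependency and already configured for the project (flags and report layout depend on the Infection version):

```bash
vendor/bin/infection --show-mutations
```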
In the output, we can see several interesting things:
- it tried to modify the code in 6 different ways
- of the 6, five were caught by our existing test and one was not, a success rate of 83%
- this success rate is the new metric we get: the Mutation Score Indicator (MSI), the percentage of mutants killed by the tests (here 5 / 6 ≈ 83%), a measure of how well our tests are catching the mutants
Only mutate what is covered
What does all this have to do with code coverage?
We can use the code coverage as an allow-list for the mutation testing, basically declaring which parts of the code we consider tested and ready to be mutated. It works a bit like double-entry bookkeeping: we have two independent ways to check whether our tests are good, and if they're not, we can see where they're lacking. If we try to inflate the code coverage by marking untested code as covered, the mutation testing will catch it immediately.
Use code coverage as opt-in for mutation testing
This principle allows us to opt in to mutation testing as we're increasing the coverage, and we can now use the coverage as a guide to see which parts of the code are not tested at all.
In our case, we'll use the --only-covered Infection flag:
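Same invocation as before, with the flag added:

```bash
vendor/bin/infection --show-mutations --only-covered
```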
Our Calculator was not mutated here because we've explicitly said our tests are not testing it, meaning we can now judge the code coverage percentage as a more accurate representation of what's actually tested.
Conclusion
Testing is hard. Judging how well we test is harder.
When we start using a metric like the code coverage percentage as our ultimate quality goal, we're optimizing for lying to ourselves about the state of our codebase instead of actually measuring its quality.
We should strive to make the code coverage more accurate, not just as high as possible, and we can keep ourselves honest about it by using mutation testing as an unbiased second opinion on how well our tests actually test the code.