visit
This is the first and introductory post in a series about property-based testing. This post explains what property-based testing is, and what a typical property-based test looks like. The rest of the series will dive deeper into how libraries for property-based testing are implemented - admittedly I'm biased as the maintainer of but they are microcosms of interesting and widely applicable ideas.
In traditional unit testing, a test consists of an example input that is fed to the system under test, followed by an assertion that the output is what we expect. In property-based testing a test instead relates the input and the output of the system under test, by asserting that certain general properties hold on a ton of automatically generated inputs.
As an example, let's imagine a function sort_by_age
that takes as argument a list of Person
objects, which have various fields (like name
), and returns them sorted by age
. A unit test for our problem could be:
persons = [
Person("Max",12),
Person("Nemo",22),
Person("Eloise",15)
]
actual = sort_by_age(persons)
expected = [
Person("Max",12),
Person("Eloise",15),
Person("Nemo",22)
]
assert actual == expected
To instead write a property-based test for sort_by_age
, consider how to write a function is_valid(persons_in, persons_out) -> bool
which given the input and the output of sort_by_age
, returns whether that input-output pair is valid.
Chances are you've come up with a few things to check that at first blush are trivial. For example, is_valid
could check whether the two lists are of equal length - after all, sorting a list should not change its length:
def is_valid_length(persons_in, persons_out):
assert len(persons_in) == len(persons_out)
Another one could be checking that the list persons_out
is indeed sorted by age:
def is_valid_sorted(persons_in, persons_out):
assert all(
p[i].age <= p[i + 1].age
for i in range(len(persons_out)-1)
)
Lastly we could check that sorting Person
values doesn't change them - for brevity we'll just check that their name hasn't changed:
def is_valid_unchanged(persons_in, persons_out):
assert { p.name for p in persons_in } == { p.name for p in persons_out }
Not convinced? Prepare to be! Given an is_valid
method a property-based test for sort_by_age
is easy to write:
for _ in range(ONE_GAZILLION):
persons_in = generate_persons()
persons_out = sort_by_age(persons_in)
is_valid_length(persons_in, persons_out)
is_valid_sorted(persons_in, persons_out)
is_valid_unchanged(persons_in, persons_out)
Compared to unit-testing - perhaps better characterized as example-based testing - where we come up with an example input and an expected output, here something (we'll come back to this soon) comes up with inputs, and we test that the input and output pair represent valid behavior of the system under test.
The power of this approach lies in combining multiple, individually simple properties. It's easy to come up with a trivially wrong implementation of sort_by_age
that passes one of the is_valid
properties - for example, to pass is_valid_unchanged
, sort_by_age
just needs to return the input list as-is. It's way harder to come up with an implementation that passes all three of them and that doesn't do exactly what we expect.
If you thought about writing is_valid
yourself a few paragraphs ago, you hopefully experienced that the thought process behind writing a property-based test is quite different from writing a unit test. Anecdotally, this is what many people struggle with when first exposed to property-based testing - "I have trouble coming up with properties" is an often heard complaint.
The rewards however are that we identified various aspects of the behavior of our function and made them clearly, separately and explicitly visible in the test code. While going through this process - and it is a process that can require some thought! - I typically learn something new about the implementation. For sort_by_age
perhaps you were wondering if the sort should be stable. According to the properties above it needn't be, but we could easily add that requirement if we wanted to. Such a concern would likely go unnoticed altogether with traditional unit testing. In an example-based test properties are all implicit, and anyone who reads the tests needs to infer them from the examples. A property-based test is like a lightweight specification, expressed directly in our programming language of choice.
Now, how about the magic generate_persons()
- how does that work? And how can we be sure it's actually generating sensible values, and not a gazillion empty lists?
This is exactly where property-based testing libraries like QuickCheck and its many descendants come in. They provide functions that allow you to write value generators for your domain types (like Person
). They also have a number of basic generators for primitive values, lists, dictionaries, and so on, that can then be composed into generators for custom types. We'll go into the structure of that API in much more detail in the rest of the series - to give you an idea here's a generator written in a real property-based testing library for Python, :
# hypothesis calls generators "strategies":
from hypothesis import strategies as st
composite_generator = st.tuples(st.booleans(), st.text())
@given(st.tuples(st.booleans(), st.text()))
def test_tuples(t):
assert len(t) == 2
assert isinstance(t[0], bool)
assert isinstance(t[1], str)
The best-known strategy is pseudo-random generation - this is the approach that QuickCheck pioneered. The idea is to generate values randomly, with a distribution skewed to values that typically cause bugs, like 0, -1, 1 for integers. This strategy is surprisingly effective in practice and relatively simple to implement. The downside is that it can generate very large values, and when a bug is found with a large value typically only a small part or aspect of the value triggers the bug. We don't know what this aspect is, which makes it hard to understand or debug the failure. To help with that, all modern property-based testing libraries attempt to "shrink" values if they fail the test - that is, when a test fails they'll try to find smaller values (e.g. smaller integers, shorter strings) that still fail the test. This process is comparable to running git bisect
to identify a commit that causes something to break.
Another strategy is exhaustive generation. There, all possible values for some type are generated in some well-defined order - typically from "small" to "large" values, and with some upper bound, as once you go past booleans the number of values for most types are (countably) infinite. For example, trying all the integers between -20 and 20 in "zig zag" order 0,1,-1,2,-2,.... for Haskell and for Scala do this, but this approach is not so well-known. It's a shame as random and exhaustive generation are complementary - if you think of generating values as exploring some large space to find failing tests, random generation is a serendipitous type of exploration, while exhaustive generation is diligently mapping out all the paths in some area.
The final strategy is what I'll call optimization-guided generation, where an algorithm tries to generate values such that some measure is maximized. In the approaches I know about, this measure is typically code coverage, and is achieved through smart code analysis. Effectively the code under test is symbolically executed, and a solver calculates input values that cause branches to go one way or the other. for .NET is one of those - unfortunately no longer maintained and probably not usable anymore by now. I'm not convinced the complexity is worth it - clearly the generation strategy is orders of magnitude more computationally expensive, not to mention the development and maintenance cost of these libraries. This may explain why I mostly see them coming out of academia or research.
QuickCheck is a tool for testing Haskell programs automatically. The programmer provides a specification of the program, in the form of properties which functions should satisfy, and QuickCheck then tests that the properties hold in a large number of randomly generated cases. Specifications are expressed in Haskell, using combinators defined in the QuickCheck library. QuickCheck provides combinators to define properties, observe the distribution of test data, and define test data generators.
I stole the sorting example and learnt about this approach for explaining property-based testing from the paper Using Relational Problems to Teach Property-Based Testing by John Wrenna, Tim Nelsona, and Shriram Krishnamurthia. This made a nice change from the typical (and much maligned) "the reverse of the reverse of a list is the original list" example. Any errors and omissions are entirely my responsibility.