伪造:受假设启发的 Haskell 收缩
Falsify: Hypothesis-Inspired Shrinking for Haskell (2023)

原始链接: https://www.well-typed.com/blog/2023/04/falsify/

Falsify是一个受Hypothesis启发的Haskell属性测试库,它提供“内部”收缩功能,避免了QuickCheck手动收缩和Hedgehog集成收缩的局限性,尤其是在monadic bind中。它利用无限样本树,不同于Hypothesis的线性流,从而实现更可预测的收缩和通过选择性函子对生成器独立性的更大控制。Falsify的方法能够跨越monadic bind进行收缩,因为它收缩的是底层样本而不是生成器。该库提供了常用数据类型的生成器、用于控制数据分布的倾斜范围以及受QuickCheck启发的函数生成/收缩功能。`testShrinkingOfGen`和`testMinimum`等工具允许测试生成器收缩行为。Falsify还提供了与QuickCheck和Hedgehog收缩风格的兼容性。

Hacker News 最新 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交 登录 Falsify:Haskell 的基于假设收缩 (well-typed.com) birdculture 13 分钟前 6 分 | 隐藏 | 过去 | 收藏 | 讨论 加入我们,参加 6 月 16-17 日在旧金山举办的 AI 初创公司学校! 指导原则 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系我们 搜索:

原文

Consider this falsify property test that tries to verify the (obviously false) property that all elements of all lists of up to 10 binary digits are the same (we will explain the details below; hopefully the intent is clear):

Hypothesis library showed the world how to achieve this. In this blog post we will introduce falsify, a new library that provides property based testing in Haskell and has an approach to shrinking that is inspired by Hypothesis. As we shall see, however, the details are quite different.

Background

In this first section we will discuss some of the background behind falsify; the next section will be a more tutorial-style introduction on how to use it. This section is not meant to an exhaustive discussion of the theory behind falsify, or how the theory differs from that of Hypothesis; both of those topics will be covered in a paper, currently under review. However, a basic understanding of these principles will help to use the library more effectively, and so that will be our goal in this first section.

Unit testing versus property based testing

In unit testing (for example using tasty-hunit), a test for a function f might look something like this:

Testing the Hard Stuff and Staying Sane, and then reading How to Specify It!: A Guide to Writing Properties of Pure Functions, both by the world’s foremost property-based testing guru John Hughes.

The importance of shrinking

Suppose we want to test the (false) property that for all numbers x and y, x - y == y - x:

Experiences with QuickCheck: Testing the Hard Stuff and Staying Sane:

Random tests contain a great deal of junk—that is their purpose! Junk provokes unexpected behaviour and tests scenarios that the developer would never think of. But tests usually fail because of just a few features of the test case. Debugging a test that is 90% irrelevant is a nightmare; presenting the developer with a test where every part is known to be relevant to the failure, simplifies the debugging task enormously.

(emphasis mine). For our example, the numbers 38 and 23 are not particularly relevant to the failure; with shrinking, however, the counter-example we will get is

x = 0
y = 1

Indeed, this is the only counter-example we will ever get: 0 is the smallest number possible (“least amount of detail”), and the only thing that is relevant about the second number is that it’s not equal to the first.

Parsing versus generation

Generation of inputs relies on pseudo-random number generators (PRNGs). The typical interface to a PRNGs is something like this:

QuickCheck to shrinking is to pair a generator with a shrinking function, a function of type

This works, but it’s not without its problems; see my blog post Integrated versus Manual Shrinking for an in-depth discussion. The key insight of the Hypothesis library is that instead of shrinking generated values, we instead shrink the samples produced by the PRNG. Suppose we unfold a PRNG to a stream of random samples:

definition of Gen in QuickCheck, you will see it’s actually different to the definition we showed above:

Generate Functions.

The falsify definition of Gen

If we apply the insight from Hypothesis (that is, parse samples rather than generate using PRNGs) to this new setting where splitting PRNGs is a fundamental operation, we arrive at the definition of Gen in falsify. First, unfolding a PRNG does not give us an infinite stream of samples, but rather an infinite tree of samples:

actual definition in falsify has an additional constructor Minimal, which is a finite representation of the infinite tree that is zero everywhere. This is a key component in making this work with infinite data structures; see upcoming paper for an in-depth discussion. Users of the library however generally do not need to be aware of this (indeed, the sample tree abstraction is not part of the public API).

Consequences of using sample trees

Arguably all of the key differences between Hypothesis and falsify stem from the difference in representation of samples: a linear stream in Hypothesis and an infinite tree in falsify. In this section we will discuss two consequences of this choice.

Shrinking the sample tree

First, we need to decide how to shrink a sample tree. In Hypothesis, the sample stream (known as a “choice sequence”) is subjected all kinds of passes (15 and counting, according to Test-Case Reduction via Test-Case Generation: Insights from the Hypothesis Reducer), which shrink the sample stream according to lexicographical ordering; for example:

..¸ x, ..        < .., x', ..       -- shrink an element (x' < x)
.., x, y, z, ..  < .., x, z, ..     -- drop an element from the stream
.., x, y, z, ..  < .., y, z, x, ..  -- sort part of the stream (y < z < x)

When we are dealing with infinite sample trees, such a total ordering does not exist. For example, consider the following two trees:

Integrated versus Manual Shrinking; this blog post discusses not only the problems in QuickCheck, but also shows one alternative approach, known as integrated shrinking, used by QuviQ QuickCheck and made popular in the Haskell world by the library hedgehog.

The problem with integrated shrinking is that it does not work across monadic bind. The linked blogpost explains this in great detail, but the essence of the problem is not hard to see. Consider the type of monadic bind:

predictability we discussed above.

We could try to make the two generators shrink independent from each other by simply running both of them, and using the boolean only to choose which result we want. After all, Haskell is lazy, and so this should be fine:

selective applicative functors give us. A detailed discussion of this topic would take us well outside the scope of this blog post; in the remainder of this section we will discuss the basics only.

Gen is a selective functor, which means that it is an instance of Selective, which has a single method called select:

definition in the falsify library itself.

Tutorial

With the background out of the way, let’s now consider how to actually use the library. Probably the easiest way to get started is to use the tasty integration. Here is a minimal template to get you started:

tasty package, as well as falsify of course. If you want, you can also use the Test.Falsify.Interactive module to experiment with falsify in ghci.

Getting started

Suppose we want to test that if we multiply a number by two, the result must be even. Here’s how we could do it:

  • unless) for monads are available
  • gen runs a generator, and adds the output of the generator to the test log. (The test log is only shown when the property fails.)
  • Gen.int is an alias for Gen.integral, which can produce values for any Integral type. There is no analogue of QuickCheck’s Arbitrary class in falsify: like in hedgehog and in Hypothesis, every generator must be explicitly specified. For a justification of this choice, see Jacob Stanley’s excellent Lambda Jam 2017 presentation Gens N’ Roses: Appetite for Reduction (Jacob is the author of hedgehog).
  • The specified Range tells the generator two things: in which range to produce a value, and how to shrink that value. In our example, withOrigin takes an “origin” as explicit value (here, 0), and the generator will shrink towards that origin.
  • testFailed is the primitive way to make a test fail, but we shall see a better way momentarily.

Predicates

Suppose we mistakingly think we need to multiply a number by three to get a even number:

Predicate. A predicate of type

is essentially a function

  • P.even, like even from the prelude. is a predicate that checks its argument is even
  • P.dot, like (.) from the prelude, composes a predicate with a function. In addition to the function itself, you also specify the name of the function, so that that name can be used in error messages.
  • [(.)][docs:falsify:dollar], like `()` from the prelude, applies a predicate to a named argument.

The use of predicates is not required, but can be very helpful indeed. For our running example, this will produce this test failure message instead:

multiply3_even_pred: FAIL
  failed after 2 successful tests and 13 shrinks
  not (even (multiply3 x))
  x          : 1
  multiply3 x: 3

Ranges, Labelling

We saw the use of withOrigin already, and earlier in this blog post we used between; a generator such as

skewedBy. A generator such as

collect to collect some statistics; specifically, in what percentage of the tests x is an element of xs. If we run this with a skew of 0, we might see something like:

100000 successful tests

Label "elem":
   94.6320% False
    5.3680% True

In only 5% of cases the element appears in the list. There are various ways in which we could change that distribution of test data, but the simplest way is simply to generate more values towards the lower end of the range; if we run the test with a skew of 5 we get

Label "elem":
   41.8710% False
   58.1290% True

Generators

Nearly all generators are built using prim as their basic building block, which returns the next sample from the sample tree. Higher-level generators split into two categories: “simple” (non-compound) generators that produce a value given some arguments, and generator combinators that take generators as input and produce new generators. Some important examples in the first category are:

  • integral, which we discussed already
  • bool, which produces a Bool, shrinking towards a choice of True or False
  • elem, which picks a random element from a list, and shuffle, which shuffles a list
  • etc.

The library also offers a number of generator combinators; here we list the most important ones:

  • choose we saw when we discussed Selective functors, and chooses (uniformly) between two generators, shrinking towards the first.

  • list takes a range for the list length and a generator and produces a list of values. Unlike the simple “pick a length and then call replicateM” approach from the example from the introduction, this generator can drop elements anywhere in the list (it does this by using the combinator mark to mark elements in the list; as the marks shrink towards “drop”, the element is removed, up to the specified Range).

  • frequency, similar to the like-named function in QuickCheck, takes a list of generators and frequencies, and chooses a generator according to the specified frequency. This is another way in which we can tweak the distribution of test data.

    The implementation of frequency ensures that the generators can shrink indepedently from each other. This could be defined just using Selective, but for improved performance it makes use of a low-level combinator called perturb; see also bindIntegral, which generalizes Selective bindS, and has significantly better performance than bindS.

Generating functions

One of the most impressive aspects of QuickCheck is that it can generate, show and shrink functions. This is due to a functional pearl by Koen Claessen called Shrinking and showing functions; the presentation is available on YouTube and is well worth a watch. We have adapted the QuickCheck approach (and simplified it slightly) to falsify; the generator is called fun. Here is an example (Fn is a pattern synonym; you will essentially always need it when generating functions):

Dependencies between commands of my blog post “An in-depth look at quickcheck-state-machine” discusses this kind of problem in the context of quickcheck-state-machine.

Testing shrinking

When we use internal (or indeed, integrated) shrinking, we don’t write a separate shrinking function, but that doesn’t mean we cannot get shrinking wrong. Shrinking never truly comes for free! As a simple example, consider writing a generator that produces any value below a given maximum (essentially, a more limited form of integral). A first attempt might be:

testShrinkingOfGen:

testMinimum. Let’s consider the naive list generator from the introduction one more time:

testShrinkingOfGen (and its generalization testShrinking) as well as testMinimum extensively to test falsify’s own generators.

Compatibility

Finally, falsify offers two combinators shrinkWith and fromShrinkTree which provide compatibility with QuickCheck style shrinking