Everything that is wrong with mocking, BDD, and RSpec
by Nick Gauthier
Below is a public excerpt from the recently released RSpec Book:
module Codebreaker
  describe Game do
    describe "#start" do
      it "sends a welcome message" do
        output = double('output')
        game = Game.new(output)
        output.should_receive(:puts).with('Welcome to Codebreaker!')
        game.start
      end

      it "prompts for the first guess"
    end
  end
end
This example illustrates what is wrong with standard BDD as practiced today. Consider the following correct implementation of the spec above:
module Codebreaker
  class Game
    def initialize(output)
      @output = output
    end

    def start
      @output.puts("Welcome to Codebreaker!")
    end
  end
end
Now consider the following refactorings of the start method:
@output.write("Welcome to Codebreaker!\n")

@output.print("Welcome to Codebreaker!\n")

"Welcome to Codebreaker!".split('').each { |c| @output.write(c) }; @output.write "\n"

@output << "Welcome to Codebreaker!\n"
All of these produce the exact same result, which is sending the string to the output stream. However, all of these refactorings will fail the test suite.
Now consider the following implementation:
@output.puts "Welcome to Codebreaker!"
@output << "You smell bad and are also ugly!"
That implementation passes the spec above, but it also insults our (paying?) users!
One of the primary reasons for having a test suite is so that I can refactor and have the tests validate my changes. If I have to change my tests and my code just to rewrite the way I cause the same end result, then my test suite is useless for refactoring.
So now the obvious question is "Well Nick, how would you test it?"
require 'stringio'

output = StringIO.new
game = Game.new(output)
game.start
output.seek(0)
assert_equal "Welcome to Codebreaker!\n", output.read
This test passes for all the implementations listed initially, and will fail for the implementation in which we insult our users.
Think about the difference here. The original example is saying "Make sure the game calls puts with the following argument" whereas my test is saying "Make sure the game outputs the following string to the output object". The second case is far more useful.
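For comparison, the same outcome-focused check can be written in RSpec's own syntax. This is a minimal sketch using StringIO#string, in my phrasing rather than the book's:

require 'stringio'

module Codebreaker
  describe Game do
    describe "#start" do
      it "sends a welcome message" do
        output = StringIO.new
        Game.new(output).start
        output.string.should == "Welcome to Codebreaker!\n"
      end
    end
  end
end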
Mocking is a great tool for dealing with stable APIs. However, when introduced in areas of the codebase that have even a small amount of churn, mocks can cripple the test suite with their brittleness. How can we expect to embrace change when our codebase can't?
RSpec has become so intertwined with mocking and stubbing practices that many people are taking mocking to the extreme and stubbing out their own code.
I will close by deferring to the wisdom of a higher power:
It is the high coupling between modules and tests that creates the need for a mocking framework. This high coupling is also the cause of the dreaded “Fragile Test” problem. How many tests break when you change a module? If the number is high, then the coupling between your modules and tests is high. Therefore, I conclude that those systems that make prolific use of mocking frameworks are likely to suffer from fragile tests.
...
Keeping middle-class test doubles (i.e. Mocks) to a minimum is another way of decoupling. Mocks, by their very nature, are coupled to mechanisms instead of outcomes. Mocks, or the setup code that builds them, have deep knowledge of the inner workings of several different classes. That knowledge is the very definition of high-coupling.
- Uncle Bob Martin on Object Mentor
But thanks for promoting the book!
The issue here is that the behavior of the start() method is reflected in another object. Both examples inspect the collaborator to specify the outcome. Both have constraints that will cause future failures depending on how things change.
You could probably make something more flexible by using assert_match instead of assert_equal, but that wouldn't protect against a mean-spirited developer insulting users. Again, it's all trade-offs. Pick your poison.
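Something like this, for example (a sketch):

assert_match /^Welcome to Codebreaker!/, output.read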
I probably would do that in practice when I added more behavior to the method.
Decisions that aren't trade-offs never instigate an interesting debate.
I think Nick's example highlights exactly why overmocking is a bad thing. IMO, I don't think this is a failure (per se) of BDD, RSpec, or even mocking as a concept; it's just not the best test for the situation.
This is why, as a general rule, I have learned to largely avoid mocking and stubbing unless I am writing unit tests against an external system.
Yes, test doubles are great for isolating the unit under test. However, that isolation comes at a cost.
Should an external dependency change, your test-doubled tests will lie to you! The test double is blissfully unaware that the interface it has doubled has changed. Therefore, your test will inaccurately pass. Under such circumstances, I want my tests to fail loudly!
In my experience, this has resulted in significant time lost tracking down production application problems!
TL;DR White-box testing coupled with test doubles is dangerous. I strongly prefer an integration test that exercises the external dependency.
The failure of the test is adequate to ensure that the test is revisited and updated.
Honestly, I admire so much of what you have done for the Ruby community. However, your tenacious adherence to mocking and stubbing as a routine TDD technique still utterly bewilders me. The brittleness of the resulting test hardly seems worth the benefit in almost any case except the rare external dependency.
It's definitely easy to shoot yourself in the foot with mocking (just like it is with Ruby, git, or any of the other standard, powerful tools we use daily), but that doesn't mean it's not a useful, worthwhile tool: you just have to be aware of the dangers of over-mocking and mock with care.
http://grease-your-suite.heroku.com/
If your test has clear preconditions, execution of the code under test, and clear expectations, I don't see how that leads to brittleness.
Can you give an example of one of your fragile pre-test double tests?
All of this is behavior of the code, you just need to spec what behavior is important: the methods called, or the output. Sometimes it might be important that a method is called, but not in this situation.
If Codebreaker only cares that @output begins with "Welcome to Codebreaker!", then Nick Gauthier's approach is ideal.
If Codebreaker relies on #puts being called on @output, then the "bad test" is ideal.
As always, everything's contextual: what's important to you?
I totally agree. What I dislike is the tendency for people to write a test against methods being called, when that is almost always not what is important in the test.
Continuously asking "What am I testing and why am I testing it" is very important to writing good code and good tests.
Hear hear!
I like using mocks and stubs in my unit tests as they allow me to isolate the area of the code I'm working in, and I don't have to worry about side-effects or complex set-up of other objects. I also use Cucumber, though, which gives me great black-box coverage of the system.
So while changing the original example will make this spec break, it won't make any other specs in the system break, because no other specs will depend on its behaviour. So long as the behaviour is preserved it won't break the acceptance/integration tests either.
Your refactoring is valid -- when the output object responds to #puts, #write, #print, and #<<, and implements them all in basically the same way.
You are free to change the internals of a method however you want, but remember that certain changes will have broader-reaching effects than others. Changing how you build up the string to, say, an array of strings that then get joined, should certainly be encapsulated. Changing how you call collaborators is a totally different thing -- you're changing the contract, potentially in a way that's incompatible with existing clients.
Focused examples (or unit tests, or microtests, or whatever) help answer the question, "what do I need to know to use this code?" When there is a collaborator involved, one of the key things to know is the protocol that the collaborator must adhere to. An interaction-based example makes that protocol explicit. Moreover, it establishes a checkpoint. Should you decide to call #write instead of #puts, you have to consider the possible implications to existing code.
I understand where you're coming from. When you run the program, there is no externally visible difference in behavior. That is why we tend to separate the duties of checking program behavior into acceptance tests. Your unit tests then become a place for you to check an object's correctness, which includes its own logic and its interactions with other objects. Don't assume that because the external behavior of the program stays the same, that the behavior internal to objects has stayed the same as well.
@Pat I understand that is the case when you are unit testing between collaborators where specific method invocation is important. However in the example I used, it was clear to me that the caller was not a collaborator but a higher level controller or supervisor. Actors at a higher level of abstraction should not be concerned with the implementation details but rather the results.
To sum it up, I interpreted this as an acceptance test.
Mocking external dependencies is actually considered to be a mocking no-no, because then you create a binding between your app and the external dependency's API. Instead, write a thin wrapper around that dependency that speaks in the domain of your app, and mock that as a means of isolating your app's code/specs from that dependency.
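A sketch of what such a wrapper might look like; the class, the vendor client, and every method name here are hypothetical:

class PaymentGateway
  def initialize(client)
    @client = client  # e.g. some vendor SDK object (hypothetical)
  end

  # Speaks the app's domain language; specs mock this class, not the vendor API
  def charge(user, amount_cents)
    @client.create_transaction(user.billing_token, amount_cents)
  end
end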
But that is only the tip of the iceberg when it comes to use cases served well by mocks. Others include: non-deterministic collaborators (intentionally random behavior), polymorphic collaborators (no need to have 3 examples when 1 would do), protocol definitions (where which methods are called represents adherence to a contract), collaborators that don't even exist yet, etc, etc.
I read your slides from rubyconf and I agree with your stance on mocking. However about 95% of people that copy rspec examples don't understand the implications and often code themselves into mocking hell. They are my target audience :-)
Part of the aim of the book is to make a distinction between application and object behavior: Cucumber is well suited to specify things at the application level (end to end tests/acceptance tests/customer tests, etc), and RSpec is well suited to specify things at the object level (unit tests, micro tests, developer tests, etc).
The example you cite comes from a chapter on describing the behavior of objects with RSpec (i.e. unit tests), after a chapter on describing the behavior of applications with Cucumber (i.e. acceptance tests), and is not at all intended to be an acceptance test. Hope that clears at least that part of this up for you.
Every time I've encountered mocking in test code (aside from 3rd party calls) I've seen it fall under the category of the content of this post.
While you describe a sound way to use mocks to test object behavior, I have never seen it executed properly. This is not to say it cannot be done, but that everyone is doing it wrong.
And yeah, you can't defeat the cargo-cult.
I'll have to take another gander at the (finally ;-) ) completed RSpec book. I bought the beta a very long time ago and haven't read it much since. I'm curious about what you have to say about mocking.
However, I've generally found white box testing to be too implementation specific. Granted, I realize that there are almost "camps" among TDD'ers: the 'behavioral statists' that I likely fall under, the 'unit mockists' that you likely fall under, and others. I tend to prefer writing acceptance/integration tests with only supporting unit tests as necessary to help me "fill in the gaps" where the feature is too complex to develop without unit tests to handle complex logic (see http://evan.tiggerpalace.com/articles/2010/12/02/stop-writing-code-you-dont-need/).
Via minitest (or test/unit in 1.9):
def test_start_output
  assert_output "Welcome to Codebreaker!\n", "" do
    Game.new.start
  end
end
(blogger.... you suck. no pre tag allowed? no formatting help? bite me)
@dchelimsky: the rspec code would break too because EVERYONE uses .once. And remember, since we do test-first, it wouldn't break because we changed the output, it'd break because we added additional assertions to the test that we then needed to make pass. That isn't brittle. That's process.
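That is, in practice the expectation would look something like:

output.should_receive(:puts).with('Welcome to Codebreaker!').once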
The code base I'm thinking of was a rails app, and simple changes to models (i.e. adding a new required field or changing a validation) would cascade across the test suite and break lots of other tests because those tests depended on the existing behavior of the model when they shouldn't have. For example, a controller functional test that passed in valid attributes and asserted some result would fail because those attributes were no longer valid. You may argue that the breaking controller test was a good thing, but I don't agree. I like to have lots of fast, isolated unit tests (which do use mocking carefully), as well as some full-stack integration tests (which don't use mocking). In this case, a change to a model may break a unit test and an integration test, but it's not going to cascade across my test suite.
Having controller tests that depended on the knowledge of what constituted a valid record was a very brittle approach. Mocking has worked well to help me unit test my controllers.
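A sketch of the kind of isolated controller spec I mean (RSpec 2-era syntax; the controller and attributes are hypothetical):

describe UsersController do
  describe "POST create" do
    it "assigns a new user without knowing what makes a record valid" do
      user = double('user', :save => true)
      User.stub(:new).and_return(user)
      post :create, :user => { :name => 'Nick' }
      assigns(:user).should == user
    end
  end
end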
re: the rspec code and "once" - in fact, it breaks just 2 or 3 pages later in the book, and a wonderful learning opportunity emerges. Hooray for examples taken out of context!
That said, we've all been missing the deeper problem here, which was recognized immediately by Nat Pryce and reflected in these two tweets:
http://twitter.com/natpryce/status/12029918607048704
http://twitter.com/natpryce/status/12030541293424641
We've been talking about pain in the example, but none of us have "listened to the test". The real problem here is that there is a missing abstraction. Change "output" to "reporter" (or similar) and change "puts" to "message" (or similar) and the refactoring concern you raised goes away, everything is "speaking" in the domain, and all is right with the world.
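Concretely, that refactoring might look like this (a sketch; "reporter" and "message" are the names suggested above):

module Codebreaker
  class Game
    def initialize(reporter)
      @reporter = reporter
    end

    def start
      @reporter.message("Welcome to Codebreaker!")
    end
  end
end

And the example now speaks in the domain as well:

reporter = double('reporter')
reporter.should_receive(:message).with('Welcome to Codebreaker!')
Game.new(reporter).start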
A mocking unit testing strategy encourages testing the implementation of a method because it's easy. My argument is that it doesn't get you very much in the way of code validation.
When I write tests, I think about the effects of my code, and then I validate that the effects occurred.
For example, consider testing user registration. Which of these scenarios below is most useful?
When I register on the site
=>
1) Then the database should be asked to insert a user
2) Then there should be 50 users
3) Then there should be an additional user
4) Then I can log in
I like #4, because it focuses on the input and the ultimately desired result and goal of a feature. This is user-acceptance testing, not unit testing.
But in unit testing I like to focus on the same goals and ask the same questions.
For example, if I'm implementing a "users sorted by join date" method, I can see two ways of going about it:
1) Mock the user model and ensure that the storage adapter is called with the appropriate method and parameters to return a sorted list of users.
2) Create a couple of users, call the method, and see if the results are sorted.
#1 is white-box and #2 is black box. I'll pick black box every time because it focuses on my goals, and not my process.
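A sketch of #2; the model, attribute, and method name are all hypothetical:

def test_users_sorted_by_join_date
  early = User.create(:joined_on => '2010-01-01')  # hypothetical model/attribute
  late  = User.create(:joined_on => '2010-06-01')
  assert_equal [early, late], User.sorted_by_join_date
end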
Regarding the tweets:
Yes, it's easy to blame the tools, but what if a particular tool can be easily misused? In the best of hands, it is revolutionary and powerful, but in most people's hands it's destructive. For example, morphine.
I think mocking is being abused and overused and is causing people pain in the long run. It's enticing because its simple to implement, but I find the assertions it makes to be weak and brittle.
I will agree, however, that in cases where you have a stable API that mocking can make your life easier (especially in large and complicated codebases). However, I've always found that it's not that hard to create helper methods to setup elaborate scenarios in which to do a real goal-based test. There are great gems out there like cucumber and factory girl that allow you to encapsulate complicated functionality in simple steps.
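For example, a sketch using factory_girl's 2010-era API (the attribute names are assumed):

Factory.define :user do |f|
  f.name 'Nick'
  f.sequence(:email) { |n| "user#{n}@example.com" }
end

# A helper that hides an elaborate setup behind one call:
def registered_user
  Factory(:user)
end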
By that logic, we should all stop using Ruby :)
re: the "sorted users" example, I agree that is likely better specified with a more black box approach.
The distinction is that, in the sorted users example, the User class returns the object (a list of users) on which expected outcomes can be specified.
In the output example, the outcome is specified _on a collaborator_. Whether we specify that as an interaction or post-action state-based verification, the example is still bound to the collaborator object, so there is little difference in terms of dependencies and brittleness.
Using a test double for "reporter" gives us the freedom to introduce different "reporter" solutions (command line, text file, web page, etc) without these examples failing. In that context, the use of a test double and a message expectation is actually less brittle than the alternatives.
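For instance, any of these (hypothetical) reporters would satisfy the same message protocol without the Game examples changing:

class ConsoleReporter
  def message(text)
    $stdout.puts(text)
  end
end

class FileReporter
  def initialize(path)
    @path = path
  end

  def message(text)
    File.open(@path, 'a') { |f| f.puts(text) }
  end
end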
It's more of an education perspective. I want to show people how they should and should not use tools so they don't dig themselves into a hole.
I think we're in agreement that when a collaborator's interface is well defined and stable that mocking is a good solution.
The problem is that whenever I see examples of mocking it's always in (what I deem to be) inappropriate scenarios.
If you showed a reporter class with a simple API for reporting messages, then showed the Codebreaker class working with a mocked Reporter, I'd be much happier with your example. I worry about a generation of Rubyists who will mock their test suites to hell and then be stuck in a broken world.
Of course, anyone can write bad code in any framework, but I want to teach people the appropriate uses with solid examples.
I was very disappointed to see this example in the excerpts of the rspec book, because I've had to refactor and fix code like that and it's not fun.
I think we have the same goals in terms of deepening the understanding of when/where to use different tools. I don't think we agree on everything here (I don't "always" prefer anything), but I think we probably agree on more than we disagree on.
Cheers,
David