Everything that is wrong with mocking, BDD, and RSpec

by Nick Gauthier

Below is a public excerpt from the recently released RSpec Book:

module Codebreaker
  describe Game do
    describe "#start" do
      it "sends a welcome message" do
        output = double('output')
        game = Game.new(output)
        output.should_receive(:puts).with('Welcome to Codebreaker!')
        game.start
      end
      it "prompts for the first guess"
    end
  end
end

This example illustrates what is wrong with standard BDD as practiced today. Consider the following correct implementation of the spec above:

module Codebreaker
  class Game
    def initialize(output)
      @output = output
    end
    def start
      @output.puts("Welcome to Codebreaker!")
    end
  end
end

Now consider the following refactorings of the start method:

@output.write("Welcome to Codebreaker!\n")
@output.print("Welcome to Codebreaker!\n")
> "Welcome to Codebreaker!".split.each{|c| output.write(c)}; output.write "\n"
@output 
> "Welcome to Codebreaker!\n"

All of these produce the exact same result, which is sending the string to the output stream. However, all of these refactorings will fail the test suite.

Now consider the following implementation:

@output.puts "Welcome to Codebreaker!"
@output << "You smell bad and are also ugly!"

That implementation passes the spec above, but it also insults our (paying?) users!

One of the primary reasons for having a test suite is so that I can refactor and have the tests validate my changes. If I have to change my tests and my code just to rewrite the way I achieve the same end result, then my test suite is useless for refactoring.

So now the obvious question is "Well Nick, how would you test it?"

require 'stringio'
output = StringIO.new
game = Game.new(output)
game.start
output.seek(0)
assert_equal "Welcome to Codebreaker!\n", output.read

This test passes for all the implementations listed initially, and will fail for the implementation in which we insult our users.

Think about the difference here. The original example is saying "Make sure the game calls puts with the following argument" whereas my test is saying "Make sure the game outputs the following string to the output object". The second case is far more useful.
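For completeness, here is that test wrapped up as a full runnable case (a minimal sketch using Ruby's bundled test/unit; StringIO#string reads the whole buffer without the seek):

require 'test/unit'
require 'stringio'

class GameStartTest < Test::Unit::TestCase
  def test_start_sends_welcome_message
    output = StringIO.new
    game = Codebreaker::Game.new(output)
    game.start
    # StringIO#string returns everything written, no seek required
    assert_equal "Welcome to Codebreaker!\n", output.string
  end
end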

Mocking is a great tool for dealing with stable APIs. However, when introduced in areas of the codebase that have even a small amount of churn, mocks can cripple the test suite with their brittleness. How can we expect to embrace change when our codebase can't?

RSpec has become so intertwined with mocking and stubbing practices that many people are taking mocking to the extreme and stubbing out their own code.

I will close by deferring to the wisdom of a higher power:

It is the high coupling between modules and tests that creates the need for a mocking framework. This high coupling is also the cause of the dreaded “Fragile Test” problem. How many tests break when you change a module? If the number is high, then the coupling between your modules and tests is high. Therefore, I conclude that those systems that make prolific use of mocking frameworks are likely to suffer from fragile tests.

...

Keeping middle-class test doubles (i.e. Mocks) to a minimum is another way of decoupling. Mocks, by their very nature, are coupled to mechanisms instead of outcomes. Mocks, or the setup code that builds them, have deep knowledge of the inner workings of several different classes. That knowledge is the very definition of high-coupling.

- Uncle Bob Martin on Object Mentor

Comments

dchelimsky
The test you propose will fail as soon as an additional message (praising the user?) needs to be displayed in response to start(), so it, too, is brittle due to its constraints. It's all trade-offs, my friend.

But thanks for promoting the book!
Nick Gauthier
Of course it will fail, because it is a spec on the start method, whose behavior has changed :-)
dchelimsky
Its behavior has been _extended_. If the existing message changed, that would be one thing, but the fact that there is an additional side-effect of calling start() should not cause this expectation to fail.

The issue here is that the behavior of the start() method is reflected in another object. Both examples inspect the collaborator to specify the outcome. Both have constraints that will cause future failures depending on how things change.

You could probably make something more flexible by using assert_match instead of assert_equal, but that wouldn't protect against a mean-spirited developer insulting users. Again, it's all trade-offs. Pick your poison.
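Something along these lines, say (the exact pattern is a matter of taste):

assert_match(/Welcome to Codebreaker!/, output.read)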
Nick Gauthier
Yeah, I agree. It depends on how much you want to assert. I was considering doing it as a regex match. I decided that I wanted to assert that the method output the exact message, no more and no less. It could be a looser test, and allow for other tests to check other behaviors.

I probably would do that in practice when I added more behavior to the method.

Decisions that aren't trade-offs never instigate an interesting debate.
jyurek
Extended or replaced, the behavior *changed*, and that's what tests are meant to detect, after all.

I think Nick's example highlights exactly why overmocking is a bad thing. IMO, I don't think this is a failure (per se) of BDD, RSpec, or even Mocking as a concept, it's just not the best test for the situation.
jollyjerry
I definitely agree with your point that mocking and stubbing can easily be taken too far. What I hate the most is when I see an entire chain of internal implementation stubbed out (foo.hidden.internal.method.baz). It ends up coupling your tests to your implementation, and since tests are also code, you make your code spaghetti that much more tangled. I'm just as guilty as the next guy when it comes to thinking abstractly about what the functionality should be, rather than duplicating my code in test form. Like dchelimsky said, it's a hard problem!
Evan Light
Agreed -- including that perhaps you should be using an "assert output.read =~ //".

This is why, as a general rule, I have learned to largely avoid mocking and stubbing unless I am writing unit tests against an external system.

Yes, test doubles are great for isolating the unit under test. However, that isolation comes at a cost.

Should an external dependency change, your test-doubled tests will lie to you! The test double is blissfully unaware that the interface it has doubled has changed. Therefore, your test will inaccurately pass. Under such circumstances, I want my tests to fail loudly!

In my experience, this has resulted in significant time lost tracking down production application problems!

TL;DR White-box testing coupled with test doubles is dangerous. I strongly prefer an integration test that exercises the external dependency.
Evan Light
David: While Nick's test may have been brittle, at least the test would have failed as expected. Should the output string change, his test fails. Should the implementation use print, his test fails.

The failure of the test is adequate to ensure that the test is revisited and updated.

Honestly, I admire so much of what you have done for the Ruby community. However, your tenacious adherence to mocking and stubbing as a routine TDD technique still utterly bewilders me. The brittleness of the resulting test hardly seems worth the benefit in almost any case except the rare external dependency.
myron.marston
Careless mocking can definitely make a test suite excessively brittle, but my overall experience with mocking has been the opposite. I used to practice a more classical TDD approach, where I used little to no mocking, and I wound up with bloated, slow, brittle test suites. After experimenting a bit and reading the RSpec book, I've started using mocking as a regular technique to achieve isolation in my unit tests and it's helped me create far less brittle test suites (not to mention they are much, much faster!).

It's definitely easy to shoot yourself in the foot with mocking (just like it is with ruby, git, or any of our other standard, powerful tools we use daily), but that doesn't mean it's not a useful, worthwhile tool: you just have to be aware of the dangers of over-mocking and mock with care.
Nick Gauthier
Test suite speed is not a valid argument for mocking.

http://grease-your-suite.heroku.com/
Evan Light
@Myron: How were your tests more brittle before introducing test doubles?

If your test has clear preconditions, execution of the code under test, and clear expectations, I don't see how that leads to brittleness.

Can you give an example of one of your fragile pre-test double tests?
Jim Gay
I agree. If you're doing should_receive(:puts) then you're saying that calling "puts" is important.

All of this is behavior of the code, you just need to spec what behavior is important: the methods called, or the output. Sometimes it might be important that a method is called, but not in this situation.
Nick
I completely agree with Jim. It comes down to what behaviour is important, and thus needs to be tested.

If Codebreaker only cares that @output begins with "Welcome to Codebreaker!", then Nick Gauthier's approach is ideal.

If Codebreaker relies on #puts being called on @output, then the "bad test" is ideal.

As always, everything's contextual: what's important to you?
Nick Gauthier
@Jim @Nick

I totally agree. What I dislike is the tendency for people to write a test against methods being called, when that is almost always not what is important in the test.

Continuously asking "What am I testing and why am I testing it" is very important to writing good code and good tests.
Nick
> Continuously asking "What am I testing and why am I testing it" is very important to writing good code and good tests.

Hear hear!
tooky
If you are using mocks and stubs in your unit tests then you need to have a good set of acceptance/integration tests as well.

I like using mocks and stubs in my unit tests as they allow me to isolate the area of the code I'm working in, and I don't have to worry about side-effects or complex set-up of other objects. I also use cucumber though, this gives me great black-box coverage of the system.

So while changing the original example will make this spec break, it won't make any other specs in the system break, because no other specs will depend on its behaviour. So long as the behaviour is preserved it won't break the acceptance/integration tests either.
Pat Maddox
The mock-based example specifies the interaction with the collaborator. It demonstrates that you can pass in any object that responds to #puts(str) and expect it to work.

Your refactoring is valid -- when the output object responds to #puts, #write, #print, and #<<, and implements them all in basically the same way.

You are free to change the internals of a method however you want, but remember that certain changes will have broader-reaching effects than others. Changing how you build up the string to, say, an array of strings that then get joined, should certainly be encapsulated. Changing how you call collaborators is a totally different thing -- you're changing the contract, potentially in a way that's incompatible with existing clients.

Focused examples (or unit tests, or microtests, or whatever) help answer the question, "what do I need to know to use this code?" When there is a collaborator involved, one of the key things to know is the protocol that the collaborator must adhere to. An interaction-based example makes that protocol explicit. Moreover, it establishes a checkpoint. Should you decide to call #write instead of #puts, you have to consider the possible implications to existing code.

I understand where you're coming from. When you run the program, there is no externally visible difference in behavior. That is why we tend to separate the duties of checking program behavior into acceptance tests. Your unit tests then become a place for you to check an object's correctness, which includes its own logic and its interactions with other objects. Don't assume that because the external behavior of the program stays the same, that the behavior internal to objects has stayed the same as well.
admin
tl;dr of Pat's comment: I'mma let you finish, but the difference between unit tests and acceptance tests is one of the best differences of all time.
Nick Gauthier
haha thanks @admin.

@Pat I understand that is the case when you are unit testing between collaborators where specific method invocation is important. However, in the example I used, it was clear to me that the caller was not a collaborator but a higher-level controller or supervisor. Actors at a higher level of abstraction should not be concerned with implementation details, but rather with results.

To sum it up, I interpreted this as an acceptance test.
dchelimsky
Evan - there is a chapter in The RSpec Book on Mock Objects that talks about when to use them and the potential pitfalls (including the one that Nick brings up here).

Mocking external dependencies is actually considered to be a mocking no-no, because then you create a binding between your app and the external dependency's API. Instead, write a thin wrapper around that dependency that speaks in the domain of your app, and mock that as a means of isolating your app's code/specs from that dependency.

But that is only the tip of the iceberg when it comes to use cases served well by mocks. Others include: non-deterministic collaborators (intentionally random behavior), polymorphic collaborators (no need to have 3 examples when 1 would do), protocol definitions (where which methods are called represents adherence to a contract), collaborators that don't even exist yet, etc, etc.
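To sketch the wrapper idea (all names here are hypothetical):

# The app defines a thin wrapper that speaks its domain language...
class PaymentGateway
  def initialize(client = SomeVendor::Client.new)
    @client = client
  end

  def charge(order)
    @client.create_transaction(:amount_cents => order.total_cents)
  end
end

# ...and a spec mocks the wrapper, never the vendor's API directly:
order = Order.new # hypothetical domain object
gateway = double('payment gateway')
gateway.should_receive(:charge).with(order)
Checkout.new(gateway).complete(order)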
Nick Gauthier
@dchelimsky

I read your slides from RubyConf and I agree with your stance on mocking. However, about 95% of people who copy rspec examples don't understand the implications and often code themselves into mocking hell. They are my target audience :-)
dchelimsky
"To sum it up, I interpreted this as an acceptance test."

Part of the aim of the book is to make a distinction between application and object behavior: Cucumber is well suited to specify things at the application level (end to end tests/acceptance tests/customer tests, etc), and RSpec is well suited to specify things at the object level (unit tests, micro tests, developer tests, etc).

The example you cite comes from a chapter on describing the behavior of objects with RSpec (i.e. unit tests), after a chapter on describing the behavior of applications with Cucumber (i.e. acceptance tests), and is not at all intended to be an acceptance test. Hope that clears at least that part of this up for you.
dchelimsky
@nickgauthier - I appreciate wanting to help people understand the subtle waters of mocking. Unfortunately, the title of this post is not in any way nuanced or subtle. If your target audience is people who blindly copy examples without thinking about their implications or trying to learn to understand them, it's likely that they won't actually read this full post, much less the thoughtful discussion that follows.
Nick Gauthier
If I had to change the title, it would be expanded to "Everything that is wrong with how everyone uses mocks in RSpec to do BDD"

Every time I've encountered mocking in test code (aside from 3rd-party calls), it has fallen into the category this post describes.

While you describe a sound way to use mocks to test object behavior, I have never seen it executed properly. This is not to say it cannot be done, but that everyone is doing it wrong.

And yeah, you can't defeat the cargo-cult.
Evan Light
@David: I'm familiar with that practice RE: APIs. You took me a little too literally. I always abstract them a level.

I'll have to take another gander at the (finally ;-) ) completed RSpec book. I bought the beta a very long time ago and haven't read it much since. I'm curious about what you have to say about mocking.

However, I've generally found white box testing to be too implementation specific. Granted, I realize that there are almost "camps" among TDD'ers: the 'behavioral statists' that I likely fall under, the 'unit mockists' that you likely fall under, and others. I tend to prefer writing acceptance/integration tests with only supporting unit tests as necessary to help me "fill in the gaps" where the feature is too complex to develop without unit tests to handle complex logic (see http://evan.tiggerpalace.com/articles/2010/12/02/stop-writing-code-you-dont-need/).
zenspider
@ngauthier:

Via minitest (or test/unit in 1.9):

def test_start_output
  assert_output "Welcome to Codebreaker!\n", "" do
    Game.new($stdout).start # assert_output swaps $stdout for a captured StringIO
  end
end

(blogger.... you suck. no pre tag allowed? no formatting help? bite me)

@dchelimsky: the rspec code would break too because EVERYONE uses .once. And remember, since we do test-first, it wouldn't break because we changed the output, it'd break because we added additional assertions to the test that we then needed to make pass. That isn't brittle. That's process.
myron.marston
@Evan: Re: "Can you give an example of one of your fragile pre-test double tests?"

The code base I'm thinking of was a Rails app, and simple changes to models (e.g. adding a new required field or changing a validation) would cascade across the test suite and break lots of other tests, because those tests depended on the existing behavior of the model when they shouldn't have. For example, a controller functional test that passed in valid attributes and asserted some result failed because the passed attributes were no longer valid. You may argue that the breaking controller test was a good thing, but I don't agree. I like to have lots of fast, isolated unit tests (which do use mocking carefully), as well as some full-stack integration tests (which don't use mocking). In this case, a change to a model may break a unit test and an integration test, but it's not going to cascade across my test suite.

Having controller tests that depended on the knowledge of what constituted a valid record was a very brittle approach. Mocking has worked well to help me unit test my controllers.
dchelimsky
@zenspider - that works assuming you have a test named test_start_output and it's the only test ever written that cares about the output of the start() method. As soon as additional context requires that start provide different output under different conditions, either this test would start to get unruly (too many states/assertions), or new tests would emerge that, in making them pass, might cause this one to break.

re: the rspec code and "once" - in fact, it breaks just 2 or 3 pages later in the book, and a wonderful learning opportunity emerges. Hooray for examples taken out of context!
Sean DeNigris
Of course, none of these pitfalls have anything to do with BDD or RSpec. Just saying...
Pat Maddox
Can you please gist a file that demonstrates that the spec passes even when you include the insulting message to the user? Here's what I got: https://gist.github.com/730609
Nick Gauthier
@Pat ah you're right, since it's a double it will complain about extra method calls. This is why mocking becomes annoying when your method does a few things with its input. Any test will need an elaborate mocking setup.
dchelimsky
@Nick - a page or two later in the book we add "as_null_object" to the double, and the need for an elaborate mocking setup is eliminated.
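Roughly (a sketch of the idea, not the book's exact code):

output = double('output').as_null_object
output.should_receive(:puts).with('Welcome to Codebreaker!')
Game.new(output).start # any other messages sent to output are simply ignored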

That said, we've all been missing the deeper problem here, which was recognized immediately by Nat Pryce and reflected in these two tweets:

http://twitter.com/natpryce/status/12029918607048704

http://twitter.com/natpryce/status/12030541293424641

We've been talking about pain in the example, but none of us have "listened to the test". The real problem here is that there is a missing abstraction. Change "output" to "reporter" (or similar) and change "puts" to "message" (or similar) and the refactoring concern you raised goes away, everything is "speaking" in the domain, and all is right with the world.
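To sketch it (illustrative names only):

class Game
  def initialize(reporter)
    @reporter = reporter
  end

  def start
    @reporter.message('Welcome to Codebreaker!')
  end
end

reporter = double('reporter')
reporter.should_receive(:message).with('Welcome to Codebreaker!')
Game.new(reporter).start

The spec is now coupled to the domain concept of reporting a message, not to which IO method happens to carry it.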
Nick Gauthier
@dchelimsky I think the underlying concept still stands.

A mocking unit testing strategy encourages testing the implementation of a method because it's easy. My argument is that it doesn't get you very much in the way of code validation.

When I write tests, I think about the effects of my code, and then I validate that the effects occurred.

For example, consider testing user registration. Which of these scenarios below is most useful?

When I register on the site

=>

1) Then the database should be asked to insert a user
2) Then there should be 50 users
3) Then there should be an additional user
4) Then I can log in

I like #4, because it focuses on the input and the ultimately desired result and goal of a feature. This is user acceptance testing, not unit testing.

But in unit testing I like to focus on the same goals and ask the same questions.

For example, if I'm implementing a "users sorted by join date" method, I can see two ways of going about it:

1) Mock the user model and ensure that the storage adapter is called with the appropriate method and parameters to return a sorted list of users.
2) Create a couple of users, call the method, and see if the results are sorted.

#1 is white-box and #2 is black box. I'll pick black box every time because it focuses on my goals, and not my process.
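For #2, that might look something like this (User's API here is hypothetical):

early = User.create(:name => 'early adopter', :joined_on => Date.new(2010, 1, 1))
late  = User.create(:name => 'late arrival',  :joined_on => Date.new(2010, 6, 1))
assert_equal [early, late], User.sorted_by_join_date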



Regarding the tweets:

Yes, it's easy to blame the tools, but what if a particular tool can be easily misused? In the best of hands, it is revolutionary and powerful, but in most people's hands it's destructive. For example, morphine.

I think mocking is being abused and overused and is causing people pain in the long run. It's enticing because it's simple to implement, but I find the assertions it makes to be weak and brittle.

I will agree, however, that in cases where you have a stable API, mocking can make your life easier (especially in large and complicated codebases). However, I've always found that it's not that hard to create helper methods to set up elaborate scenarios in which to do a real goal-based test. There are great gems out there like Cucumber and Factory Girl that allow you to encapsulate complicated functionality in simple steps.
dchelimsky
@Nick - "but what if a particular tool can be easily misused? In the best of hands, it is revolutionary and powerful, but in most people's hands it's destructive. For example, morphine."

By that logic, we should all stop using Ruby :)

re: the "sorted users" example, I agree that is likely better specified with a more black box approach.

The distinction is that, in the sorted users example, the User class returns the object (a list of users) on which expected outcomes can be specified.

In the output example, the outcome is specified _on a collaborator_. Whether we specify that as an interaction or post-action state-based verification, the example is still bound to the collaborator object, so there is little difference in terms of dependencies and brittleness.

Using a test double for "reporter" gives us the freedom to introduce different "reporter" solutions (command line, text file, web page, etc) without these examples failing. In that context, the use of a test double and a message expectation is actually less brittle than the alternatives.
Nick Gauthier
@dchelimsky no, I'm not saying stop using ruby. I'm not saying stop using morphine in hospitals either.

It's more of an education perspective. I want to show people how they should and should not use tools so they don't dig themselves into a hole.


I think we're in agreement that when a collaborator's interface is well defined and stable that mocking is a good solution.

The problem is that whenever I see examples of mocking it's always in (what I deem to be) inappropriate scenarios.

If you showed a reporter class with a simple API of reporting messages, then showed the Codebreaker class working with a mocked Reporter, I'd be much happier with your example. I worry about a generation of rubyists who will mock their test suites to hell and then be stuck in a broken world.

Of course, anyone can write bad code in any framework, but I want to teach people the appropriate uses with solid examples.

I was very disappointed to see this example in the excerpts of the rspec book, because I've had to refactor and fix code like that and it's not fun.
dchelimsky
Well you've got me there. This is clearly a bad excerpt and I'll have it changed soon. I hope you'll consider reading the rest of the tutorial in spite of the excerpt. I'd welcome your feedback.

I think we have the same goals in terms of deepening the understanding of when/where to use different tools. I don't think we agree on everything here (I don't "always" prefer anything), but I think we probably agree on more than we disagree on.

Cheers,
David