Flaky automated tests are the bane of software teams. They lead developers and managers to question the value of the tests, and in some cases developers might ignore or even refuse to use them. We all know that tests are supposed to pass when there are no bugs, and I think it’s a slam-dunk argument that we should write tests in a way that gives them the best chance of this being true. Do you want to remove test flakes? Great. In this post, I’m going to talk about a single thing you can do in your automated tests (if you’re not doing it already!) which, in one fell swoop, will make your automated tests:

  1. More reliable
  2. Faster
  3. More diagnosable
  4. More maintainable
  5. More likely to find bugs
  6. Easier to understand

As the title suggests, I’m talking about removing blind sleeps from your test code. Consider the following Python code:

Flaky Test

from time import sleep

def test_some_website():
  open_web_page()
  log_in()
  sleep_duration = 1
  sleep(sleep_duration)
  do_something_on_logged_in_page()

def log_in():
  fill_in_details()
  press_login_button()
  
def do_something_on_logged_in_page():
  check_for_some_text_that_should_appear("You have logged in!")
  # Do some other checks, perhaps....
  print("Excellent, everything worked")

I’m amazed how often I’ve seen code like this written by experienced testers and developers. It’s easy to write, and given the correct sleep duration, the test will probably pass – at least on your local machine – so you can see why people do it. Hopefully, though, if you take a moment, some of the problems become fairly obvious. Before talking about those problems in detail, let me introduce just one form of a proposed alternative:

Test with a wait

import time
from time import sleep

def test_some_website():
  open_web_page()
  log_in()
  do_something_on_logged_in_page()

def log_in():
  fill_in_details()
  press_login_button()
  wait_for_something_that_proves_we_logged_in()

def wait_for_something_that_proves_we_logged_in():
  # Do some/all of the following:
  # - Keep checking for the user name in the top corner?
  # - Keep checking for some logo you know appears only on the logged-in page.
  # - Check logs.
  # Check as frequently as is reasonably possible.
  # Raise an exception if it didn't happen in a very reasonable period of time.
  # If 1s is reasonable, then perhaps 5s? This is a choice that depends on your circumstances.
  checks_passed = False
  timeout = 5
  start_time = time.time()
  while not checks_passed:
    checks_passed = do_the_check()
    if checks_passed:
      break
    if time.time() > start_time + timeout:
      raise Exception("Not on login page: Timed out.")  # Ideally add more details!
    # A tiny sleep within a tight loop may be reasonable if, for instance,
    # you don't want to monopolise the CPU or overuse an API.
    sleep(0.1)
  print("Fantastic success!")

Now, the code I’ve suggested above is far from perfect: it will rapidly become a candidate for helper functions, and it may be entirely unnecessary if your chosen tools/utilities offer their own wait_for_x() functions. There are also many languages/packages that let you do things like await function(). All of this is beyond the scope of this article. My point here is merely to demonstrate (in a fairly language/tool-agnostic manner) how few lines of code are required to throw together a loop that waits when necessary, and to discuss the benefits of this and other methods of waiting, since, done correctly, they all ought to achieve similar things.
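To make the tool-provided option concrete: if you happen to be using Selenium, for example, its WebDriverWait does the polling for you. A minimal sketch (the URL, locator and message below are assumptions about the page under test):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com/login")  # assumed URL
# ... fill in details and press the login button ...
# Polls the DOM (every 0.5s by default) until the condition holds,
# or raises a TimeoutException after 5s:
WebDriverWait(driver, 5).until(
  EC.text_to_be_present_in_element((By.TAG_NAME, "body"), "You have logged in!")
)

With that said, let’s discuss the benefits of the waiting approach over the sleep(1) in the original code.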

More reliable:

The most obvious improvement is that we no longer fail if the login takes 1.1s when we only slept for 1s. Now, you might consider a login taking longer than 1s to be a bug (and that might be fair in your scenario!); we’ll discuss the argument against the sleep(1) in that case in the “Faster” section of this article. For now, let’s just say that 1s is reasonable, but that you actually wouldn’t mind if it took 2s. For a login page that seems a bit long, but it could easily be true for other workflows where there is more going on behind the scenes. In the new code, we flexibly wait for as long as we choose to define, and can pick a really large timeout to accommodate slow test machines, network conditions or workflows.
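Here’s one sketch of how you might do that; the environment variable name and the numbers are my assumptions, not a prescription:

import os
import time
from time import sleep

# Assumed convention: slow CI machines can export WAIT_TIMEOUT_SECONDS=60,
# while the local default stays small. The wait still returns the moment
# the check passes.
DEFAULT_TIMEOUT = float(os.environ.get("WAIT_TIMEOUT_SECONDS", "5"))

def wait_for_something_that_proves_we_logged_in(timeout=DEFAULT_TIMEOUT):
  start_time = time.time()
  while not do_the_check():
    if time.time() > start_time + timeout:
      raise Exception("Not on login page: Timed out after %ss." % timeout)
    sleep(0.1)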

Faster:

As we just touched on, if you have a blanket sleep(1), your test will stall for the full second whether the page actually loads in 0.1s or 0.9s. It may not seem like much, but over large test suites or (God forbid) long tests, each of these extra waits adds up and contributes to slow test times. Good tests should proceed as soon as is reasonably possible, which means you need to CHECK when that is and then unblock processing. It’s worth saying that with sleeps, you are forced to choose a balance between reliability and performance: if the login is expected to take half a second, you could add a sleep(5) to be pretty confident your test would be reliable, but it would be slow. Conversely, sleep(0.6) would be fast but unreliable. Waits offer the best of both worlds.
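To put some (entirely assumed) numbers on that:

# Illustrative arithmetic with assumed numbers: a suite of 300 tests that
# each log in once, where the login really completes in about 0.2s.
tests = 300
blanket_sleep = tests * 1.0          # sleep(1) always stalls the full second: 300s
polling_wait = tests * (0.2 + 0.1)   # ~0.2s real wait + at most one 0.1s poll: 90s
print("Saved per run: ~%.0fs" % (blanket_sleep - polling_wait))  # ~210s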

More diagnosable:

Consider that the original version of the test fails inside do_something_on_logged_in_page(). How do you know what the problem is? You don’t necessarily know whether the test was slow, whether the login failed, or whether the login was successful but the page looks different. By adding extra checks in the wait function, you’re adding information that the person diagnosing the test can use. Let’s consider how the new version of the code treats those three failure types:

  1. Login was slow: We wait for a long time, so firstly we should see considerably fewer of these. When we do see one, we raise an explicit exception stating “Not on login page: Timed out.”
  2. Login failed: In the current form this would look identical to the above; however, you now have a single function where you can add extra debugging that monitors for error messages (see below) to make this more diagnosable. That improved diagnosability is inherited by every test using the wait.
  3. Login page doesn’t look as expected: You’ll at least know this is the case for sure, since the wait succeeded and we have a simple log statement (“Fantastic success!”) to prove it. Hopefully you’ve added good logging to the check_for_some_text_that_should_appear() function to help diagnose what went wrong!

In the second point, I suggested an additional improvement to show the difference between “slow” and “failed”. Let’s see:

while not checks_passed:
  checks_passed = do_the_check()
  if checks_passed:
    break
  if time.time() > start_time + timeout:
    raise Exception("Not on login page: Timed out.")  # Ideally add more details!
  login_errors = get_log_in_errors_if_there_are_any()
  if login_errors is not None:
    raise Exception("Got unexpected error! %s" % login_errors)
  # A tiny sleep within a tight loop may be reasonable if, for instance,
  # you don't want to monopolise the CPU or overuse an API.
  sleep(0.1)
print("Fantastic success!")

Now all that remains is to implement get_log_in_errors_if_there_are_any() so that it returns any errors found on the page. Each time the wait loop checks whether we’ve succeeded and finds that we haven’t, it also checks whether anything explicitly failed. Combined, we’re providing massively more information whenever the test fails, making failures much easier to diagnose.
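As a minimal sketch of what that might look like (the get_page_text() helper and the error strings below are assumptions; substitute whatever your tooling and application actually provide):

# Hypothetical: messages our application is known to show when login fails.
KNOWN_LOGIN_ERRORS = ["Incorrect username or password", "Account locked", "Server error"]

def get_log_in_errors_if_there_are_any():
  page_text = get_page_text()  # hypothetical helper returning the current page's text
  for message in KNOWN_LOGIN_ERRORS:
    if message in page_text:
      return message
  return None  # nothing failed explicitly; the wait loop keeps polling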

More maintainable:

If tests are easier to diagnose, they are necessarily easier to maintain! You won’t spend as long working out what went wrong, which is usually the most time-consuming part of fixing a test. However, as we demonstrated, there is another reason these tests are now more maintainable: since we’ve defined an explicit wait in wait_for_something_that_proves_we_logged_in(), we have a single place to go and modify if things go wrong or change, rather than any number of different sleeps scattered across various tests. This benefit extends further, because once you create this function, you and other people are more likely to use it rather than write your own checks/loops/sleeps.
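As a sketch of where those helper functions might end up (the names and defaults here are my assumptions), the loop generalises to a single reusable wait that any test can call:

import time
from time import sleep

def wait_for(check, timeout=5, poll_interval=0.1, description="condition"):
  # Polls `check` (a zero-argument callable returning True/False) until it
  # passes, raising a descriptive exception on timeout.
  start_time = time.time()
  while not check():
    if time.time() > start_time + timeout:
      raise Exception("Timed out after %ss waiting for: %s" % (timeout, description))
    sleep(poll_interval)

# The login wait then collapses to a one-liner:
def wait_for_something_that_proves_we_logged_in():
  wait_for(do_the_check, description="proof that we logged in")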

More likely to find bugs:

Waits rely on checking that something has happened, whereas sleeps do not. By adding waits instead of sleeps, you stop assuming and start knowing that what you expected to happen actually happened. And after all, isn’t that the point of testing? That means you’re more likely to spot bugs. But yet again, there’s more! In a world where your tests are less flaky because you’re using sensible waits, people are more likely to trust them when something goes wrong, which means you might be able to find flaky issues in the software itself! The tests must be more reliable and trustworthy than the software if you ever hope to catch mysterious issues. I’ll say the word “trust” again. It’s hard-earned, easily lost, and oh so valuable. If your team does not trust your tests, they might as well not exist, and you’d probably be better off testing things manually.
I’ve personally caught a substantial number of important but elusive software issues by being able to run and re-run my tests until they fail and demonstrate a bug. That’s not possible (or is much harder) if the tests frequently fail for other reasons, and that’s assuming you ever spotted the bug in amongst the “expected failures” in your CI runs in the first place!
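For what it’s worth, the re-run loop itself can be trivial; a minimal sketch (the count of 100 is an arbitrary assumption):

# Hammer a single test repeatedly to flush out rare, intermittent bugs.
for attempt in range(1, 101):
  try:
    test_some_website()
  except Exception as error:
    print("Failed on attempt %d: %s" % (attempt, error))
    break
else:
  print("Passed 100 times in a row.")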

Easier to understand:

I hope this one is self-explanatory, but with nicely named functions we now know exactly why we’re “pausing” in the test. Sleeps don’t tell you, and adding comments doesn’t always help. A wait function is explicit about the assumptions the test is making, and there is no room for misinterpretation, since any onlooker can simply follow the code to work out what your intention was. This makes things like code review easier, and opens your tests up to further code-quality feedback.