Flaky is a test tool that Box shared with the community about a year ago. In my opinion and experience, the tool solves the problem it set out to solve, but it’s a poisonous tool for any engineering organization that wants sustainable success. Here is the comment that I left on their blog post:
I was in the same spot in the past: a few years back, we had tests that failed intermittently, and sometimes the causes were external to our component, in other services owned by folks on another floor or in another building. Naturally, we built a similar way to rerun the failed tests automatically.
Later, I found a problem with this approach: it discouraged people from investigating deeply, and it hid bugs in our own component. Discourage: as long as the rerun passed, no one would look at the failure in the first run. Hide bugs: some product bugs are intermittent by nature (rather than making some functionality fail 100% of the time); there were also test automation bugs that genuinely caused intermittent test failures (lack of test repeatability, or test automation reliability issues). It was also a slippery slope: over time, the number of reruns increased (since no one spent time looking at why the tests failed on the first attempt), which increased the total duration of the test pass.
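To make the "discourage" and "hide bugs" points concrete, here is a minimal Python sketch of a rerun wrapper. It is not Flaky's implementation, nor the tool we built; `run_with_rerun` and `make_intermittent_test` are hypothetical names made up for illustration, and the "test" is a stand-in that fails only on its first call.

```python
import itertools


def run_with_rerun(test, max_attempts=3):
    """Run `test` until it passes or the attempts run out (hypothetical helper).

    Returns (passed, failures), where `failures` holds the exceptions
    raised by the attempts that failed.
    """
    failures = []
    for _ in range(max_attempts):
        try:
            test()
            return True, failures
        except AssertionError as exc:
            failures.append(exc)
    return False, failures


def make_intermittent_test():
    """Build a stand-in test that fails on its first call and passes afterwards."""
    calls = itertools.count(1)

    def test():
        attempt = next(calls)
        assert attempt > 1, f"intermittent failure on attempt {attempt}"

    return test


passed, failures = run_with_rerun(make_intermittent_test())
print(passed)         # True: the test pass is reported as green
print(len(failures))  # 1: a real first-attempt failure that nobody has to look at
```

The report only says "passed". The first-attempt failure is still there, and every extra attempt adds to the duration of the test pass.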
Seeing those problems, I stopped the rerun. I told the team that because our component is deterministic by nature (unlike fuzzy-logic products such as face recognition, speech, relevancy, machine learning, …), our tests should be deterministic and highly repeatable. I forced the team to investigate every intermittent failure. We found a lot of genuine issues in the product code as well as in the test automation, and we fixed them. I also tenaciously pushed all the teams (not only my own team, but also folks on other floors and in other buildings) to improve the design and architecture so that it was easier to write repeatable test automation. It paid off pretty well. About a year after we stopped the rerun, the share of flaky tests had dropped significantly (from more than 5% of the entire test automation to <0.5%). The total duration of the test pass dropped. People were less frustrated, since they no longer had to deal with intermittent failures all the time.
In short, having a tool to automatically rerun failed tests is poisonous. It makes life easier now, but it sends your engineering in the wrong direction.