Too Many Retries Erode the Quality

In one of the teams that I have worked in, our code was full of retries. To name a few:

When the code was looking for a storage account, it would retry up to 30 times when the storage account didn’t exist. The whole retries took 13 minutes.
When the code was looking for a setting value which didn’t exist, it would also retry 30 times which took 13 minutes;
When the code was trying to read an XML file from a file share, it would retry up to 30 times for 13 minutes when the XML file didn’t exist.

That was too much. When a system has too many retries for too many times all over the place, the worst consequence isn’t that it wastes time. The worst consequence is that it erodes the quality of the system in an irreversible way. I have seen some groups where people were used to simply adding more retries and retry for more times when they run into intermittent issues. Lots of genuine code bugs are intermittent in its nature. Adding retries will cover these bugs up. But over the time, as the code gets reused, as the system scales, the only way to maintain the same level of availability is to add more and more retries. It’s like only reinforcing the siding of the house when its frame is being eaten by termites. Adding retry is simpler and less costly than getting to the bottom of intermittent issues. It takes some courageousness to not go down the easier (but irreversible and poisonous) path.

My rule of thumbs for retry is that we should not retry, or only retry for very few times, in the following situations:

AuthN/AuthZ failure
HTTP 404 or similar error code indicating resource not found
HTTP 400 or similar error code indicating bad input
Timeout

In most of the time, if a resource is not found, it’s not found. Retry won’t help. The resource is not going to show up all of a sudden several minutes later^[1]. In general I also avoid retry on timeout error. Timeout indicates something has become very slow. Maybe the database is too busy. Maybe the frontend is under high load. Retry on timeout will likely make things worse.

The situations where I think it’s necessary and useful to do some retries are:

Network glitches, such as “System.ServiceModel.CommunicationException: The socket connection was aborted”
HTTP 500 or similar error code that indicates a server side error
HTTP 503 or similar error code that certain systems use in throttling. Sometime the response will include some guidance for the client: whether the client may retry, how much time should the client backoff for, …

It worth pointing out that retry should be mainly used to smooth out glitches^[2], rather than for handling outage. If a service/resource has problems for more than a few seconds, it’s an outage. When there is an outage, the caller should treat it like an outage: mark the operation as failure, write necessary logs, tell its caller that things have failed. Usually I frown when I see a code spend more than a few seconds to retry.

Another often debated topic about retry is where shall we put the retry. My take is: either as close as possible to where the failure happens, or as close as possible to the originator of the bigger operation. The benefit of putting retries close to where the failure happens is that you won’t have to worry about idempotency and side-effect too much when your code retry. But you have to be very careful with the footprint of the retry. Excessive retries in the lower level code can be amplified when it’s placed in the whole system.

Last but not least: do not have a default retry count and interval in your code base. In the examples provided above, they all retried 30 times for 13 minutes. That’s because they all use the internal Retry lib with default values for retry count (which was 30) and interval length. To make it worse, it spreads when people copy and paste code (they do!). Not having a default retry count and interval length will enforce the developers to make conscious choice every time about how many times they want the code to retry.

[1] Retry is different than polling. We use polling when we expect to see changes. It’s OK to spend several minutes or even longer to wait for things to happen.
[2] People also use other terms like “hiccups”, “blips”, “transient error”.

zhengziying.com

zhengziying.com

Too Many Retries Erode the Quality

Leave a comment Cancel reply