After Automation Ate Testing

Huseyin Dursun, my previous manager, recently wrote a post, “Automation eats everything …”, in which he pointed out that manual validation has been eliminated and technology companies are no longer hiring engineers exclusively for testing roles. That’s exactly what happened last year in my group, Microsoft Azure. We eliminated test and redefined dev, and now we only have software engineers, who write both product code and test code.

Now that we have eliminated manual validation and all tests are automated, what’s next? My answer is: more automation. Here are a few areas where I see us replacing, now or in the future, other human work in engineering activities with software programs.

1. Automation of writing test automation

Today, test automation is written by engineers. In the future, test automation will be written by software programs. In other words, engineers will write the code that writes test automation. One technique to consider is model-based testing (MBT). The idea of MBT has existed for nearly two decades, and some companies (including teams at Microsoft, my own among them) have tried it and had some success. But by and large it’s very under-used, mainly because other things aren’t there yet: the scale, the demand, the maturity in other engineering activities^[1], the people, etc.
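To make "code that writes test automation" concrete, here is a minimal sketch of the MBT idea in Python. Everything in it is an assumption made for illustration: the two-state MODEL, the FakeConnection stand-in for the system under test, and the function names. It is not taken from any real MBT tool. The engineer writes only the model; the test sequences are generated by walking it.

```python
from collections import deque

# Hypothetical model: state -> {action: next_state}
MODEL = {
    "closed": {"open": "opened"},
    "opened": {"send": "opened", "close": "closed"},
}

def generate_action_sequences(model, start, max_len):
    """Enumerate (breadth-first) every valid action sequence of length 1..max_len."""
    sequences = []
    queue = deque([(start, [])])
    while queue:
        state, actions = queue.popleft()
        if actions:
            sequences.append(actions)
        if len(actions) < max_len:
            for action, next_state in model[state].items():
                queue.append((next_state, actions + [action]))
    return sequences

class FakeConnection:
    """Stand-in system under test (hypothetical)."""
    def __init__(self):
        self.state = "closed"
    def open(self):
        self.state = "opened"
    def send(self):
        pass  # sending does not change the state
    def close(self):
        self.state = "closed"

def run_generated_tests(model, start, max_len):
    """Run every generated sequence and compare the SUT state with the model."""
    executed = 0
    for actions in generate_action_sequences(model, start, max_len):
        sut = FakeConnection()
        expected = start
        for action in actions:
            getattr(sut, action)()
            expected = model[expected][action]
            assert sut.state == expected, (actions, sut.state, expected)
        executed += 1
    return executed
```

Calling run_generated_tests(MODEL, "closed", 3) exercises every valid sequence of up to three actions. Raising max_len or adding a state to the model grows the test suite without anyone hand-writing a new test case.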

Another direction that people have been pursuing for at least a decade is traffic bifurcation. The idea is to run the test instance as a shadow copy of the production instance, duplicate the production traffic to the shadow copy and see whether it handles the traffic in the same way as the production copy does. The bifurcation could be real-time, or done in a record-and-replay fashion. Twitter’s Diffy is the latest work that I have seen in this direction. I guess there is a long way to go, especially when the system under test (SUT) is very stateful and its state depends strongly on the states of other downstream systems.
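The record-and-replay flavor of this idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: both instances are stateless functions from a request dict to a response dict, and the names bifurcate, strip_noise and ignore_fields are made up here. It is not how Diffy or any other tool is implemented.

```python
def strip_noise(response, ignore_fields):
    """Drop fields that are expected to differ (host names, timestamps, ...)."""
    return {k: v for k, v in response.items() if k not in ignore_fields}

def bifurcate(recorded_requests, production, candidate, ignore_fields=()):
    """Replay each recorded request against both handlers and collect mismatches.

    Returns a list of (request, production_response, candidate_result) tuples,
    where candidate_result is either the response or the repr of an exception.
    """
    mismatches = []
    for request in recorded_requests:
        prod_response = production(request)
        try:
            cand_response = candidate(request)
        except Exception as exc:  # a crash in the shadow copy is also a diff
            mismatches.append((request, prod_response, repr(exc)))
            continue
        if strip_noise(prod_response, ignore_fields) != strip_noise(cand_response, ignore_fields):
            mismatches.append((request, prod_response, cand_response))
    return mismatches
```

The ignore_fields parameter hints at why the stateless case is the easy one: as soon as responses depend on state held in the SUT or in downstream systems, deciding which differences are noise and which are bugs becomes the hard part.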

2. Behavioral contract enforcement

Using contracts to define system boundaries and implementing against those contracts is now very common. However, our contracts are mostly about the data schema: the API signature, the structure of the JSON object in the input parameters and response bodies, the RESTful API URL, the WSDL for XML Web Services, file format, response codes and error codes, … These contracts don’t carry much information about behaviors: how the entity will transition through its state machine, whether an operation is idempotent, whether I must call connection.Open() before doing anything else with it, etc. In particular, they say little about behaviors related to time. For example: this asynchronous operation is supposed to complete within N minutes; the system will perform this recurring operation every X days; …

Today the behavioral contracts are mostly written (if written at all) in natural language in design specifications. The enforcement of such behavioral contracts is done in automated test cases. But there can be some fatal gaps in today’s approach. Natural language is ambiguous. Test cases may not cover 100% of what’s written in and implied by the design specification. A more fundamental challenge is that the intention of the automated test cases may drift as time goes by: our test automation code used to be able to catch a code bug, but after test code changes and refactoring, one day it is no longer able to catch the same bug. I don’t think we have a good way to detect and prevent such drift.

I believe the direction is to write the behavioral contract in a formal language, such as the TLA+ specification language created by Leslie Lamport. In a presentation last year, he explained how TLA+ works and how it has been used in some real work. It seems pretty intriguing.
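A far lighter-weight illustration than TLA+ (and not a substitute for it) is to write the behavioral contract as data and check recorded event traces against it. The sketch below is in Python; the CONTRACT layout, the state and event names, and the 60-second bound are all hypothetical. It covers two of the behaviors mentioned above: allowed state transitions (open must come first) and a time bound.

```python
# Hypothetical behavioral contract for a connection-like entity.
CONTRACT = {
    "initial": "created",
    # (current_state, event) -> next_state; anything not listed is forbidden
    "transitions": {
        ("created", "open"): "opened",
        ("opened", "query"): "opened",
        ("opened", "close"): "closed",
    },
    "max_seconds_open": 60,  # assumed time bound between open and close
}

def check_trace(contract, trace):
    """Check a trace of (timestamp_seconds, event) pairs against the contract.

    Returns a list of human-readable violations; an empty list means the
    trace conforms.
    """
    violations = []
    state = contract["initial"]
    opened_at = None
    for timestamp, event in trace:
        next_state = contract["transitions"].get((state, event))
        if next_state is None:
            violations.append(f"t={timestamp}: '{event}' not allowed in state '{state}'")
            continue  # stay in the current state and keep checking
        if event == "open":
            opened_at = timestamp
        elif event == "close" and timestamp - opened_at > contract["max_seconds_open"]:
            violations.append(f"t={timestamp}: stayed open longer than {contract['max_seconds_open']}s")
        state = next_state
    return violations
```

Because the contract is data rather than prose or test code, the same check can run against traces from unit tests, integration tests or production logs, which is one way to keep the intention from drifting when the tests themselves are refactored.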

3. Automation of the analysis

In my previous team, as we made the automated tests faster, we found that the long pole became the time humans spent making sense of the test results. So we developed some algorithms and tools to help us determine 1) whether a failure is a new regression or just a flaky test, and 2) which failed tests are likely to share the same root cause. That was very helpful. In addition, our plan was to get rid of signoffs entirely and let software programs make the call most of the time.
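As a rough illustration of both kinds of analysis, here is a small Python sketch. These are generic heuristics chosen for the example, not the algorithms my team actually built: failures are classified from rerun results and historical pass rate, and grouped by an error message with the variable parts masked out. The function names and the 0.95 threshold are assumptions.

```python
import re
from collections import defaultdict

def classify_failure(rerun_passed, past_pass_rate, stable_threshold=0.95):
    """Classify one failed test.

    rerun_passed: list of booleans, one per rerun on the same build.
    past_pass_rate: fraction of passes on earlier, known-good builds.
    """
    if any(rerun_passed):
        return "flaky"  # same build, different outcome
    if past_pass_rate >= stable_threshold:
        return "likely regression"  # historically stable, now fails consistently
    return "needs human review"

def signature(message):
    """Mask variable parts of an error message so similar failures compare equal."""
    masked = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)  # addresses first
    masked = re.sub(r"\d+", "<n>", masked)                # then plain numbers
    return masked

def group_by_signature(failures):
    """failures: dict of test name -> error message. Returns signature -> test names."""
    groups = defaultdict(list)
    for test_name, message in failures.items():
        groups[signature(message)].append(test_name)
    return dict(groups)
```

Failures that land in the same group are candidates for sharing a root cause, so a human (or the next program in the chain) looks at one representative instead of every failed test.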

4. Automation of the workflow

Ideally, once my code has left my desktop, the entire desktop-to-production journey should be led by software programs with no human participation (except for intervention/override). Today some companies are closer to that dream (e.g. Netflix’s Spinnaker) and some other companies are farther away. Some smaller/simpler products may have already achieved it, but it remains challenging for complex products. Today CI/CD is a lot more common in the software industry than it was ten years ago. But in my eyes, today’s CI/CD tools and practices are more like the DHTML and AJAX things of the early 2000s. The jQuery/Bootstrap equivalent in CI/CD has yet to come.

5. Integration test in production

Besides replacing more human work with software programs, there is one more thing that we can do better in test engineering: eliminate the test environment per se and perform all integration tests in production^[2]. Integration testing is an inevitable^[3] phase between passing unit tests and getting exposed to real customers in production. Traditionally in integration tests, the SUT and most of its dependencies run in a lab that is physically separated from the production instances. There are several big pain points in that approach: a) fidelity^[5], b) capacity, c) stability, d) support^[6]. Doing integration tests in production will make all these problems disappear. Needless to say, there are some challenges in this, mainly regarding product architecture, security and compliance, isolation and protection, differentiation and equality, monitoring and alerting, etc. I guess next time I will write a post about “The Design Pattern of Integration Testing in Production”.

[1] For example, a team should invest in other more fundamental things like CI/CD before investing in building the model and doing MBT.
[2] “Testing in production” is a highly overloaded term. Some people use it to refer to A/B testing. Sometimes it means a late-stage quality gate where the new version is rolled out to a small % of production and/or exposed to a small % of customers. “Integration test in production” is different in two ways: i) it’s for low-quality code that is still under development, ii) it doesn’t get exposed to customers.
[3] There are some strong opinions against integration tests. Lines like “integration test is a scam” help highlight some valid points. But practically we shouldn’t throw the baby out with the bath water. I am a strong believer in “pushing to the left” (meaning: put more tests in unit tests and find issues earlier), but I also believe integration tests have their place in the outer loop^[4]. Even though in hindsight it might be very obvious that some bugs could have been caught by unit tests, it was a totally different thing when these bugs were unknown unknowns.
[4] Outer Loop is defined as the stage between when an engineer has completed their check-in and when it has rolled out to production. Depending on the product, this could mean App Store deployments (Mobile) or worldwide exposure (Services and modern Click to Run applications).
[5] The lab is different from production in many ways: configurations, security settings, networking, data patterns, etc. Those differences often hide bugs. The lab doesn’t have all the hardware SKUs that production has, which significantly limits how much hardware-related testing we can do in the lab (e.g. drivers, I/O performance, etc.).
[6] Let’s say the SUT depends on another service, Foo. Traditionally in the integration test, we also have Foo instance(s) running in the lab. When the lab instance(s) of Foo have any issue, the SUT team will need the Foo team to help check/fix it. But that would be a lower priority for the Foo team, compared to issues in the live site (production). Plus, the SLA (service level agreement) for lab instances is usually less than 24×7, but we want our integration tests to run all the time.

zhengziying.com