Why I Chose to Have a Much Smaller Team

I used to be a M2 and manage a 30 people team. But in 2014, I became a M1 and only manage 3 people. It was a deliberate choice I made. I have been asked many times why I made such a choice, since such a choice seemingly hurts my career. I guess it’s a good idea to write the answer down if it’s asked repeatedly.

Here it is:

Azure shifted to the combined engineering model in 2014. Before that shift, we were in a functional model: I was the test manager and I had 30 people, including SDETs and test leads. I reported to a Director of Test, who reported to a VP of Test. The dev manager that I partnered with had 60 people, including SDEs and dev leads. He reported to a Director of Dev, who reported to a VP of Dev. After the shift, all individual contributors were now SWEs (Software Engineer). Their managers were Engineering Managers, who reported to Group Engineering Managers, who reported to a Director of Engineering, or a VP of Engineering.

Back in 2014, when the leadership pulled the trigger to make the combined engineering shift, I sat down with my dev manager. I told him that in my view, the best way to implement that change is to simply merge our two teams, put the SDEs and SDETs who used to work on the same projects and components under the same engineering manager, who might be the previous dev lead, or the previous test lead. That way gives us the best continuity, minimum tribal knowledge loss and minimum disruption to the projects.

The question was where I would be after the merge.

First to decide is whether I would remain in the team or not. For the benefit of the team, I chose to stay. Two reasons. For one, I had quite some tribal knowledge, as I was quite a hands-on manager (but not micro managing). If I leave right after the merge, those knowledge will be lost. For two, my staying in the team would demonstrate my commitment and support to this paradigm shift of combined engineering. That will help the team maintain the morale level and keep attrition low. These all serve the team’s best interest.

Now I had chosen to stay. The next question: what should be my team size.

I applied two principles. They are, not in any particular order: first, only one chef in a kitchen; second, less layers. If I remain in the same team and continue to have a large team, there would be only two choices: a) I remain as a “sibling” as the dev manager (now group engineering manager), b) I report to the dev manager. The choice (a) violates the first principle. The choice (b) violates the second principle, since managing a large team means I would have a couple leads. Therefore, in order to stay in the team and not violate my two principles, I should have a smaller team.

Last question: why only 3 engineers.

That’s at the low end of how many people a M1 usually manages. After the combined engineering shift, the top-down guidance was around 10-12 engineers for each M1. I went for a lower number for two reasons. First, I wanted to lower my managerial work load since the combined engineering was a new thing to me. That’s similar to that when I first came to US in 2010, I started with managing only one person. I picked up more after I got familiar with the new country, new culture and new working environment. Second, these 3 engineers were all top performers. I wanted to see what it is like to have a Navy SEAL team: small in number, but each team member is highly capable. That was intriguing as I never had such a chance before.

That’s the whole story.

I left that team in a year. By the time I left, knowledge had been shared, everybody had settled in their new roles and the transition to the combined engineering model was completed successfully a while back. Since then, I have been focused on a different area under the same VP. The choice I made in 2014 did disrupt my career growth a little bit, but I got back on track soon. In my current role, I still only manage 3 engineers. That’s mainly because it’s the optimal resourcing model for the problem space that I am working on. That will be a story for another day.

You Get Paid to Make It

Two months ago, the week 7 game between Seattle Seahawks and Arizona Cardinals ended in a 6-6 tie. During the overtime, the kickers from both team missed game-winning field goals: Arizona kicker Chandler Catanzaro missed a 24-yarder and Seattle kicker Stephen Hauschka missed a 28-yarder.

In the post-game press conference, the two head coaches, Arizona’s Bruce Arians and Seattle’s Pete Carroll, were asked about their kickers. Here is what they said:

  • Bruce Arians: “Make it. This is professional, this ain’t high school, baby. You get paid to make it.
  • Pete Carroll: “[Hauschka] made his kicks to give us a chance and unfortunately he didn’t make the last one. He’s been making kicks for years around here … but he’s gonna hit a lot of winners as we go down the road here. I love him and he’s our guy.

After the game, people were contrasting the two head coaches’ words. They think Pete Carroll demonstrated better leadership style, regarding how to respond when a member of my team makes a mistake. In one article, the author appraised Pete Carroll for having his kicker’s back.

The author said: “By choosing to focus on the positive, skillfully sharing your own personal experience, or simply reminding the person that everyone has a bad day, you do everything in your power to help that person recover.” The author then asked a rhetorical question: “The question is, can these two kickers come back from their mistakes?

Now, two months later, we have the answer: Arizona kicker Chandler Catanzaro recovered, Seattle kicker Stephen Hauschka hasn’t.

Today, in week 16, Seattle Seahawks met Arizona Cardinals again. They tied in 31-31 when there was only one minute left in Q4 and now Seattle was going to kick the extra-point. This time, Seattle kicker Stephen Hauschka missed again. A moment later, Arizona kicker Chandler Catanzaro hit a field goal and helped Cardinals win the game by 34-31. That’s a perfect redemption of Arizona kicker Chandler Catanzaro: the same rival, scores tied, facing a game-winning field goal. This time he made it.

Bruce Arians was right: this is professional, this ain’t high school. We get paid to make it. We are all grown ups. Why need to sugar-coat it?

What Digital Camera and Public Cloud Have in Common

This is a picture featured in the March 2016 issue of the Outdoor Photographer magazine:

2016-04-07-panorama

To take such a high resolution panorama picture, traditionally it needs a high-end large format or at least medium format camera. But this one was taken by using a handheld Canon EOS 1Ds Mark III with a Canon EF 24-105mm ƒ/4L IS USM lens at 50mm, which is, in layman’s terms, a mass market digital camera (although 1Ds Mark III is at the expensive end of that range). The composited panorama consists of five vertical images captured at ƒ/13 and ISO 400. We are seeing more such things in the digital photography world. Another example is the High Resolution mode of the Olympus OM-D E-M5 II camera. E-M5 II is a commodity camera: the sensor is smaller (Micro 4/3) and has only 16 megapixels. But it can shoot a 40 megapixel picture by shifting the sensor in half-pixel steps and capturing eight images over a period of one second. The moral here is, with the help of software, commodity digital cameras can achieve what could only be achieved by high-end cameras.

Replace “digital cameras” with “computer hardware” and that will be the spirit of the cloud computing. Cloud platforms, especially the public clouds like AWS and Azure, use commodity hardware to achieve what could only be achieved by high-end super computers and expensive networking devices. In this analogy, IBM’s mainframes and EMC’s storage systems are the large format and medium format cameras. Cloud platforms stitch together a bunch of commodity computers with the help of software, just like how George Lepp stitched together the pictures shot by a household DSLR, with the help of software, to produce a high resolution panorama pictures.

Having said that, in the digital photography world, there are still situations where we have to use high-end cameras. Take the panorama picture of the balloons as an example. If the balloons were some other fast moving objects, such as birds and buffalos, George Lepp’s technique wouldn’t work. He would have had to use a large or medium format camera to freeze all the motions in one single shot. We have seen similar situations in cloud platforms. There are still some situations where the computing has to be done in a single high-end computer. That’s why even in AWS and Azure, there are very high-end configurations of machines: Azure G-Series VMs come with up to 32 vCPUs, 448 GB of memory and 6.59 TB of local SSD space. The largest AWS EC2 instances are in the same neighborhood. However, local network speed is getting faster and faster: the best Azure VM type now supports 20 Gigabit network and AWS EC2 supports up to 10 Gigabit. The technology of 40 Gb and 100 Gb is already ready and 400 Gb is on the horizon. With faster Ethernet speed, more workloads can be scaled out, which wasn’t possible in the past due to the limitation of network speed.

When Code Review Becomes Annoying

Overall speaking, code review is useful.

From time to time, my reviewers did catch some bugs that I overlooked, or point out options that fell into my blind spot. Sometimes I learn new things in code reviews. For example, last year a reviewer suggested me use ThreadLocal<T>, which indeed would simplify my code a lot. Some code reviews aren’t the code review per se, when it’s more about a FYI: “Hi, I have written this code and I like you to be aware of and get familiar with it.” That happens when you are the only person in the team truly understand that piece of legacy code. Although such FYI type of code review isn’t that much helpful to the author, it’s still good to the team.

But sometimes code review can become annoying, especially when people spend time on things that (in my opinion) don’t really matter. For examples:

I understand there are differences, often very subtle or trivial though, between string vs. String, readonly static vs. const, etc.. But those differences don’t do any real harm. Explicit declaration, such as Stopwatch stopwatch = Stopwatch.StartNew(), doesn’t make the code harder to read[1] than var stopwatch = Stopwatch.StartNew(). String.Join doesn’t make the code slower than using string.Join. Putting using outside of the namespace block doesn’t make the code harder to work on. In addition, by default all versions of Visual Studio put using outside of namespace in the code it generates.

I really don’t like to spend time on those things in code reviews. I don’t think they matter to product quality and engineers’ productivity. There are so much more things that we wish we had more time to spend on to improve quality and productivity. Debating on those things are like debating whether to put one space after a period or two.

What people should do is to make sure that their team has collectively decided on what StyleCop[2] rules to turn on in their code base and get included in the build. Once that’s decided and has taken effect, there will be no debate any more: if the rules are violated, it will be a build error and we don’t submit code review until it at least passes build. Simple and clear.


[1] Readability is both objective and subjective. There is no doubt about that a line longer than 120 or 150 characters is hard to read and single letter variable names are hard to read. But whether Stopwatch stopwatch = Stopwatch.StartNew() is harder to read than var stopwatch = Stopwatch.StartNew(), that’s really personal.
[2] Or the equivalent of StyleCop in other languages.

Too Many Retries Erode the Quality

In one of the teams that I have worked in, our code was full of retries. To name a few:

  • When the code was looking for a storage account, it would retry up to 30 times when the storage account didn’t exist. The whole retries took 13 minutes.
  • When the code was looking for a setting value which didn’t exist, it would also retry 30 times which took 13 minutes;
  • When the code was trying to read an XML file from a file share, it would retry up to 30 times for 13 minutes when the XML file didn’t exist.

That was too much. When a system has too many retries for too many times all over the place, the worst consequence isn’t that it wastes time. The worst consequence is that it erodes the quality of the system in an irreversible way. I have seen some groups where people were used to simply adding more retries and retry for more times when they run into intermittent issues. Lots of genuine code bugs are intermittent in its nature. Adding retries will cover these bugs up. But over the time, as the code gets reused, as the system scales, the only way to maintain the same level of availability is to add more and more retries. It’s like only reinforcing the siding of the house when its frame is being eaten by termites. Adding retry is simpler and less costly than getting to the bottom of intermittent issues. It takes some courageousness to not go down the easier (but irreversible and poisonous) path.

My rule of thumbs for retry is that we should not retry, or only retry for very few times, in the following situations:

  1. AuthN/AuthZ failure
  2. HTTP 404 or similar error code indicating resource not found
  3. HTTP 400 or similar error code indicating bad input
  4. Timeout

In most of the time, if a resource is not found, it’s not found. Retry won’t help. The resource is not going to show up all of a sudden several minutes later[1]. In general I also avoid retry on timeout error. Timeout indicates something has become very slow. Maybe the database is too busy. Maybe the frontend is under high load. Retry on timeout will likely make things worse.

The situations where I think it’s necessary and useful to do some retries are:

  1. Network glitches, such as “System.ServiceModel.CommunicationException: The socket connection was aborted
  2. HTTP 500 or similar error code that indicates a server side error
  3. HTTP 503 or similar error code that certain systems use in throttling. Sometime the response will include some guidance for the client: whether the client may retry, how much time should the client backoff for, …

It worth pointing out that retry should be mainly used to smooth out glitches[2], rather than for handling outage. If a service/resource has problems for more than a few seconds, it’s an outage. When there is an outage, the caller should treat it like an outage: mark the operation as failure, write necessary logs, tell its caller that things have failed. Usually I frown when I see a code spend more than a few seconds to retry.

Another often debated topic about retry is where shall we put the retry. My take is: either as close as possible to where the failure happens, or as close as possible to the originator of the bigger operation. The benefit of putting retries close to where the failure happens is that you won’t have to worry about idempotency and side-effect too much when your code retry. But you have to be very careful with the footprint of the retry. Excessive retries in the lower level code can be amplified when it’s placed in the whole system.

Last but not least: do not have a default retry count and interval in your code base. In the examples provided above, they all retried 30 times for 13 minutes. That’s because they all use the internal Retry lib with default values for retry count (which was 30) and interval length. To make it worse, it spreads when people copy and paste code (they do!). Not having a default retry count and interval length will enforce the developers to make conscious choice every time about how many times they want the code to retry.


[1] Retry is different than polling. We use polling when we expect to see changes. It’s OK to spend several minutes or even longer to wait for things to happen.
[2] People also use other terms like “hiccups”, “blips”, “transient error”.

After Automation Ate Testing

Huseyin Dursun, my previous manager, recently wrote a post “Automation eats everything …”, in which he pointed out that manual validation has been eliminated and technology companies are no longer hiring engineers exclusively for testing role. That’s exactly what happened last year in my group, Microsoft Azure. We eliminated test and redefined dev and now we only have software engineers, who write both product code and test code.

Now we have eliminated manual validation and all tests are automated. What’s next? My answer is: more automation. Here is a few areas that I see where we are/will be replacing other human work in the engineering activities with software programs.

1. Automation of writing test automation

Today, test automations are written by engineers. In the future, test automation will be written by software programs. In other words, engineers will write the code which writes test automation. One technique to consider is the model based testing. The idea of MBT has existed for nearly two decades and some companies (including teams in Microsoft, including my own teams) have tried and have got some successes. But by and large, it’s very under-used, mainly because other things aren’t there yet, like the scale, the demand, the maturity in other engineering activities[1], the people, etc..

Another direction that people have been pursuing for at least a decade is the traffic bifurcation. The idea is to run the test instance as a shadow copy of the production instance, duplicate the production traffic to the shadow copy and see if it handles it in the same way as the production copy does. The bifurcation could be real time, or more in a record-and-replay fashion. Twitter’s Diffy is the latest work that I have seen in this direction. I guess there is a long way to go, especially when the SUT is very much stateful and its state has strong dependencies with the states in other downstream systems.

2. Behavioral contract enforcement

Using contracts to define system boundary and doing implementation against contracts is now very common. However, our contracts are mostly about the data schema: the API signature, the structure of the JSON object in the input parameters and response bodies, the RESTful API URL, the WSDL for XML Web Services, file format, response codes and error codes, … These contracts don’t carry much information about the behaviors: how will the entity transit through its state machine, whether an operation is going to be idempotent, whether I must call connection.Open() before doing anything else with it, etc.. In particular, the behaviors related to time. For example, this asynchronous operation is supposed to complete within N minutes; the system will perform this recurring operation every X days; …

Today the behavioral contracts are mostly written (if ever written) in our natural languages in design specifications. The enforcement of such behavioral contracts are done in automated test cases. But there could be some fatal gaps in today’s way. Our natural language is ambiguous. Test cases may not cover 100% what’s written in and implied by the design specification. A more fundamental challenge is that the intention of the automated test cases may drift away as time goes by, meaning: our test automation code use to be able to catch a code bug, but after test code changes and refactoring, one day it will no longer be able to catch the same bug. I don’t think we have a good way to detect and prevent such drift.

I believe the direction is to write the behavioral contract with some formal language, such as the TLA+ specification language created by Leslie Lamport. In a presentation last year, he explained how TLA+ works and how it’s used in some real work. It seems pretty intriguing.

3. Automation of the analysis

In my previous team, as we made the automated tests faster, we found that now the long pole became the time human spent to make sense of the test result. So we developed some algorithms and tools to help us: 1) differentiate whether a failure is a new regression, or just a flaky test, 2) which failed tests are likely to share the same root cause. That was very helpful. In addition, we plan was to totally get rid of signoffs and let the software programs to make the call most of the time.

4. Automation of the workflow

Ideally once my code has left my desktop, the entire desktop-to-production journey should be led by software programs with no human participation (except for intervention/override). Today some companies are closer to that dream (e.g. Netflix’s Spinnaker) and some other companies are farther away. Some smaller/simpler products may have already achieved it, but it remains a challenging thing for complex products. Today CI/CD is a lot more common in the software industry than ten years ago. But in my eyes today’s CI/CD tools and practices more like the DHTML and AJAX things circa early 2000’s. The jQuery/Bootstrap equivalent in CI/CD has yet to come.


5. Integration test in production

Besides replacing more human work with software programs, there is one more thing that we can do better in the test engineering: eliminate the test environment per se and perform all integration tests in production[2]. Integration test is an inevitable[3] phase between passing unit tests and getting exposed to real customers in production. Traditionally in integration tests, the SUT and most of its dependencies runs in the lab that are physically separated from the production instances. There are several big pain points in that approach: a) fidelity[5], b) capacity, c) stability, d) support[6]. Doing integration tests in production will make all these problems disappear. Needless to say, there are some challenges in this, mainly regarding product architect, security and compliance, isolation and protection, differentiation and equality, monitoring and alerting, etc.. I guess next time I will write a post about “The Design Pattern of Integration Testing in Production“.


[1] For example, a team should invest in other more fundamental things like CI/CD before investing in building the model and doing MBT.
[2] “Testing in production” is a highly overloaded term. Someone uses it to refer to A/B testing. Sometime it means a late stage quality gate where the new version is rolled out to a small % of production and/or exposed to a small % of customers. “Integration test in production” is different on two things: i) it’s for low quality code that is still under development, ii) it doesn’t get exposed to customer.
[3] There are some strong opinions against integration tests. The lines like “integration test is a scam” help highlight some valid points. But practically we shouldn’t throw the baby out with the bath water. I am strong believer of “pushing to the left” (meaning: put more tests in unit test and find issues earlier) but I too believe integration test has its place in the outer loop[4]. Even though in the hindsight it might be very obvious that some bugs could have been caught by unit test, it would be a totally different thing when these bugs were unknown unknown.
[4] Outer Loop is defined as the stage between when an engineer has completed their check in and when it has rolled out to production. Depending on the product, this could mean App Store deployments (Mobile) or worldwide exposure (Services and modern Click to Run applications).
[5] Lab is different than production in many ways: configurations, security settings, networking, data pattern, etc. Those differences often hide bugs. Lab doesn’t have all the hardware SKUs that production has, which significantly limits how much we can do in the lab in hardware related testing (e.g. drivers, I/O performance, etc.).
[6] Let’s say the SUT depends on another service Foo. So traditionally in the integration test, we also have Foo instance(s) running in lab, too. When the lab instance(s) of Foo has any issue, the team of SUT will need the team of Foo to help check/fix. But that would be a lower priority for the team Foo, compared to the issues in the live site (production). Plus, the SLA (service level agreement) for lab instances is usually less than 24×7, but we want our integration tests to run all the time.

The Combined Engineering in Azure: A Year Later

Last year in Windows Azure[1], we merged dev and test[2] and switched to the combined engineering model[3].

Recently I have been asked quite a few times about my view of that change. My answer was: it solved a few chronic problems in the traditional dev+test model. It solved these problems fairly easily and naturally. If we didn’t do the combined engineering change, these problems would still be here today:

1. Quality is everyone’s responsibility

We always said: quality is owned by everybody, not just the test team. In the reality, there were always some gaps, more or less. Some developers still had the mentality of “the test team would/should find the bug for me”. Now there is no test team. Software engineers can count on nobody but themselves.

2. Improve testability

Although nobody disagreed with the importance of testability design, often times testability is treated as relatively lower priority by the developers in the traditional dev+test model. When they were under the time pressure, they naturally get the feature implemented first and it took long time for some testability requirements getting honored. The worse was that the developers didn’t have the sense of testability in their mind when they design and write code. Quite some testability issues were found in pretty late stage when it’s too costly/risky to change the design and code.

Now writing test code is a part of the software engineer’s job. They have much strong incentive to improve testability because it will make their own work easier. Plus, they truly learn the lessons of poor testability designs because it hurts themselves.

No more begging to the developers to add an API for my test automation to poll to replace a hard-coded Sleep(10000).

3. Push tests to the left

I had hard time to convince some developers to write more unit tests. This is a true story: a dev in my team wrote a custom lock. I found that there was little unit test of that lock. I asked the dev. He told me he think the scenario tests[4] has already covered it pretty well. I didn’t know what to say. Yes we had code coverage data for unit test. But the hall of shame can only go this far.

Now developers (software engineers) own all the tests. Now they have all the incentives to push the tests to the left[5]: put as much as tests in unit test, because it’s fast, easy to debug and nearly free of noises. The integration test is obviously a less favorable place to put the test: it’s slow, more hassle to debug and more noisy.

4. Hiring and retention

That was really, really, really a challenge all the time. Most college graduates prefer SDE than SDET[6]. Partly because they had little exposure to what the SDET job is about. Partly because they are concerned with the “test” tag. Valid concern. Among the industry candidates, many of those who came from software testing background usually didn’t meet our requirement of coding and problem skills, because in many places outside Microsoft, test engineers were mainly doing what the STE[7] used to do in Microsoft. We ended up having to put a lot of effort in convincing developers from other companies to join Microsoft as SDET, which wasn’t an easy sell.

Now, voila, problem solved. There is no more “test” tag. Everyone is “Software Engineer”. No more SDET wants to switch to SDE to get rid of the “test” tag, because there is no more SDET.

5. Planning and resourcing

We used to do our planning based on dev estimate only. It was understandable. It’s much messier to juggle if every work item has two prices (dev estimate and test estimate). In planning, we assume that for every work item, the test estimate is proportional to the dev estimate (e.g. 1:2, which came from our total test:dev ratio) and we believe the variances in each individual work item will average out. It worked OK most of the time. But there were several times where such model cause significantly under-funded test resources and caused crunch in late stage in the project.

Now when engineering managers and software engineers provide work estimate, the price tag has already included both dev estimate and test estimate. Nobody would underestimate the test cost because they would have to pay for it anyway.


To summarize, that’s the power of the roles and responsibility model. In the past, I was the cook at our home and my wife usually do the cleanup. She always complained that I made the stove and counter-top very messy. Later we made a change: I do both cooking and cleanup (and she took some other housework from me). Then all of sudden I paid a lot of attention to not make kitchen messy because otherwise it would be myself that spend time to clean it up.

p.s. Of course there is also the downside of this change. That would be another topic. But the net is a big plus.


[1] I know I should have called it “Microsoft Azure” rather “Windows Azure”. It’s just the old habit. For us who joined Azure in its early years, we still call it Windows Azure.
[2] Before the merge, we had dev team and test team. Take myself as an example. I was the test manager leading the test team, partnering with the dev manager who led the dev team. My test team was about half of the size of the dev team. In the shift to combined engineering model, we simply merged and became one engineering team of about 70+ people.
[3] Strictly speaking, our shift to the combined engineering did not only include merging the dev and test, but also redefined the role of PM, which now lean toward the market, customer and competition more than the internal engineering activities, and enlarged the role of the new “software engineer” role (which started from the sum of original dev+test) by adding more DevOps responsibilities.
[4] We didn’t differentiate these terms: scenario test, functional test, e2e test, integration test. Our dev did help write quite some functional/scenario tests when test team was running tight. But by and large, the test team owned everything after unit test.
[5] We usually draw a timeline on the whiteboard, from left to the right: the developer changes code in his local repo -> unit test -> other pre-checkin tests -> checkin -> integration tests -> start production rollout -> rollout completed. So “push tests to the left” means push them into the unit test.
[6] SDE = Software Development Engineer. SDET = Software Development Engineer in Test (aka “tester”).
[7] STE = Software Test Engineer. Microsoft had this job title until 2005/2006. STE’s main job responsibility was writing test spec, enumerating test cases, execute test cases (mainly manually), exploratory tests, etc.. Many STEs had very good analytical skills, knowledgeable of our product and good soft skills, but relatively weak in coding, debugging, design, etc..

If You Pay Later, You Pay More

One of my previous managers used to tell us “You either pay now or pay later. If you pay later, you pay more”. Years have passed and I have seen how true it is for an engineering team.

The dilemma is: the one who chooses not to pay now may not be the same one who pays later. Why would I pay now, so that someone else wouldn’t pay more later? It’s natural thing that we make selfish choices, unless there is something to counter balance it.

Here is an example, a real live site incident that happened recently. Our customer couldn’t start the virtual machine from the management portal. The cause was in the following code, it threw NullReferenceException because roleInstance.Current was null:

foreach (RoleInstance roleInstance in this.RoleInstances)
{
    int currentUpdateDomain = (int)roleInstance.Current
                                               .Container
                                               .ServiceUpdateDomain;
    //...
}

When the developer pressed “.” after roleInstance.Current, he probably didn’t pause and ask himself: would the Current always be not null? He probably didn’t spend time to read the related code a bit to find out and put extra code there for safety (e.g. “if(roleInstance.Current!=null)“). If he did all these (the pause, the code reading and the additional code), he would be slower. But that would have saved so much more associated with the live site incident: people’s time spent on investigate the incident, the time to rollout the hotfix, and the time to handle the (unhappy) customer. But those time is not the developer’s time. By cutting some corners, he probably got a few more work items done. Thus, he probably got a somewhat better performance review and promoted a bit sooner. Then he moved on and leave the team behind to “pay later but pay more”.

Our performance review model doesn’t help, either. In the annual review cycle, we barely can hold people accountable for something they did more than a year ago. Once bonus are paid and promotions are done, unless it’s something really bad (like causing the subprime crisis), we are not going to take the bonus back or revert the promotion.

Among the things that we can do, one thing I did was to keep my team members’ ownership unchanged[1] for long time (e.g. two years, if not more) and told them so upfront. The benefits are:

  • By fixing people on the same thing for longer time, the one who chooses not to pay now would be more likely the same person who will pay later (and pay more).
  • By telling them so upfront, it does not only counter-balances the shortsighted cutting-corners, but also encourages the right behavior and investments in their areas that will lead to long-term successes. It’s like if I know I am going to live in this house for at least five years, I will spend the first year cleaning up the weeds and fixing the irrigation system in the backyard, then plant the plum trees in the second year and keep fertilizing and take good care of it in the third and fourth year, so that from the fifth year onward, I get to eat the sweat plums while enjoying the sun and breeze in my backyard.

That’s why re-org has a downside. In some companies where a re-org happens every 18-24 months, although the organizations get to more frequently optimize their structure and alignment, it also sets a norm that discourage long-term investments and successes: why bother planting the plum trees if I know I am going to move to another house every 18-24 months?

As Reid Hoffman said: “Good managers know that it’s difficult to achieve long-term success without obtaining long-term commitments from employees.”


[1] I usually did it in a mixed way: some fixed ownership + some flexibility of changing projects once a while.

They Are Not Tech Companies

I was listening to a podcast lately and they were talking about a tech startup, Wevorce, which disrupts the divorce market:

A system that works by attracting couples to the service, collecting data on them through an initial survey, and using their results to classify each person as a particular divorce “archetype.”

Then, the Wevorce team of counselors, family planners, and lawyers steps in. They use their research, data, and training to mediate at predictable moments of tension — a processing system kind of like TurboTax or H&R Block. 

How is that a tech company? What is the tech here? Is filling an online survey considered “using technology”? To me, that is a law company. A law startup. Not a tech start up. I fill a survey form when I visit a physical therapist for the first time. If that form is done online and they have an algorithm to analyze my profile to recommend the best therapist and treatment plan, is the hospital considered a tech company? Of course not.

To me, tech companies are those who advance the technologies and make innovations in technology. If a company makes innovation in another trade, using the help from the latest technologies, it’s not a tech company. For example, Blue Apron is not a tech company. They are a meal kit company. It is still a great startup, a great business innovation. I am a customer and I like it.  

For the same reason, Instacart, of which I am a customer too, is not a tech company either. They do provide a new experience of buying groceries. But at the end of the day, they are a grocery store. An online grocery store. Putting a storefront online and providing an app for customers to place order doesn’t make it a tech company. ToysRUs sells toys online, but no one calls ToysRUs a tech company. 

They are not tech companies also because the technology is not the key ingredient to found those companies and make them successful businesses. A tech person (like me) don’t have know-how in those business sectors. Instacart? Maybe OK. But definitely not Wevorce or Blue Apron. Wevorce was founded by a family lawyer and Blue Apron was started by a chef and a VC. 

In these cases, technology (mobile, data, etc.) is more like the enabler and catalyst. Technology can give these companies an edge over the disruptees in the trade. But if they don’t get the core of their trade right, technology won’t matter. If the spinach in Blue Apron’s big box had already wilted when it arrives my door steps, if they recipes tasted no much difference than average family meals, they would not have been successful. 

Don’t take me wrong. Instacart and Blue Apron are still awesome business innovations. Just don’t call them tech companies any more.

My Four-Buckets Engineering Velocity Model

When it comes to looking into bottleneck and improvement opportunities in the engineering velocity area, I use a four-buckets model, in terms of how long a task takes:

  1. Instant. This is something that only takes a few seconds to half a minute. Tasks like running a small set of unit tests, compiling a sub-folder or do a “git pull” for the first time in the last several days are in this bucket. While waiting for such tasks to finish, I don’t leave my desk. I would catch up on some quick conversations on IM, take a peek on my cellphone or reply an email while waiting.
  2. Coffee break. A coffee break task takes a few minutes, such as apply my private bits to a one-box test instance, run a large set of unit tests, etc.. Some time I go for a coffee or use restroom when such tasks are running.
  3. Lunch break. When a task takes longer time, such as half an hour or 1+ hour, I will grab a lunch while it’s running. Sometime I start the task when I leave office to pick up my boy and check the result when I get home.
  4. Overnight. Such task takes quite a few hours, or up to about half day. So we have to run them overnight: usually start at the night, go to sleep and check the result when we wake up in the next morning. If it’s started in the morning, we probably are not going to see the outcome until the evening.

Over the years, I have learned a few things in this four-buckets model:

  • A task’s duration will slowly deteriorate within the same bucket without being noticed, until it’s about to fall into the next bucket. For example, the build time of a code base may be 10 minutes in the beginning, which put it in the coffee break bucket. It can get slower over the course of the next several months, become 15 minutes, 20 minutes, …, as more code are added. Few will notice it, or be serious about it, until the build time gets close to half an hour, which is no longer a coffee break task, but a lunch break task. People feel more motivated/obligated to fix things to keep a task remain in the current bucket, than prevent it slowly deteriorating within the same bucket.
  • For maximum effect, when we make engineering investments in shortening a task’s duration, we should aim to move it into the next shorter bucket. Incremental improvements within the same bucket will have less impact on engineering velocity. For example, if an overnight task is shortened from 12 hours to 6 hours, it’s still an overnight task. But if it can be further shortened to 3 hours, that will transform the work style in the team: the team will be able to run the task multiple times during the day. It will dramatically change the pace of the team.
  • Incremental improvements within the same bucket are less likely to sustain, due to the first observation mentioned above. It’s going to be like Sisyphus rolling the stone uphill. Unless the stone is rolled over the hill, it will go back down to where it started. To avoid such regression and frustration, our investment should be sufficient to move the task into the next shorter bucket, or don’t make the investment and put the time/money/energy somewhere else.
  • There is a big difference between the “Instant” bucket vs. the next two, the coffee break tasks and the lunch break tasks: whether I have a context switch. For the tasks in the instant task bucket, there is no or little context switch. I don’t leave my desk. I remember what I wanted to do. I’m not multi-tasking. Once the task becomes longer and gets into the coffee break bucket, my productivity is one notch down. I have context switch. I have to do multi-tasking. We should try really hard to prevent the tasks in the “Instant” bucket from getting slower and dropping into the coffee break bucket, to save context switch and multi tasking.
  • Similar to the previous point, there is also a big difference between the coffee/lunch break bucket vs. the overnight bucket. On the tasks in the overnight bucket, I do worse than context switch. I sleep. It’s like close the lid of a laptop. It definitely takes much longer time and more effort to get the full context back after a sleep, than after a lunch break. We should try really hard to prevent any task slipping into the overnight bucket. It’s about whether it’s same day or not. Same day matters a lot, especially psychologically: in the past, we didn’t really feel the difference between Prime’s two-days shipping vs. the normal 3-5 days shipping; but when Prime has the same-day shipping, it feels substantially different.

Actually, there is a fifth bucket: “over the weekend”. Such task takes more than a day to run. I didn’t include it in my four-buckets model because if an engineering team ever has one or more critical tasks in the over-the-weekend bucket, they are seriously sick. They are deep in debt (engineering debt) and they should stop doing anything else[1] and fix that problem first, to at least move it into the overnight bucket. In a healthy engineering team, all the tasks can be done over a lunch break or sooner. Everything is same day. There is no overnight task[2]. That’s the turnaround time required to deliver innovations and customer values in the modern world.


[1] Just being exaggerate to highlight the point.
[2] With reasonable exceptions, such as some long-haul tests. Though many long-haul tests that I have seen could be replaced by shorter tests with certain testability designs.

Promotion Is Not a Birthday Gift

This week people in Microsoft are getting their annual review results: how much annual bonus and stock award they are getting, how much is the merit increase, are they getting a promotion or not.

Here is a true story that I’ve just heard today. A friend of mine, Sam[1], has told his manager on this Monday that he is leaving Microsoft to join another tech company in the region. At the same time, his manager delivered the annual review result to him. Surprisingly, he has got a promotion. Although Sam believed he deserved and qualified for a promotion since early this year, for various valid reasons he thought his chance would be slim this time. So he started to prepare for interviews a few months back, talked to a few companies and got the offer. The new job pays significantly higher than what he gets in Microsoft. So the promotion probably won’t change anything. No much loss to Sam.

But it’s a loss to his manager and Microsoft:

  1. The promo is kind of wasted[2]. It could have been given to someone else.
  2. Microsoft has lost a good engineer and there is a cost to replace him.

This is the reason why I never give my team members surprise when it comes to promotion. Promotion is not a birthday gift. “No surprise” is my rule of thumb in people management and other business scenarios. Not even a good surprise like promotion. I always told my people very early that I am getting him/her a promo. Then I keep him/her updated. A typical timeline looks like this:

  • May 1: “I’ve written a promo for you. Please take a look and let me know what I have missed”
  • May 15: “I have submitted the promo justification”
  • May 29: “I have presented it in the calibration meeting and there wasn’t much push back”
  • June 10: “The promo seems to be a done deal”
  • June 16: “The promo is OK at VP level”
  • July 5: “I haven’t heard any change to your promo”
  • Aug 15: “Here is your annual review result. Congratulations on your promo!”

If Sam knew he’s getting the promo back in May/June, he would likely not start looking outside, hence would not get this offer that no way Microsoft can match and would stay in Microsoft. Microsoft would have kept this talent.


[1] The name is made up.
[2] This statement is overly simplified. Do not misinterpret.

7 Things We Did Right in a Successful Data Migration Project

Someone was asking on Quora about how manage the migration of data when there is a database schema change. I shared how we did in a real data migration project back in 2006/2007. It was a payment system (similar to today’s stripe.com, but ours wasn’t for the public) that ran on .NET XML Web Service + SQL Server. In a much simplified way for ease of writing:

  • It had a Subscriptions database, in which there is the payment_instruments table, where we stored encrypted credit card numbers.
  • Having subscription_id on the payment_instruments table implied that we assume every payment instrument must belong to one and only one subscription.

2015-07-data-migration-before

Now we wanted to support standalone payment instruments, which doesn’t belong to a subscription. So we needed to migrate the payment instrument data into a new payment_methods table in a new Payments database:

2015-07-data-migration-after

It was a very successful data migration project. It had done quite a few things right, which I will repeat in any future data migration projects:

  1. We kept the old payment_instruments table. We added a new payment_method_id field to the payment_instruments table, so that the payment_instruments table acts as a proxy. The benefit is: we can keep most of the legacy code untouched, which can continue consume the payment_instruments table. We just need to change the data access layer a bit, to back fill the encrypted credit card number from the new payment_methods table, when all other legacy code is querying the payment_instruments table.
  2. We added a payment_method_migration_state field to the old payment_instruments table. This field is to indicate whether the old or the new table is the source of truth. We used an explicit field to be the indicator, rather than use an inferred value (for example, by looking at whether the encrypted_credit_card_number field is null in the old payment_instruments table), because an explicit and dedicated indicator of migration status is much less confusing than inferred status, which is usually more error prone because it gives something already in use a new meaning (on top of the original meaning). Also, the explicit indicator serves as a lock a little bit: when a migration is in progress, some update operation should be blocked.
  3. We use both online and offline migration. Online migration: any time a mutation API is called on a payment instrument, such as UpdatePaymentInstrument or PurchaseOffering (with a certain payment instrument), the migration code is triggered and runs in the Web frontend, which insert row to payment_methods table, copy over the encrypted_credit_card_number value, back fill the payment_method_id in the old table and set the payment_method_migration_state. Offline migration: we have a standalone tool running in our datacenter, which go through the existing payment instruments and migration them one by one. The reason we had offline migration on top of online migration was because some customers only used our system very infrequently, such as once every three months. We don’t want to wait for three months to migration their data.
  4. Controlled migration at per-customer level. We designed it in a way that we can select a batch of customers to be eligible to do the migration (in both online and offline). In that way, we can start with a very small number (say 100 customers), and expand to 1000, 10000, 10% of the system, then all. We did find some critical bug during the first several small batches.
  5. Due to compliance requirement, we must not keep the encrypted_credit_card_number data on the old table. But we didn’t do the deletion until the entire migration is done done. That’s because if anything seriously goes wrong, we still have chance (even just in theory) to go back to the old data schema. Actually, we did have some bug which messed up data (putting encrypted_credit_card_number on the wrong payment_method_id) and having kept the old data allowed us to redo the migration correctly. It saved the day.
  6. We made the two new fields on the old payment_instruments table Nullable, rather than a default value, to prevent the data page from rearranging for the existing rows (nearly hundreds of millions of them). For the same reason, when we removed the encrypted_credit_card_number data on the old table, we didn’t delete it but set it to an all-spaces string which has the equal width as the original encrypted blob.
  7. During testing, we modified the deployment script to be able to deploy both old and new version of the frontend side by side. Because the AddPaymentInstrument API in the new version will always put data in the new schema. We needed the ability in our test automation to create data in the old schema, in order to test the migration code. This ability is actually not only useful in data migration project, it’s generally useful in online services: it’s always good to know whether the data created by older version(s) can be correctly handled by the new version.

The above 7 things that we have done right will be applicable to future data migration projects that I will do. #6 (preventing data page from rearranging) may be specific to SQL Server, but its spirit is widely applicable: better understand the underlying implementation of the database system, to minimize the performance hit when migrating non-trivial amount of data or touching a lot of rows.

Besides, two more takeaways of mine are:

  1. Have the right expectation. Data migration will be hard. After spending all the time in design, implementation and testing, the actual migration will also take a lot of time. In our project, we ran into weird data patterns in production that we never thought it would be possible. It turned out to be the result of some old code which is now gone (either retired, or fixed as a bug). In production, we also discovered quite some bugs in our migration code that were hard to discover in test environment. It takes many iterations to discover them, fix, test the fix, roll-out the new bits, resume the migration and discover a new issue. It would be helpful if you could get a snapshot of full production data to test your migration code offline. But in some cases, due to security/privacy/compliance, the data to be migrated must not leave the production data center and sanitizing it will defeat the purpose.
  2. Do not do migration of frontend and database at the same time. If you must abandon both the old frontend (e.g. REST API, Web UI, etc.) and old database, do it in two steps: First, do the data migration. Keep the frontend unchanged to customers, and only change the frontend code under the hood to work with the new database. Second, build a new frontend on top of the new database. For sure the two-steps way sound more costly. But in my experience (I have done both ways in different projects), the two-steps way counter-intuitively will end up more cost efficient, less risky, more predictable and more under control.