When Code Review Becomes Annoying

Overall, code review is useful.

From time to time, my reviewers catch bugs that I overlooked, or point out options that fell into my blind spot. Sometimes I learn new things in code reviews. For example, last year a reviewer suggested that I use ThreadLocal<T>, which did simplify my code a lot. Some code reviews aren’t code reviews per se; they are more of an FYI: “Hi, I have written this code and I’d like you to be aware of it and get familiar with it.” That happens when you are the only person on the team who truly understands a piece of legacy code. Although this FYI type of code review isn’t much help to the author, it’s still good for the team.

But sometimes code review can become annoying, especially when people spend time on things that (in my opinion) don’t really matter. For example:

I understand there are differences, though often very subtle or trivial ones, between string vs. String, readonly static vs. const, etc. But those differences don’t do any real harm. An explicit declaration, such as Stopwatch stopwatch = Stopwatch.StartNew(), doesn’t make the code harder to read[1] than var stopwatch = Stopwatch.StartNew(). String.Join doesn’t make the code slower than string.Join. Putting using outside of the namespace block doesn’t make the code harder to work on. Besides, by default all versions of Visual Studio put using outside of namespace in the code they generate.

I really don’t like spending time on those things in code reviews. I don’t think they matter to product quality or engineers’ productivity. There are so many other things that we wish we had more time for to improve quality and productivity. Debating those things is like debating whether to put one space or two after a period.

What people should do is make sure that their team has collectively decided which StyleCop[2] rules to turn on in their code base and has included them in the build. Once that’s decided and has taken effect, there will be no more debate: if a rule is violated, it is a build error, and we don’t submit code for review until it at least passes the build. Simple and clear.


[1] Readability is both objective and subjective. There is no doubt that a line longer than 120 or 150 characters is hard to read, and that single-letter variable names are hard to read. But whether Stopwatch stopwatch = Stopwatch.StartNew() is harder to read than var stopwatch = Stopwatch.StartNew() is really personal.
[2] Or the equivalent of StyleCop in other languages.

Too Many Retries Erode the Quality

On one of the teams that I have worked on, our code was full of retries. To name a few:

  • When the code was looking for a storage account, it would retry up to 30 times when the storage account didn’t exist. The retries took 13 minutes in total.
  • When the code was looking for a setting value that didn’t exist, it would also retry 30 times, which took 13 minutes.
  • When the code was trying to read an XML file from a file share, it would retry up to 30 times over 13 minutes when the XML file didn’t exist.

That was too much. When a system retries too many times in too many places, the worst consequence isn’t the wasted time. The worst consequence is that it erodes the quality of the system in an irreversible way. I have seen groups where people were used to simply adding more retries, and retrying more times, whenever they ran into intermittent issues. Lots of genuine code bugs are intermittent by nature, and adding retries covers these bugs up. Over time, as the code gets reused and the system scales, the only way to maintain the same level of availability is to add more and more retries. It’s like reinforcing only the siding of a house while its frame is being eaten by termites. Adding a retry is simpler and cheaper than getting to the bottom of an intermittent issue. It takes some courage not to go down the easier (but irreversible and poisonous) path.

My rule of thumb for retries is that we should not retry, or should retry only a very few times, in the following situations:

  1. AuthN/AuthZ failure
  2. HTTP 404 or similar error code indicating resource not found
  3. HTTP 400 or similar error code indicating bad input
  4. Timeout

Most of the time, if a resource is not found, it’s not found. Retrying won’t help. The resource is not going to show up all of a sudden several minutes later[1]. In general I also avoid retrying on timeout errors. A timeout indicates that something has become very slow. Maybe the database is too busy. Maybe the frontend is under high load. Retrying on a timeout will likely make things worse.

The situations where I think it’s necessary and useful to do some retries are:

  1. Network glitches, such as “System.ServiceModel.CommunicationException: The socket connection was aborted”
  2. HTTP 500 or a similar error code that indicates a server-side error
  3. HTTP 503 or a similar error code that certain systems use for throttling. Sometimes the response includes guidance for the client: whether the client may retry, how long the client should back off, …
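The two lists above can be condensed into a small decision function. The following is only a sketch in Python. The names classify and Decision, the mapping of AuthN/AuthZ failures to HTTP 401/403, and the choice to fail fast on anything not listed are assumptions made for illustration, not part of any real library.

```python
from enum import Enum


class Decision(Enum):
    RETRY = "retry"
    FAIL_FAST = "fail_fast"


# Server-side error and throttling codes that the lists above treat as worth a few retries.
RETRY_STATUS = {500, 503}

# Bad input, not found, and (assumed here) 401/403 for AuthN/AuthZ failures.
NO_RETRY_STATUS = {400, 401, 403, 404}


def classify(status=None, timed_out=False, connection_aborted=False):
    """Decide whether a failed call is worth retrying at all."""
    if timed_out:
        # A timeout means something is slow; retrying likely adds load.
        return Decision.FAIL_FAST
    if connection_aborted:
        # Network glitch, e.g. an aborted socket connection.
        return Decision.RETRY
    if status in RETRY_STATUS:
        return Decision.RETRY
    # NO_RETRY_STATUS and anything unknown: fail fast (an assumption of this sketch).
    return Decision.FAIL_FAST
```

A caller would check the decision first and only then consult its retry count and interval, so that a 404 never enters the retry loop.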

It’s worth pointing out that retries should mainly be used to smooth out glitches[2], rather than to handle outages. If a service or resource has problems for more than a few seconds, it’s an outage. When there is an outage, the caller should treat it as one: mark the operation as failed, write the necessary logs, and tell its own caller that things have failed. I usually frown when I see code spend more than a few seconds retrying.

Another often-debated topic is where to put the retry. My take is: either as close as possible to where the failure happens, or as close as possible to the originator of the bigger operation. The benefit of putting retries close to where the failure happens is that you don’t have to worry too much about idempotency and side effects when your code retries. But you have to be very careful with the footprint of the retry. Excessive retries in lower-level code can be amplified across the whole system: for example, if three nested layers each make up to three attempts, a single failing call at the bottom can be attempted up to 27 times.

Last but not least: do not have a default retry count and interval in your code base. The examples above all retried 30 times over 13 minutes. That’s because they all used the internal Retry lib with its default values for retry count (which was 30) and interval length. To make it worse, this spreads when people copy and paste code (they do!). Not having a default retry count and interval length forces developers to make a conscious choice every time about how many times they want the code to retry.
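One way to get the "no defaults" property is to make the count and interval required arguments. The sketch below is in Python and is not the internal Retry lib mentioned above. The function name retry and its parameters are hypothetical, and the fixed interval is a simplification.

```python
import time


def retry(operation, *, max_attempts, interval_seconds, is_transient, sleep=time.sleep):
    """Call operation(), retrying only failures that is_transient(error) accepts.

    max_attempts, interval_seconds and is_transient are keyword-only and have
    no default values, so every call site has to choose them deliberately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            # Give up on the last attempt or on errors that retrying cannot fix.
            if attempt == max_attempts or not is_transient(error):
                raise
            sleep(interval_seconds)
```

Calling retry(operation) without the keyword arguments raises a TypeError, which is the point: a copy-pasted call cannot silently inherit 30 retries from somewhere else. The sleep parameter has a default only so that tests can replace it.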


[1] Retry is different than polling. We use polling when we expect to see changes. It’s OK to spend several minutes or even longer to wait for things to happen.
[2] People also use other terms like “hiccups”, “blips”, “transient error”.

Success Is the Mother of Failure

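When I was visiting New York last Christmas, I was pleasantly surprised to find that the taxis there all supported Apple Pay. Every yellow cab had an Apple Pay “card reader” mounted on the back of the front seat. When we reached the destination, sitting in the back seat, all I had to do was hold my iPhone up to it and rest my thumb on the Home button so that my fingerprint was recognized, and the payment was done. No need to unlock the phone, open any app, or scan any code. Very convenient.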

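That said, mobile payment in the U.S. has on the whole developed much more slowly than in China. As someone was saying on WeChat Moments a while ago, in Shanghai these days you can pay for pretty much everything with your phone when you go out, so you no longer need to carry a credit card. One of the main reasons mobile payment lags in the U.S. is that credit cards are so well established. And in the last couple of years, with Square, even a little stand selling smoothies for a few dollars can take credit cards. With credit cards that convenient, there is naturally little motivation or demand to push hard on mobile payment.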

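So in a sense, success is the mother of failure. When a society, a country, or a market does too well at one stage, it increases the resistance it faces in moving to the next stage. Perhaps ten or twenty years from now, looking back at today, we will also see that in some other areas of the Internet and high tech, China’s success today turned into resistance to its development over those ten or twenty years.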

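Because the U.S. is vast and sparsely populated and labor is expensive, it has been moving very actively toward robotics and unmanned operation in recent years: self checkout, already common in many supermarkets; Amazon’s drone delivery; Boston Dynamics’ robot dog; the iRobot robotic vacuums in many homes. In this year’s State of the Union address, the President also announced support for accelerating the development and adoption of self-driving cars. The restaurants in the Newark airport terminal have almost no waiters any more. Every table has a tablet; diners sit down, order on the tablet, and when they finish they swipe their card on the tablet to pay. The only part of the whole meal that needs a server is carrying the food out of the kitchen.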

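Labor in China is much cheaper by comparison: hourly domestic helpers are plentiful and cheap, couriers on electric scooters are all over the streets, it isn’t hard for a supermarket to hire a few cashiers, and even a roadside public toilet kiosk with just two stalls has someone whose job is to collect the fee. Luo Zhenyu says that e-commerce in China is far more developed than in the U.S. because China has a lot of people and labor is cheap. He is right. But this may also leave Chinese companies and users without the motivation and strong demand to work on robotic vacuums, drone delivery, self checkout, serverless restaurants and driverless cars. Liang Jianzhang says in his book that China’s abundant population is a good thing, but it may also lead to a “resource curse”. I think he has a point.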

Do You Have A Reservation?

On the last day of 2015, after lunch, I went to get new glasses.

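Getting glasses in the U.S. is very different from getting them in China. In the U.S., glasses, for nearsightedness for example, require a doctor’s prescription. A nearsighted person like me first has to see an optometrist (OD, Doctor of Optometry) for an eye exam. Based on the results, the optometrist writes a prescription stating what strength of lenses I should wear. The prescription also distinguishes between regular glasses and contact lenses; some prescriptions state explicitly that they are for regular glasses only. Then I take the prescription to an optical shop, pick a frame, and the shop makes the glasses according to the prescription. The prescription is valid for only one year. If I want new glasses after that, I have to take another eye exam and get a new prescription.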

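This process sounds quite cumbersome. It’s not like China, where nobody asks for a prescription. In all my years of getting glasses in China, the only time I went to a big hospital for an eye exam was for my first pair in fifth grade, when I also had eye drops put in for a day to dilate my pupils; after that it was basically just a quick “computerized eye test” and done. Fortunately, many optical shops in the U.S. have an “in-house doctor”; at least the ones I have been to all did. An optometrist sets up a practice inside the shop, and most of the business is examining customers who come for glasses and writing prescriptions. At the shop I went to, for example, the optometrist is Dana Kindberg. Her small practice has everything it needs: a front desk handling reception and billing (including insurance), two assistants who help with routine checks, and herself. After getting the prescription I could walk out into the shop and order glasses right away, pay, and pick up the new glasses an hour later. So it is actually quite quick and convenient.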

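But because it was the last day of 2015, the shop was especially crowded. Many health insurance plans have an annual allowance of a hundred or two hundred dollars toward glasses, and it would be a waste not to use it; a spare pair to keep at home is nice to have. So optical shops are especially busy in the last day or two of every year. I got there at two in the afternoon, and the optometrist’s front desk said the only opening left was at 4:20, and that was only because someone else had cancelled. So I booked the 4:20 eye exam, went back to work for a while, picked up Zheng Yijia from preschool, and returned to the shop at 4:20 to have my eyes examined and order my glasses.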

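Many people like me, who lived in China for decades before coming to the U.S., don’t much like or aren’t used to the American habit of needing an appointment everywhere. I didn’t much like it or feel used to it either at first. Car servicing needs an appointment (you can just show up for an oil change). Seeing a doctor needs an appointment. Getting an ultrasound or an X-ray needs an appointment (not always: my family doctor’s clinic has its own X-ray machine, and when I hurt my back last time they took the X-ray on the spot). Campgrounds in national parks need a reservation, the car ferry to San Juan Island needs a reservation, getting a phone repaired at the Apple Store needs an appointment, and so does getting Zheng Yijia a haircut. If you just walk in without an appointment, eight times out of ten you won’t get a haircut, not because the salon is full, but because the stylists working that day are already booked for the whole day. At many places, the first thing the front desk says to me is “Do you have a reservation?”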

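Once I got used to a society where you make a reservation for everything, I found this way of doing things quite efficient. From the customer’s point of view, it is easy to plan your time; you don’t have to worry about making a wasted trip or estimate how long you will wait in line. Otherwise, the time spent in line is an unknown, and with too many unknowns it is hard to plan your time. If you schedule things too tightly and one of them is delayed by a long line, everything after it gets pushed back, and you may waste other people’s time too. In a society that uses reservations widely, everyone’s time has better predictability.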

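Widespread use of reservations is good for businesses too. It helps them allocate resources efficiently and reduce waste, so they don’t end up with a lot of technicians, servers and salespeople sitting idle in the shop. A reservation system is like an invisible hand: it helps customers avoid peak hours, and it helps businesses do “peak shaving and valley filling” on traffic and demand. The benefits of peak shaving and valley filling are obvious. The state uses time-of-use electricity pricing to steer some of the demand for electricity to off-peak hours at night. The facilities in many tourist areas are badly overloaded during the May Day and National Day long holidays but sit largely idle most of the rest of the year, so many people have proposed cutting back the week-long holidays, adding more short breaks such as 3-day weekends, and adding paid vacation, to shave the peaks and fill the valleys of tourist numbers.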

Better predictability, more efficient resource allocation, less waste: on the whole I am in favor of using reservations widely.

Why We Should Increase Paid Paternity Leave

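Because when paid paternity leave is increased, it is women who benefit, not men. Increasing paid paternity leave reduces the disadvantage women face in the workplace.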

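Many companies in China give men only two weeks of paid paternity leave, while female employees get three months of maternity leave, and those above a certain age who meet the late-marriage, late-childbirth criteria get four months paid. The situation in the U.S. is similar. Microsoft had long given men four weeks and women twelve weeks. This year Microsoft adjusted its benefits to three months for men and five months for women. Many other companies are similar: the lengths differ, but overall women get much more leave than men. Looking at other countries around the world, the situation is largely the same (see https://en.wikipedia.org/wiki/Parental_leave).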

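If a society generally gives men only two weeks of leave but gives women three or four months, then having children will affect women’s careers far more than men’s.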

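Two weeks of leave is about the same as an ordinary paid vacation, like spending two weeks abroad. After coming back, you can get through the unread email from those two weeks in a day or two and be back up to speed in a day or two. Two weeks of leave has a negligible effect on work. Three or four months of leave is a completely different matter. Come back after three or four months away and you may not even understand the code you wrote in the last few weeks before the baby. In those three or four months there may have been a reorg, new projects may have started, you may have a new boss, the original design may have been changed. And everyone is so busy; who has the time to explain it all to you bit by bit? Coming back from three or four months of leave, you are in a state very close to that of a new hire, and it takes at least a few more weeks to get fully back into work.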

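Many companies do performance reviews once a year. If the three or four months of leave fall in the first half of the fiscal year, that’s not too bad: after coming back and catching up, there is still half a year until the review. Work hard during that half year and the impact on that year may not be too great. But if, as luck would have it, the leave falls in the middle of the year or later, then by the time she comes back there is not much time left before the review. Nothing worth writing home about can be done in that little time. That year’s review is then pretty much a write-off: basically an average rating, and any promotion waits until the next year. In many cases more than one year is lost. Many managers, once they know a female employee is pregnant, stop assigning her to critical roles, for fear that she will suddenly go into labor one day before the handover is finished. Add in feeling unwell during pregnancy and the time spent on prenatal checkups, and having a child often sets a woman back two years.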

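Most people have children in their twenties or early thirties, when they have been working for as few as three to five years and at most around ten. At that stage, losing a year or two is actually a huge loss. People at that stage are typically facing a significant hurdle in their careers: going for Senior or Staff, moving from IC to Lead, or moving from first-line to second-line manager. If they don’t clear the hurdle at that point, they may have to wait many more years. For example, if a female employee has not made Senior or Staff by then, then after the baby the time she can put into work will inevitably shrink somewhat, and she will be at a big disadvantage when competing with colleagues who have no children yet and are three or four years younger.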

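By comparison, men of the same age don’t face such a disadvantage. They lose hardly any time. Men don’t have pregnancy discomfort or prenatal checkups, they take only two weeks of leave, there are parents at home to help with housework, and they can hire a maternity nanny or hourly help. Work goes on pretty much as usual: overtime when needed, business trips when needed. That is why the ratio of women to men is not too unbalanced in lower-level positions, but in many companies and industries the proportion of women drops sharply from mid- and senior-level management upward. The reason is that many women fall behind because of having children. If paternity leave were as long as maternity leave, this falling behind could at least be greatly alleviated.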

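Two-week paternity leave has another, hidden negative effect on women’s careers: a man who takes only two weeks of leave cannot deeply and personally experience how hard it is to care for a baby.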

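Two weeks go by in a flash. In many cases the wife has not gone back to work yet, and the grandparents may have come to help as well. During those two weeks, the many caregiving tasks, such as changing diapers, burping, warming milk, holding the baby for daytime naps, getting the baby to sleep at night, and bathing the baby, are in many cases still done by the wife or the grandparents. After two weeks of leave, many dads still don’t really know how to change a diaper or how to get the baby to sleep. Without a lot of first-hand experience, they don’t understand how hard it is. They have never held a napping baby for an hour or two until their whole shoulder and arm are so sore and numb they cannot lift them; they have never struggled out of bed every night for a night feeding. So these dads cannot feel heartfelt gratitude for what their wives do. Their views may stay stuck in ignorance and misunderstanding like this: “What’s so hard about holding the baby while he/she sleeps? I’d love to hold my son/daughter every day”, “Why hold the baby to sleep? Why not put him/her in the cradle?”, “Why do night feedings? He/she keeps waking up at night precisely because we keep doing them”, and so on.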

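The most effective way to dispel this ignorance and misunderstanding is to have dads take care of the baby themselves, and for a good while. I took one month of leave myself. By then Zhu Fenglin had finished her four months of maternity leave and gone back to work. During my month, I was home alone with Zheng Yijia during the day, and I was also the one who got up for the night feedings. In that month I fully appreciated how hard the previous four months had been for Zhu Fenglin. I believe that if Microsoft had offered three months of leave to male employees at the time, I would have taken all three months and come to understand the hardship of caring for a baby even more deeply.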


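Dads who didn’t learn to change a diaper during their two weeks of leave may never learn afterwards. For the next year or two or longer, changing diapers may be entirely mom’s job. And because these dads never learned, the occasional diaper they do change is done clumsily, so many moms simply take the whole thing over: “Never mind, I’ll change it. You go fold the laundry.” Similarly, many other caregiving tasks end up falling to mom for similar reasons: “Never mind, I’ll put him/her to sleep, you go do your thing”, “Never mind, I’ll feed him/her, you go wash the dishes”. So in a society where paternity leave is only two weeks and maternity leave is three to four months, women not only have their careers stall for a year or two because of the leave, but also take on more caregiving than men for the following two or three years, which further adds to women’s disadvantage at work.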

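Some may say that even if law and policy gave men and women the same amount of leave, say three months each, many men might not take it. That is entirely possible. I have seen many people who don’t even use up their paid vacation each year. But once men and women get the same amount of leave, at least some dads will take it. Today some husbands may have this excuse: “I’d like to take more leave to help at home, but the rules only give us two weeks.” By then, at least that excuse will be gone. Even if only 10% or 20% of men take the full three months in the first few years, that would still help shift social attitudes. Among the companies I know of, Facebook does this best. At Facebook both men and women get 4 months of leave, and quite a lot of Facebook’s male employees take the full 4 months. This makes me believe that if the law required all companies and government agencies to offer male and female employees the same length of paid leave, many men would soon be willing to take more.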

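I believe that making such a law in China would not only be beneficial; it should be enacted and take effect as soon as possible. Otherwise the effect of fully lifting the limit to allow a second child will be greatly diminished: if having just one child already affects a woman’s career this much, then having two would basically ruin it. The result would be that many working women either won’t want a second child or will simply stop working after having one. Either way, the goal behind allowing a second child, growth of the working population, will not be achieved.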

Deshengmen

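Gao Xiaosong said that when he was little he could see the Fragrant Hills (Xiangshan) from Deshengmen; later he could only see Xizhimen from Deshengmen; and now, standing at Deshengmen, you can hardly even see Deshengmen. Quite amusing, a typical Beijinger’s wisecrack.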

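Deshengmen has a familiar ring for me, because I too spent a year “drifting” in Beijing, and I lived at Deshengmen. Strictly speaking it was Deshengmen Outer Street (Deshengmenwai Dajie), “Dewai” for short. Its counterpart on the other side is Deshengmen Inner Street, “Denei” for short. Back then, when people asked where I lived and I answered “Dewai”, I felt rather proud. I lived there the whole year I was at the engineering institute in Beijing, walking ten minutes to Jishuitan station every morning to take the subway, so I passed Deshengmen every day. I especially liked the feeling of walking past Deshengmen. So later, back in Shanghai, on my way from home on Nandan Road to work at Ganghui, I would sometimes take a small detour to pass in front of the Xujiahui Catholic church.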

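Seeing something really beautiful on the way to work every day, whether scenery or a magnificent old building, and starting the day that way is a wonderful feeling.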


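Later I came to the U.S. and first lived by Lake Sammamish for a year. Every day on the way to work I drove north along Lake Sammamish for a dozen minutes or so before joining the busy highway. That lakeside road is really beautiful. That is probably why it is called East Lake Sammamish Parkway. It seems that in the U.S. any road called a Parkway is somewhat scenic, and driving along it is a small pleasure. After Zheng Yijia was born we moved to our current home. Now a stretch of the road on the daily drive to Zheng Yijia’s preschool faces Mount Rainier directly: that tall snowy mountain, sometimes half hidden in cloud and mist, sometimes with its summit rising into the clouds as if wearing a hat, and sometimes, under a cloudless sky, with its snow glittering golden in the morning light. Every day when we pass it Zheng Yijia shouts: “Daddy, look, Mountain Rainier!”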

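I think that when Zheng Yijia grows up he will always remember that moment of seeing Mount Rainier on the way to school each day as a child, just like the Fragrant Hills seen from Deshengmen in Gao Xiaosong’s memory.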

Zheng Yijia Takes Airplanes

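I first flew when I was in high school, from Xi’an back to Shanghai. Zhu Fenglin first flew at 23. Zheng Yijia is only four and has already taken forty-some flights: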

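  • 2012: Seattle to Honolulu round trip; Seattle to San Diego round trip; Seattle to Calgary round trip; Seattle to Shanghai round trip, via Seoul; Seattle to Los Angeles round trip.
  • 2013: Seattle to Las Vegas round trip; Seattle to New York round trip; Seattle to Fort Lauderdale round trip, via Houston; Seattle to Rome round trip, via Frankfurt.
  • 2014: Seattle to San Francisco round trip; Seattle to Tahiti round trip, via Los Angeles; Seattle to London round trip; Seattle to Shanghai round trip, via Tokyo; Shanghai to Zhuhai round trip.
  • 2015: Seattle to Frankfurt round trip; Seattle to San Francisco round trip; Seattle to Maui round trip; plus the Seattle to New York round trip we will fly at the end of this month.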

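We have taken Zheng Yijia to so many places for a very simple reason: Zhu Fenglin and I want to travel ourselves. Our family doesn’t have the option other families have of leaving the child at home for a week or two with grandparents visiting from China, or simply sending him back to China for a few months. Wherever we go we have to bring Zheng Yijia along, including the several trips to San Francisco to get Schengen visas. We weren’t trying to broaden his horizons through travel, at least not at this age. Kids this age don’t have much long-term memory yet; I myself only remember things from age five or six onward. We have found that Zheng Yijia remembers essentially nothing from before he was two. At 24 months he went to Italy with us, ate a lot of pizza and pasta, and had a flock of pigeons snatch the bread from his hand in St. Mark’s Square in Venice. When we asked him about these things later, he didn’t remember them at all.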


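Zheng Yijia probably started remembering things somewhere between two and three. Last year, at 30 months, he went to Tahiti with us and saw a stingray when we went snorkeling, and that seems to have stuck. Last month at the aquarium on Maui, as soon as he saw a stingray he pointed and called out its name. When he was nearly three he went to England with us and watched the changing of the guard outside Buckingham Palace in London. He was thrilled and told us he really liked watching the “uncles drumming”. Some time after we got back, he saw a guard-shaped fridge magnet we had brought home, pointed at it and told us that it was “uncle drumming”. I think from now on, from four onward, he should remember pretty much everything, and future trips may do more to broaden his horizons and build his abilities.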

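Our first flight with Zheng Yijia was to Hawaii when he was four months old. Before that trip Zhu Fenglin and I were quite nervous, not knowing what might happen with such a small baby away from home. The flight went fairly smoothly, except that Zheng Yijia got airsick and threw up. So ever since, we have put a spare set of clothes in the carry-on whenever we fly with him. We are especially watchful during descent: when he doesn’t look quite right, we quickly get the airsickness bag in hand. As he has grown older, though, he rarely gets airsick any more, maybe because he has flown so much that he is used to it. Or maybe the vomiting back then wasn’t airsickness at all; maybe, when a child is small, the cardia of the stomach isn’t fully developed, and with the low cabin pressure the stomach contents come up easily. Probably much the same mechanism as a baby spitting up milk; that is my guess. On the other hand, Zheng Yijia has almost never had ear discomfort from pressure changes. I think that is mainly because he has flown since he was little, a dozen or so flights a year, and got used to it long ago.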

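Zheng Yijia’s second trip to Hawaii was just last month, after his fourth birthday. Flying with him this time was very easy for Zhu Fenglin and me. During check-in and security he pulled the carry-on suitcase for us the whole time. He loves that suitcase. One reason is that its wheels are very good: spinner wheels that roll especially smoothly. When he was younger he liked pushing it around the airport, and he always said it was his suitcase. Now that he is four and tall enough, he pulls it along the way we do, drawing plenty of looks along the way.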


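On the flight back from Hawaii this time, Zheng Yijia buckled his own seat belt as soon as he sat down, and then urged Zhu Fenglin and me: “Daddy, you and Mommy need to buckle your seat belts too.” The captain announced that the plane was about to taxi and asked everyone to stow their tray tables. Zheng Yijia immediately closed the iPad and told us, “Mommy, I’ll wait until the plane is flying level before I watch, OK?” Zhu Fenglin and I cracked up. A while into the flight there was an announcement that drink service was starting. Zheng Yijia had his headphones on, but as soon as the announcement mentioned drinks he whipped the headphones off, closed the iPad and put it away, cleared his tray table, then turned to us and said “I am ready for drink”, and we cracked up again. After his snack he watched a bit more Peppa Pig, then said “Mommy, I’m sleepy”, took off his headphones, lay down on Zhu Fenglin’s lap and was asleep in moments. He slept for two or three hours straight, skipping the in-flight dinner, until we hauled him up when it was time to get off. With Zheng Yijia asleep we could do our own things, apart from pulling his blanket up now and then so he wouldn’t get cold. While he slept, Zhu Fenglin watched two movies, San Andreas with Dwayne “The Rock” Johnson and 《暴疯雨》 with Lau Ching-wan and Huang Xiaoming, while I went on reading my book, wrote some of my blog, and read some more.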

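It took a while for Zheng Yijia to get to where he is today. Flying with him when he was little was pretty tiring for us. He couldn’t really entertain himself yet, there wasn’t room on the plane to spread out the Magna-Tiles he loved, and Zhu Fenglin and I had to take turns reading him book after book. When he was little he would cry and fuss on planes too. Fortunately, flights here in the U.S. generally have quite a few small children on board, and often there are little ones crying here and there throughout the cabin. So we didn’t feel quite so guilty when our Zheng Yijia cried. Many of the other passengers have themselves flown with small children, so they are especially understanding. I have never come across a passenger complaining about a nearby child crying. Overall the atmosphere is very tolerant. When Zheng Yijia was smaller we would also bring his car seat onto the plane, install it on his seat, and have him sit in it. That way it felt a bit like being in the car; he slept more comfortably in the car seat and didn’t slump over. Once he was a bit older we stopped bringing the car seat on board, since it made things too cramped and awkward.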

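Speaking of Zheng Yijia’s seat, one downside here in the U.S. is that a child of any age who has his own seat needs a full-fare ticket. A child under two can fly on the lap, though, and then no ticket is needed. That is one of the reasons we went to so many places and took so many flights before Zheng Yijia turned two; once he was past two we had no choice but to buy him full-fare tickets. One very thoughtful and convenient thing is that, whatever the child’s age, as long as the child is one of the passengers, checking the child’s stroller and car seat is free. So we bring our own car seat everywhere and have never paid a rental car company for one.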

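After all these flights over the years, Zhu Fenglin and I have worked out some rules of thumb for picking flights that suit a small child. The main criterion is timing: we try to have Zheng Yijia’s afternoon nap fall during the flight, so that once he is asleep Zhu Fenglin and I can relax a bit and do things we want to do. When there is a connection, as when we connected in Houston on the way to Florida, we sometimes pick an itinerary with a two- or three-hour layover that falls in the early afternoon, so that Zheng Yijia falls asleep in the stroller as we push him around. It also helps to pick a time when lunch or dinner is served on board. A child with something to eat on the plane is less likely to fuss. In short, pick flight times around the child’s daily routine; going with the flow makes things easier. Of course, picking flights this way may mean not getting the cheapest one. It costs a bit more each time, sometimes a few hundred more for the three of us, but it is worth it for a better trip.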

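Of Zheng Yijia’s forty-some flights, quite a few were long-haul, trans-continental ones. Next time I’ll write a separate post about taking long-haul flights and adjusting to jet lag with Zheng Yijia.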

Life as a Two-Income Family

Zheng Yijia was sent to day care when he was five months old.

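We had no choice; there was no other way. We both used up all the parental leave our companies gave us, and Zhu Fenglin took an extra month of unpaid leave, and that is how we got through the first five months. For various reasons, neither my parents nor Zhu Fenglin’s could come to the U.S. to help with the baby. Only Zhu Fenglin’s mother came, but she went back to China once Zhu Fenglin’s first month of postpartum confinement was over. For the four months after that it was just the two of us toughing it out.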

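Once we were both back at work, we had to send Zheng Yijia to day care. Of course it hurt: he was only five months old and not even weaned. Zhu Fenglin pumped the milk and froze it in the fridge, and every morning we took a few bags to the day care, where they thawed and warmed it for him during the day. After Zheng Yijia had been at that day care for a month, we found a live-in nanny, Ms. Yang, and kept him at home.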

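The main reason was that Zheng Yijia wasn’t sleeping well at that day care. Its facilities couldn’t compare with those of Bright Horizons. Bright Horizons has a large room dedicated to infants; at nap time the door is closed and it is quiet inside. That day care was run out of an ordinary house, with children of all ages. Zheng Yijia still needed three naps a day then. With older kids playing in one room, his sleep in the next room was bound to be disturbed.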

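Zhu Fenglin and I both care a great deal about Zheng Yijia’s sleep. We believe that many sleep problems in adulthood have their roots in nervous system development and the formation of sleep habits and sleep ability in infancy. We also believe that good sleep helps a child focus better. By comparison, we weren’t as concerned about when he would crawl, sit, walk, talk, use the potty by himself, and so on.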


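Ms. Yang worked for us for a year, until Zheng Yijia was a year and a half old. During that time Zhu Fenglin’s mother came once and stayed a few months. She and Ms. Yang left at about the same time, just when a spot opened up at Bright Horizons. Right after they left, it took us a while to adjust. Only with the contrast did we really appreciate how easy life had been while they were here. At the very least, the housework had all been taken care of: no cooking, no washing and folding laundry, no vacuuming the carpet. Every day we came home to a hot meal ready to eat, dirty clothes always came back to the closet washed and neatly folded, and the dirty dishes in the sink seemed to return to the cupboard clean all by themselves.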

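From then until now, more than two years, Zhu Fenglin and I have lived the two-income life. Zheng Yijia turned four this month. In these two-plus years our only help has been a cleaner, four hundred a month, who comes twice a week for an hour or two each time, mainly for the less frequent chores such as mopping the floors, vacuuming the carpets, and cleaning the bathrooms. It would be impossible not to envy people in China: hourly help there is so much cheaper relative to what we earned in China that we could hire someone to come every day to do the housework and cook dinner as well. An Indian acquaintance of Zhu Fenglin’s moved back to India for this very reason: in India he can hire a whole staff of servants.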

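Doing all the housework ourselves, we just got used to it. And there are even some upsides. For example, while Zheng Yijia eats breakfast in the kitchen, I unload the dishwasher next to him. That beats looking at my phone next to him while he eats. Zheng Yijia has also learned to put his cup straight into the sink after finishing his milk. Washing and folding laundry has become a parent-child activity too. Zheng Yijia loves helping us move the clothes from the washer to the dryer, and he likes closing the dryer door and pressing the Start button. We often call him over to fold the dry clothes with us. At first we had him do the sorting: separating Daddy’s, Mommy’s and Jiajia’s clothes into their own piles. Later, watching us fold, he started folding too; he likes folding his own socks and won’t let us help. If all this housework were done by a cleaner or by grandparents who had come to help with the child, he wouldn’t have these chances to take part.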


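In fact, most Chinese families in the U.S. have parents come over to help. In many families it is the paternal grandparents for half a year, then the maternal grandparents for half a year, then the paternal grandparents again, continuously with no gaps. In others it isn’t continuous, but at least one set of parents comes to help for half of each year. It would be impossible not to envy them. With grandparents around, many things are a lot easier. Zhu Fenglin and I often find ourselves sighing over this: when eating out, grandparents could watch the child so that we could eat in peace; with a grandparent at home in the evening, the two of us could go out to a concert or a ball game. Although neither federal law nor Washington state law strictly requires it, the norm here in the U.S. is that you may not leave a child under 12 home alone. If you are found out and reported, in the worst case the child can be taken away. You can hire a babysitter, but for one thing a babysitter for an evening costs several tens of dollars, and for another a babysitter is not family; for a small child, suddenly spending an evening with a stranger is emotionally hard. Even on our trip to Las Vegas, one of us had to stay in the room with Zheng Yijia in the evening while the other went to see a show.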

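So without parents here to help, a couple has much less time alone together, and after a while you do start to feel that you spend all your time either working or revolving around the child. Zhu Fenglin and I foresaw and recognized this problem early on and came up with some ways to make up for it. For example, we agreed that once a month we would both skip work for an afternoon to have a nice lunch and see a movie. Zheng Yijia’s preschool, Bright Horizons, also has a Parents Night Out program: every two or three months it picks a Saturday when, from four in the afternoon to ten at night, parents can leave their kids there; they give the kids dinner and play with them. Because it is the preschool the kids go to every day and the teachers are familiar faces, nothing feels strange. We think this program is quite good.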

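For a two-income family with no grandparents to help, weekday dinner is a challenge. Zhu Fenglin and I eventually worked out an approach that works well for our family.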

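First, our rice cooker has a timer. Before leaving in the morning we put in the rice and water and set it to start cooking at 17:30, so there is freshly cooked hot rice when we get home. This is better than cooking the rice in the morning, which would then sit on keep-warm all day and not taste good by dinner. And if we waited until we got home to start the rice, dinner would be very late. So a rice cooker with a timer is a must-have for a two-income family. Another must-have is a slow cooker, which can stew beef, lamb, chicken and so on. For one thing, slow-cooked food is healthier than pan-fried or stir-fried food, and there is less cooking smoke; for another, a slow cooker gets the meat tender. Otherwise, if we wanted beef and potato stew and only started after getting home, either the meat would not be tender or we would wait a long time for dinner.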

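Besides using the timer rice cooker and the slow cooker, we have also optimized our menu so that we can eat soon after getting home: it is built around dishes that are easy to prepare and cook, while still tasting good enough that Zheng Yijia has an appetite and eats more. We cook fish a lot. Fish is easy. Before leaving in the morning we usually take a bream or a bag of hairtail out of the freezer and put it in the upper refrigerator compartment to thaw. That way we don’t have to thaw it after getting home in the evening, which would either take a long time or require the microwave. We both feel the microwave should be used as little as possible. With the fish thawed in the fridge, the first thing we do when we get home is put the fish on to steam, and then we deal with the vegetables. Steaming fish doesn’t need much attention. Pan-fried and stir-fried dishes burn if left too long. Steaming a little too long is not a big problem: as long as there is enough water in the pot that it doesn’t boil dry, the fish won’t burn.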

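These days, of the five workdays each week, we generally eat at home on four and eat out on Friday. We basically don’t buy takeout. We just don’t feel comfortable with takeout food. It’s not that we worry about food safety; food safety in the U.S. is on the whole somewhat better than in China. But restaurant food is heavy on oil and salt and not healthy, so the less of it the better. Zhu Fenglin and I think alike on this. We both love food that tastes great but isn’t very healthy, such as grilled lamb skewers, crayfish, Sichuan boiled fish, red-braised pork, hot pot, yanduxian, clam chowder, and steak. But each person has only so much quota for unhealthy food; go over the quota and you get high blood pressure, high blood sugar and high cholesterol, and your health suffers. So we feel that spending the quota on takeout dinners isn’t a good deal. Now we cook dinner ourselves. We buy organic beef from Whole Foods (we basically don’t eat pork), our eggs are organic whether we buy them at Whole Foods, QFC or Costco, our vegetables are organic where possible, and we cook with little salt, little oil and little high heat. In the last two years all the indicators in Zhu Fenglin’s and my checkups have been normal, and cooking healthy food ourselves is one of the reasons.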

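In fact, not having grandparents here to help is a bit more tiring, but it also spares us some other troubles. I often hear from people around me, and Zhu Fenglin often tells me about reading on Huaren or mitbbs, that grandparents who come to help with the kids clash with the parents over parenting ideas. In some other families with grandparents helping, the grandparents spoil the child and the child picks up bad habits, such as having a grandparent chase after them to feed them. This isn’t unique to Chinese families in the U.S. In China, families who live in the same city as the grandparents and get frequent help from them have similar problems. For our family there is a bit of a silver lining: since none of the grandparents can come, we don’t have these troubles.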

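Because we had no grandparents to help, Zheng Yijia also started preschool quite early, at a year and a half. Many families with grandparents helping wait until two or three. I remember seeing a research report saying that starting preschool between the ages of one and two is best for a child’s social skills and mental development. Kids sent later may miss many chances to learn how to interact with other kids. Starting later also makes the first few weeks of adjustment harder.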

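A two-income family with no grandparents to help is pretty tiring. But quite a few Chinese families here in the U.S. have only one person working. One income fewer is tiring too. The luckiest, of course, have two incomes and grandparents who come to help as well. But life is like a card game: the hand you are dealt may be good or bad, and if you can no longer swap cards, then put your mind to playing the hand you hold as well as you can.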

After Automation Ate Testing

Huseyin Dursun, my previous manager, recently wrote a post, “Automation eats everything …”, in which he pointed out that manual validation has been eliminated and technology companies are no longer hiring engineers exclusively for testing roles. That’s exactly what happened last year in my group, Microsoft Azure. We eliminated test and redefined dev, and now we only have software engineers, who write both product code and test code.

Now we have eliminated manual validation and all tests are automated. What’s next? My answer is: more automation. Here are a few areas where I see us replacing, now or in the future, other human work in engineering activities with software programs.

1. Automation of writing test automation

Today, test automation is written by engineers. In the future, test automation will be written by software programs. In other words, engineers will write the code that writes test automation. One technique to consider is model-based testing (MBT). The idea of MBT has existed for nearly two decades, and some companies (including teams in Microsoft, my own teams among them) have tried it and had some success. But by and large it’s very under-used, mainly because other things aren’t there yet, like the scale, the demand, the maturity of other engineering activities[1], the people, etc.

Another direction that people have been pursuing for at least a decade is traffic bifurcation. The idea is to run the test instance as a shadow copy of the production instance, duplicate the production traffic to the shadow copy, and see whether it handles the traffic the same way the production copy does. The bifurcation can be real time, or done in more of a record-and-replay fashion. Twitter’s Diffy is the latest work that I have seen in this direction. I guess there is a long way to go, especially when the SUT is highly stateful and its state has strong dependencies on the state of other downstream systems.
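To make the bifurcation idea concrete, here is a minimal record-and-replay sketch in Python. It is not how Diffy works; the function name bifurcate and the two handler parameters are hypothetical, and it assumes stateless handlers whose responses can be compared with ==, which is exactly the assumption that breaks down for a stateful SUT.

```python
def bifurcate(requests, production_handler, shadow_handler):
    """Replay each request against both copies and collect the differences.

    Returns a list of (request, production_response, shadow_result) tuples
    for every request the shadow copy handled differently.
    """
    mismatches = []
    for request in requests:
        expected = production_handler(request)
        try:
            actual = shadow_handler(request)
        except Exception as error:
            # A crash in the shadow copy counts as a difference, not a test abort.
            mismatches.append((request, expected, repr(error)))
            continue
        if actual != expected:
            mismatches.append((request, expected, actual))
    return mismatches
```

An empty result means the shadow copy matched production on the replayed traffic; it says nothing about requests that were not in the recording.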

2. Behavioral contract enforcement

Using contracts to define system boundaries and implementing against contracts is now very common. However, our contracts are mostly about the data schema: the API signature, the structure of the JSON objects in the input parameters and response bodies, the RESTful API URL, the WSDL for XML Web Services, file formats, response codes and error codes, … These contracts don’t carry much information about behaviors: how the entity will transition through its state machine, whether an operation is idempotent, whether I must call connection.Open() before doing anything else with the connection, etc. Behaviors related to time are a particular gap. For example: this asynchronous operation is supposed to complete within N minutes; the system will perform this recurring operation every X days; …

Today behavioral contracts are mostly written (if written at all) in natural language in design specifications. The enforcement of such behavioral contracts is done in automated test cases. But there can be some fatal gaps in today’s way. Natural language is ambiguous. Test cases may not cover 100% of what’s written in and implied by the design specification. A more fundamental challenge is that the intention of the automated test cases may drift as time goes by, meaning: our test automation code used to be able to catch a certain code bug, but after test code changes and refactoring, one day it can no longer catch the same bug. I don’t think we have a good way to detect and prevent such drift.

I believe the direction is to write the behavioral contract in a formal language, such as the TLA+ specification language created by Leslie Lamport. In a presentation last year, he explained how TLA+ works and how it’s used in some real work. It seems pretty intriguing.
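To show what a machine-checkable behavioral contract could look like at the smallest scale, here is a toy sketch in Python. It is not TLA+ and is far less expressive; it only encodes the "must call Open() first" rule as an explicit state machine and checks a recorded call trace against it. The table ALLOWED, the state and call names, and the function first_violation are all made up for illustration.

```python
# Contract: state -> {call name -> next state}. Any call not listed is a violation.
ALLOWED = {
    "closed": {"open": "opened"},
    "opened": {"query": "opened", "close": "closed"},
}


def first_violation(calls, start="closed"):
    """Return the index of the first call that breaks the contract, or None."""
    state = start
    for index, call in enumerate(calls):
        next_state = ALLOWED[state].get(call)
        if next_state is None:
            return index
        state = next_state
    return None
```

Because the contract is data rather than prose, the same table could drive both a checker like this and generated test cases, which is one way to keep the tests from drifting away from the specification.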

3. Automation of the analysis

In my previous team, as we made the automated tests faster, we found that the long pole became the time humans spent making sense of the test results. So we developed some algorithms and tools to help us: 1) tell whether a failure is a new regression or just a flaky test, 2) identify which failed tests are likely to share the same root cause. That was very helpful. In addition, our plan was to get rid of signoffs entirely and let software programs make the call most of the time.
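The algorithms we used are not described here, so the following Python sketch only illustrates the two questions with deliberately naive heuristics that are assumptions of this example: a failure that passes on a rerun against the same build is treated as flaky, and failures whose error messages are identical after digits are masked are grouped as likely sharing a root cause. All names are hypothetical.

```python
import re
from collections import defaultdict


def classify_failure(rerun_passed):
    """rerun_passed: list of booleans from rerunning the failed test on the same build."""
    # Any pass on an unchanged build suggests the test, not the build, is unstable.
    return "flaky" if any(rerun_passed) else "likely regression"


def signature(message):
    """Mask numbers so messages that differ only in ids or durations look the same."""
    return re.sub(r"\d+", "N", message)


def group_by_signature(failures):
    """failures: dict of test name -> error message. Returns signature -> test names."""
    groups = defaultdict(list)
    for test_name, message in failures.items():
        groups[signature(message)].append(test_name)
    return dict(groups)
```

A real tool would need much more, such as history across builds and stack trace comparison, but even this level of grouping turns a list of failures into a shorter list of things to investigate.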

4. Automation of the workflow

Ideally, once my code has left my desktop, the entire desktop-to-production journey should be led by software programs with no human participation (except for intervention/override). Today some companies are closer to that dream (e.g. Netflix’s Spinnaker) and others are farther away. Some smaller or simpler products may have already achieved it, but it remains challenging for complex products. Today CI/CD is a lot more common in the software industry than it was ten years ago. But in my eyes today’s CI/CD tools and practices are more like the DHTML and AJAX things circa the early 2000s. The jQuery/Bootstrap equivalent in CI/CD has yet to come.


5. Integration test in production

Besides replacing more human work with software programs, there is one more thing that we can do better in test engineering: eliminate the test environment per se and perform all integration tests in production[2]. Integration testing is an inevitable[3] phase between passing unit tests and getting exposed to real customers in production. Traditionally in integration tests, the SUT and most of its dependencies run in a lab that is physically separated from the production instances. There are several big pain points in that approach: a) fidelity[5], b) capacity, c) stability, d) support[6]. Doing integration tests in production makes all these problems disappear. Needless to say, there are some challenges in this, mainly regarding product architecture, security and compliance, isolation and protection, differentiation and equality, monitoring and alerting, etc. I guess next time I will write a post about “The Design Pattern of Integration Testing in Production”.


[1] For example, a team should invest in other more fundamental things like CI/CD before investing in building the model and doing MBT.
[2] “Testing in production” is a highly overloaded term. Some people use it to refer to A/B testing. Sometimes it means a late-stage quality gate where the new version is rolled out to a small % of production and/or exposed to a small % of customers. “Integration test in production” differs in two ways: i) it’s for low-quality code that is still under development, ii) it doesn’t get exposed to customers.
[3] There are some strong opinions against integration tests. Lines like “integration test is a scam” help highlight some valid points. But practically we shouldn’t throw the baby out with the bath water. I am a strong believer in “pushing to the left” (meaning: put more tests in unit tests and find issues earlier), but I also believe integration tests have their place in the outer loop[4]. Even though in hindsight it might be very obvious that some bugs could have been caught by unit tests, it was a totally different thing when these bugs were unknown unknowns.
[4] Outer Loop is defined as the stage between when an engineer has completed their check in and when it has rolled out to production. Depending on the product, this could mean App Store deployments (Mobile) or worldwide exposure (Services and modern Click to Run applications).
[5] Lab is different than production in many ways: configurations, security settings, networking, data pattern, etc. Those differences often hide bugs. Lab doesn’t have all the hardware SKUs that production has, which significantly limits how much we can do in the lab in hardware related testing (e.g. drivers, I/O performance, etc.).
[6] Let’s say the SUT depends on another service, Foo. Traditionally in the integration test we also have Foo instance(s) running in the lab. When the lab instance(s) of Foo have any issue, the SUT team needs the Foo team to help check/fix. But that is a lower priority for the Foo team than issues in the live site (production). Plus, the SLA (service level agreement) for lab instances is usually less than 24×7, but we want our integration tests to run all the time.

The Combined Engineering in Azure: A Year Later

Last year in Windows Azure[1], we merged dev and test[2] and switched to the combined engineering model[3].

Recently I have been asked quite a few times about my view of that change. My answer is: it solved a few chronic problems in the traditional dev+test model. It solved these problems fairly easily and naturally. If we hadn’t made the combined engineering change, these problems would still be here today:

1. Quality is everyone’s responsibility

We always said: quality is owned by everybody, not just the test team. In reality, there were always some gaps. Some developers still had the mentality of “the test team would/should find the bug for me”. Now there is no test team. Software engineers can count on nobody but themselves.

2. Improve testability

Although nobody disagreed with the importance of designing for testability, testability was often treated as a relatively low priority by developers in the traditional dev+test model. Under time pressure, they naturally got the feature implemented first, and it took a long time for some testability requirements to be honored. Worse, the developers didn’t have testability in mind when they designed and wrote code. Quite a few testability issues were found at a pretty late stage, when it was too costly or risky to change the design and code.

Now writing test code is part of the software engineer’s job. They have a much stronger incentive to improve testability because it makes their own work easier. Plus, they truly learn the lessons of poor testability design because it hurts them directly.

No more begging developers to add an API that my test automation can poll in place of a hard-coded Sleep(10000).
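For readers who have not hit this problem, the difference between a fixed sleep and polling can be sketched as follows. This is Python rather than the C# our tests were written in, and the helper wait_until with its parameters is hypothetical; the condition callable stands in for whatever status API the product exposes.

```python
import time


def wait_until(condition, *, timeout_seconds, poll_interval_seconds,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns True or the timeout elapses.

    Returns True as soon as the condition holds, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_seconds)
```

Compared with a hard-coded ten-second sleep, the test continues as soon as the product reports it is ready, and on a slow day it fails with an explicit timeout instead of a confusing assertion further down.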

3. Push tests to the left

I had a hard time convincing some developers to write more unit tests. This is a true story: a dev in my team wrote a custom lock. I found that there were few unit tests for that lock. I asked the dev. He told me he thought the scenario tests[4] already covered it pretty well. I didn’t know what to say. Yes, we had code coverage data for unit tests. But the hall of shame can only go so far.

Now developers (software engineers) own all the tests. They have every incentive to push tests to the left[5]: put as many tests as possible in unit tests, because they are fast, easy to debug and nearly free of noise. The integration test is obviously a less favorable place to put a test: it’s slower, more of a hassle to debug, and noisier.

4. Hiring and retention

That was really, really, really a challenge all the time. Most college graduates prefer SDE to SDET[6]. Partly because they have had little exposure to what the SDET job is about. Partly because they are concerned about the “test” tag. Valid concern. Among industry candidates, many of those who came from a software testing background didn’t meet our requirements for coding and problem-solving skills, because in many places outside Microsoft, test engineers were mainly doing what the STE[7] used to do in Microsoft. We ended up having to put a lot of effort into convincing developers from other companies to join Microsoft as SDETs, which wasn’t an easy sell.

Now, voila, problem solved. There is no more “test” tag. Everyone is a “Software Engineer”. No SDET wants to switch to SDE to get rid of the “test” tag any more, because there are no more SDETs.

5. Planning and resourcing

We used to do our planning based on the dev estimate only. It was understandable: it’s much messier to juggle if every work item has two prices (dev estimate and test estimate). In planning, we assumed that for every work item the test estimate was proportional to the dev estimate (e.g. 1:2, which came from our overall test:dev ratio), and we believed the variances in individual work items would average out. It worked OK most of the time. But there were several times when this model left testing significantly under-funded and caused a crunch late in the project.

Now when engineering managers and software engineers provide work estimate, the price tag has already included both dev estimate and test estimate. Nobody would underestimate the test cost because they would have to pay for it anyway.


To summarize, that’s the power of the roles and responsibilities model. In the past, I was the cook at our home and my wife usually did the cleanup. She always complained that I made the stove and countertop very messy. Later we made a change: I do both the cooking and the cleanup (and she took some other housework from me). Then all of a sudden I paid a lot of attention to not making the kitchen messy, because otherwise I would be the one spending time cleaning it up.

p.s. Of course this change also has its downsides. That would be another topic. But the net is a big plus.


[1] I know I should have called it “Microsoft Azure” rather than “Windows Azure”. It’s just the old habit. Those of us who joined Azure in its early years still call it Windows Azure.
[2] Before the merge, we had a dev team and a test team. Take myself as an example. I was the test manager leading the test team, partnering with the dev manager who led the dev team. My test team was about half the size of the dev team. In the shift to the combined engineering model, we simply merged and became one engineering team of 70+ people.
[3] Strictly speaking, our shift to combined engineering did not only merge dev and test. It also redefined the role of PM, which now leans toward the market, customers and competition more than internal engineering activities, and enlarged the new “software engineer” role (which started as the sum of the original dev+test) by adding more DevOps responsibilities.
[4] We didn’t differentiate these terms: scenario test, functional test, e2e test, integration test. Our devs did help write quite a few functional/scenario tests when the test team was running tight. But by and large, the test team owned everything after unit tests.
[5] We usually draw a timeline on the whiteboard, from left to right: the developer changes code in his local repo -> unit test -> other pre-checkin tests -> checkin -> integration tests -> start production rollout -> rollout completed. So “push tests to the left” means pushing them into unit tests.
[6] SDE = Software Development Engineer. SDET = Software Development Engineer in Test (aka “tester”).
[7] STE = Software Test Engineer. Microsoft had this job title until 2005/2006. An STE’s main job responsibilities were writing test specs, enumerating test cases, executing test cases (mainly manually), exploratory testing, etc. Many STEs had very good analytical skills, good knowledge of our product and good soft skills, but were relatively weak in coding, debugging, design, etc.

There Is No A/B Testing in Life

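A/B testing is a technique commonly used by internet companies to help choose between two alternatives. For example: is this button better on the left or on the right, is an 11-point font better than a 12-point one, and so on. To run an A/B test, the product team randomly picks a small portion of users, say 10% of all users, then puts half of them (i.e. 5%) into group A and the other half into group B. When these users open the website or the app, group A sees the button on the left and group B sees it on the right. Then the product team looks at the data. Because these users were picked at random, if group A buys more and stays longer, then option A is better. Otherwise, option B is better.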
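The random split described above can be sketched in a few lines of C#. This is only an illustration: the names Bucket, AbAssignment and Assign are made up for this example, user IDs are assumed to be unique, and a production system would more likely bucket users by hashing their IDs than by shuffling a list in memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names, for illustration only.
public enum Bucket { NotInExperiment, A, B }

public static class AbAssignment
{
    // Randomly picks samplePercent% of the users, then puts half of the
    // sample into group A and the other half into group B.
    // Assumes userIds contains no duplicates.
    public static Dictionary<string, Bucket> Assign(
        IReadOnlyList<string> userIds, int samplePercent, Random rng)
    {
        var shuffled = new List<string>(userIds);

        // Fisher-Yates shuffle, so that every user is equally likely
        // to land in any position.
        for (int i = shuffled.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            string tmp = shuffled[i];
            shuffled[i] = shuffled[j];
            shuffled[j] = tmp;
        }

        int sampleSize = shuffled.Count * samplePercent / 100;
        int half = sampleSize / 2;

        var result = new Dictionary<string, Bucket>();
        for (int i = 0; i < shuffled.Count; i++)
        {
            Bucket bucket = i < half
                ? Bucket.A
                : (i < sampleSize ? Bucket.B : Bucket.NotInExperiment);
            result[shuffled[i]] = bucket;
        }
        return result;
    }
}
```

With 1,000 users and samplePercent set to 10, the sketch puts 50 users in group A, 50 in group B and leaves the remaining 900 outside the experiment. Which users end up in which group depends on the Random instance passed in.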

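Unfortunately, there is no way to do A/B testing on the big decisions in life. For example: which is better, China or the United States? Should I emigrate?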

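In theory, such an A/B test could be done. We could randomly find 100,000 people and randomly split them into two groups. The 50,000 people in group A stay in China, the 50,000 people in group B emigrate to the United States, and ten years later we survey how the people in each group are doing. Leave aside, for now, the difficulty of arranging for 50,000 randomly picked people to emigrate to the United States. Even if all of that could be sorted out, when we do the survey ten years later, how do we measure “how they are doing”? Career, family, money, health, happiness: all of these have to be considered, but how much weight should each one get?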

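And even if the results finally came out, by then China would no longer be the China of ten years earlier, and the United States would no longer be the United States of ten years earlier. The results of an experiment over the past ten years would have little reference value for the next ten.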

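Since A/B testing can’t be counted on, we can only rely on less objective, less scientific material for reference: listening to what other people say and reading what they have written about their experience. What any single person says or writes is, to a greater or lesser degree, a blind man feeling the elephant. So it is necessary to listen to many sides. But a great many of the articles out there just pass along hearsay. And many of those who write about their first-hand experience can hardly avoid falling into the trap of “distance lends enchantment”. Those that mix narrative with commentary often turn into the fable of “The Little Horse Crosses the River”: the calf says the river is shallow, the lamb says the river is deep. In fact the river is just as deep as it is. Whether it is too deep or too shallow depends on who is crossing it. Only by wading into the river itself can the little horse know whether the water is too deep or too shallow for it. Chairman Mao said that to know the taste of a pear, you have to taste it yourself. But a small taste is not enough. Business trips, vacations and short stays are not enough to really know the taste of this pear.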

Having said all this, I don’t really know what it is that I want to say.

If You Pay Later, You Pay More

One of my previous managers used to tell us “You either pay now or pay later. If you pay later, you pay more”. Years have passed and I have seen how true it is for an engineering team.

The dilemma is: the one who chooses not to pay now may not be the same one who pays later. Why would I pay now, so that someone else wouldn’t pay more later? It’s natural that we make selfish choices, unless there is something to counterbalance them.

Here is an example, a real live site incident that happened recently. Our customer couldn’t start a virtual machine from the management portal. The cause was in the following code, which threw a NullReferenceException because roleInstance.Current was null:

foreach (RoleInstance roleInstance in this.RoleInstances)
{
    int currentUpdateDomain = (int)roleInstance.Current
                                               .Container
                                               .ServiceUpdateDomain;
    //...
}

When the developer pressed “.” after roleInstance.Current, he probably didn’t pause and ask himself: would Current always be non-null? He probably didn’t spend time reading the related code to find out, or add extra code there for safety (e.g. “if (roleInstance.Current != null)”). If he had done all of that (the pause, the code reading and the additional code), he would have been slower. But it would have saved so much more of the cost associated with the live site incident: people’s time spent investigating the incident, the time to roll out the hotfix, and the time to handle the (unhappy) customer. But that time is not the developer’s time. By cutting some corners, he probably got a few more work items done. Thus, he probably got a somewhat better performance review and was promoted a bit sooner. Then he moved on and left the team behind to “pay later but pay more”.
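For illustration, here is roughly what the “pay now” version of that loop could look like. It is a sketch, not the actual fix: the classes below are hypothetical stand-ins for the real types (only the member names Current, Container and ServiceUpdateDomain come from the snippet above), and skipping an instance with no current state is an assumption. Depending on what the caller needs, failing with a clear error message might be the better choice.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the real types.
public class Container
{
    public int ServiceUpdateDomain { get; set; }
}

public class RoleInstanceState
{
    public Container Container { get; set; }
}

public class RoleInstance
{
    public string Name { get; set; }
    public RoleInstanceState Current { get; set; } // may be null
}

public static class UpdateDomainReader
{
    // Collects the update domain of every role instance that has one,
    // instead of throwing NullReferenceException on the first instance
    // whose Current (or Container) is not populated.
    public static List<int> ReadUpdateDomains(IEnumerable<RoleInstance> roleInstances)
    {
        var updateDomains = new List<int>();
        foreach (RoleInstance roleInstance in roleInstances)
        {
            if (roleInstance.Current == null || roleInstance.Current.Container == null)
            {
                // Assumption: it is acceptable to skip this instance
                // and leave a trace for whoever investigates later.
                Console.Error.WriteLine(
                    "Skipping role instance '" + roleInstance.Name + "': no current state.");
                continue;
            }

            updateDomains.Add(roleInstance.Current.Container.ServiceUpdateDomain);
        }
        return updateDomains;
    }
}
```

The guard itself is two lines. The expensive part is the pause and the code reading needed to decide what should happen when Current is null, and that is exactly the cost that gets deferred to the rest of the team.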

Our performance review model doesn’t help, either. In the annual review cycle, we can barely hold people accountable for something they did more than a year ago. Once bonuses are paid and promotions are done, unless it’s something really bad (like causing the subprime crisis), we are not going to take the bonus back or revert the promotion.

Among the things that we can do, one thing I did was to keep my team members’ ownership unchanged[1] for a long time (e.g. two years, if not more) and to tell them so upfront. The benefits are:

  • By keeping people on the same thing for a longer time, the one who chooses not to pay now is more likely to be the same person who pays later (and pays more).
  • Telling them so upfront not only counterbalances shortsighted corner-cutting, but also encourages the right behavior and the investments in their areas that lead to long-term success. It’s like this: if I know I am going to live in this house for at least five years, I will spend the first year cleaning up the weeds and fixing the irrigation system in the backyard, then plant plum trees in the second year and keep fertilizing and taking good care of them in the third and fourth years, so that from the fifth year onward, I get to eat the sweet plums while enjoying the sun and breeze in my backyard.

That’s why re-orgs have a downside. In companies where a re-org happens every 18-24 months, the organization gets to optimize its structure and alignment more frequently, but it also sets a norm that discourages long-term investment and success: why bother planting the plum trees if I know I am going to move to another house every 18-24 months?

As Reid Hoffman said: “Good managers know that it’s difficult to achieve long-term success without obtaining long-term commitments from employees.”


[1] I usually did it in a mixed way: some fixed ownership + some flexibility to change projects once in a while.