They Are Not Tech Companies

I was listening to a podcast lately and they were talking about a tech startup, Wevorce, which disrupts the divorce market:

A system that works by attracting couples to the service, collecting data on them through an initial survey, and using their results to classify each person as a particular divorce “archetype.”

Then, the Wevorce team of counselors, family planners, and lawyers steps in. They use their research, data, and training to mediate at predictable moments of tension — a processing system kind of like TurboTax or H&R Block. 

How is that a tech company? What is the tech here? Is filling out an online survey considered “using technology”? To me, that is a law company. A law startup. Not a tech startup. I fill out a survey form when I visit a physical therapist for the first time. If that form is done online and they have an algorithm to analyze my profile and recommend the best therapist and treatment plan, is the hospital considered a tech company? Of course not.

To me, tech companies are those that advance technology and make innovations in technology. If a company innovates in another trade with the help of the latest technologies, it’s not a tech company. For example, Blue Apron is not a tech company. They are a meal kit company. It is still a great startup, a great business innovation. I am a customer and I like it.

For the same reason, Instacart, of which I am a customer too, is not a tech company either. They do provide a new experience of buying groceries. But at the end of the day, they are a grocery store. An online grocery store. Putting a storefront online and providing an app for customers to place orders doesn’t make it a tech company. ToysRUs sells toys online, but no one calls ToysRUs a tech company.

They are not tech companies also because technology is not the key ingredient in founding those companies and making them successful businesses. A tech person (like me) doesn’t have the know-how in those business sectors. Instacart? Maybe OK. But definitely not Wevorce or Blue Apron. Wevorce was founded by a family lawyer and Blue Apron was started by a chef and a VC.

In these cases, technology (mobile, data, etc.) is more of an enabler and catalyst. Technology can give these companies an edge over the disruptees in the trade. But if they don’t get the core of their trade right, technology won’t matter. If the spinach in Blue Apron’s big box had already wilted when it arrived at my doorstep, or if the recipes tasted not much different from average family meals, they would not have been successful.

Don’t get me wrong. Instacart and Blue Apron are still awesome business innovations. Just don’t call them tech companies anymore.

My Four-Buckets Engineering Velocity Model

When it comes to looking into bottlenecks and improvement opportunities in the engineering velocity area, I use a four-buckets model based on how long a task takes (a small illustrative sketch follows the list):

  1. Instant. This is something that only takes a few seconds to half a minute. Tasks like running a small set of unit tests, compiling a sub-folder or doing a “git pull” for the first time in several days are in this bucket. While waiting for such tasks to finish, I don’t leave my desk. I catch up on some quick conversations on IM, take a peek at my cellphone or reply to an email while waiting.
  2. Coffee break. A coffee break task takes a few minutes, such as applying my private bits to a one-box test instance, running a large set of unit tests, etc. Sometimes I go for a coffee or use the restroom while such tasks are running.
  3. Lunch break. When a task takes longer, such as half an hour or an hour or more, I will grab lunch while it’s running. Sometimes I start the task when I leave the office to pick up my boy and check the result when I get home.
  4. Overnight. Such a task takes quite a few hours, up to about half a day. So we have to run it overnight: usually start it at night, go to sleep, and check the result when we wake up the next morning. If it’s started in the morning, we are probably not going to see the outcome until the evening.
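
For concreteness, here is a tiny illustrative sketch of the buckets as code. The thresholds are just my rough reading of the descriptions above, not hard rules:

    using System;

    public static class VelocityBuckets
    {
        public enum Bucket { Instant, CoffeeBreak, LunchBreak, Overnight }

        // Map a task's duration to one of the four buckets.
        public static Bucket Classify(TimeSpan taskDuration)
        {
            if (taskDuration <= TimeSpan.FromSeconds(30)) return Bucket.Instant;     // seconds to half a minute
            if (taskDuration <= TimeSpan.FromMinutes(10)) return Bucket.CoffeeBreak; // a few minutes
            if (taskDuration <= TimeSpan.FromHours(2))    return Bucket.LunchBreak;  // half an hour to an hour or two
            return Bucket.Overnight;                                                 // several hours, up to half a day
        }
    }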

Over the years, I have learned a few things about this four-buckets model:

  • A task’s duration will slowly deteriorate within the same bucket without being noticed, until it’s about to fall into the next bucket. For example, the build time of a code base may be 10 minutes in the beginning, which puts it in the coffee break bucket. It can get slower over the course of the next several months, becoming 15 minutes, 20 minutes, …, as more code is added. Few will notice it, or be serious about it, until the build time gets close to half an hour, at which point it is no longer a coffee break task but a lunch break task. People feel more motivated/obligated to fix things to keep a task in its current bucket than to prevent it from slowly deteriorating within the same bucket.
  • For maximum effect, when we make engineering investments in shortening a task’s duration, we should aim to move it into the next shorter bucket. Incremental improvements within the same bucket will have less impact on engineering velocity. For example, if an overnight task is shortened from 12 hours to 6 hours, it’s still an overnight task. But if it can be further shortened to 3 hours, that will transform the work style in the team: the team will be able to run the task multiple times during the day. It will dramatically change the pace of the team.
  • Incremental improvements within the same bucket are less likely to be sustained, due to the first observation above. It’s like Sisyphus rolling the stone uphill: unless the stone is rolled over the top of the hill, it will roll back down to where it started. To avoid such regression and frustration, our investment should be sufficient to move the task into the next shorter bucket; otherwise, don’t make the investment and put the time/money/energy somewhere else.
  • There is a big difference between the “Instant” bucket and the next two (the coffee break and lunch break tasks): whether I have a context switch. For tasks in the instant bucket, there is little or no context switch. I don’t leave my desk. I remember what I wanted to do. I’m not multi-tasking. Once a task becomes longer and gets into the coffee break bucket, my productivity drops a notch. I have a context switch. I have to multi-task. We should try really hard to prevent tasks in the “Instant” bucket from getting slower and dropping into the coffee break bucket, to avoid the context switches and multi-tasking.
  • Similar to the previous point, there is also a big difference between the coffee/lunch break buckets and the overnight bucket. For tasks in the overnight bucket, I do something worse than a context switch: I sleep. It’s like closing the lid of a laptop. It takes much longer and more effort to get the full context back after a sleep than after a lunch break. We should try really hard to prevent any task from slipping into the overnight bucket. It’s about whether it’s same day or not. Same day matters a lot, especially psychologically: in the past, we didn’t really feel the difference between Prime’s two-day shipping and normal 3-5 day shipping; but when Prime offers same-day shipping, it feels substantially different.

Actually, there is a fifth bucket: “over the weekend”. Such a task takes more than a day to run. I didn’t include it in my four-buckets model because if an engineering team ever has one or more critical tasks in the over-the-weekend bucket, they are seriously sick. They are deep in engineering debt and they should stop doing anything else[1] and fix that problem first, to at least move the task into the overnight bucket. In a healthy engineering team, every task can be done over a lunch break or sooner. Everything is same day. There is no overnight task[2]. That’s the turnaround time required to deliver innovations and customer value in the modern world.


[1] I am just exaggerating to highlight the point.
[2] With reasonable exceptions, such as some long-haul tests. Though many long-haul tests that I have seen could be replaced by shorter tests with certain testability designs.

Promotion Is Not a Birthday Gift

This week, people at Microsoft are getting their annual review results: how much annual bonus and stock award they are getting, how much the merit increase is, and whether they are getting a promotion.

Here is a true story that I just heard today. A friend of mine, Sam[1], told his manager this Monday that he is leaving Microsoft to join another tech company in the region. At the same time, his manager delivered his annual review result to him. Surprisingly, he got a promotion. Although Sam believed he had deserved and been qualified for a promotion since early this year, for various valid reasons he thought his chances would be slim this time. So he started to prepare for interviews a few months back, talked to a few companies and got an offer. The new job pays significantly more than what he makes at Microsoft, so the promotion probably won’t change anything. Not much of a loss for Sam.

But it’s a loss to his manager and Microsoft:

  1. The promo is kind of wasted[2]. It could have been given to someone else.
  2. Microsoft has lost a good engineer and there is a cost to replace him.

This is why I never surprise my team members when it comes to promotions. A promotion is not a birthday gift. “No surprise” is my rule of thumb in people management and other business scenarios, not even a good surprise like a promotion. I always tell my people very early that I am working on a promotion for them, and then I keep them updated. A typical timeline looks like this:

  • May 1: “I’ve written a promo for you. Please take a look and let me know what I have missed”
  • May 15: “I have submitted the promo justification”
  • May 29: “I have presented it in the calibration meeting and there wasn’t much push back”
  • June 10: “The promo seems to be a done deal”
  • June 16: “The promo is OK at VP level”
  • July 5: “I haven’t heard any change to your promo”
  • Aug 15: “Here is your annual review result. Congratulations on your promo!”

If Sam had known back in May/June that he was getting the promo, he likely would not have started looking outside, would not have gotten this offer that Microsoft has no way to match, and would have stayed. Microsoft would have kept this talent.


[1] The name is made up.
[2] This statement is overly simplified. Do not misinterpret.

7 Things We Did Right in a Successful Data Migration Project

Someone was asking on Quora how to manage the migration of data when there is a database schema change. I shared how we did it in a real data migration project back in 2006/2007. It was a payment system (similar to today’s stripe.com, but ours wasn’t for the public) that ran on .NET XML Web Service + SQL Server. Much simplified for ease of writing:

  • It had a Subscriptions database, which contained the payment_instruments table, where we stored encrypted credit card numbers.
  • Having subscription_id on the payment_instruments table implied the assumption that every payment instrument must belong to one and only one subscription.

[Figure: the schema before the migration, with the payment_instruments table in the Subscriptions database]

Now we wanted to support standalone payment instruments, which don’t belong to any subscription. So we needed to migrate the payment instrument data into a new payment_methods table in a new Payments database:

[Figure: the schema after the migration, with the new payment_methods table in the Payments database]

It was a very successful data migration project. We did quite a few things right, which I will repeat in any future data migration project:

  1. We kept the old payment_instruments table. We added a new payment_method_id field to it, so that the payment_instruments table acts as a proxy. The benefit: we could keep most of the legacy code untouched and let it continue to consume the payment_instruments table. We just needed to change the data access layer a bit, to back-fill the encrypted credit card number from the new payment_methods table whenever legacy code queried the payment_instruments table.
  2. We added a payment_method_migration_state field to the old payment_instruments table. This field indicates whether the old or the new table is the source of truth. We used an explicit field as the indicator, rather than an inferred value (for example, looking at whether the encrypted_credit_card_number field is null in the old payment_instruments table), because an explicit, dedicated indicator of migration status is much less confusing than an inferred one, which is usually more error-prone because it gives something already in use a new meaning (on top of the original one). The explicit indicator also serves a bit like a lock: while a migration is in progress, some update operations should be blocked.
  3. We used both online and offline migration (see the sketch after this list). Online migration: any time a mutation API was called on a payment instrument, such as UpdatePaymentInstrument or PurchaseOffering (with a certain payment instrument), the migration code was triggered and ran in the Web frontend: it inserted a row into the payment_methods table, copied over the encrypted_credit_card_number value, back-filled payment_method_id in the old table and set payment_method_migration_state. Offline migration: we had a standalone tool running in our datacenter, which went through the existing payment instruments and migrated them one by one. The reason we had offline migration on top of online migration was that some customers used our system only infrequently, such as once every three months. We didn’t want to wait three months to migrate their data.
  4. Controlled migration at the per-customer level. We designed it so that we could select a batch of customers to be eligible for the migration (both online and offline). That way, we could start with a very small number (say 100 customers), and expand to 1,000, 10,000, 10% of the system, then all. We did find some critical bugs during the first several small batches.
  5. Due to compliance requirements, we could not keep the encrypted_credit_card_number data in the old table. But we didn’t do the deletion until the entire migration was completely done. That’s because if anything had gone seriously wrong, we would still have had a chance (even if just in theory) to go back to the old schema. In fact, we did have a bug that messed up data (putting encrypted_credit_card_number on the wrong payment_method_id), and having kept the old data allowed us to redo the migration correctly. It saved the day.
  6. We made the two new fields on the old payment_instruments table nullable, rather than giving them a default value, to prevent data pages from being rearranged for the existing rows (nearly hundreds of millions of them). For the same reason, when we removed the encrypted_credit_card_number data from the old table, we didn’t delete it but set it to an all-spaces string of equal width to the original encrypted blob.
  7. During testing, we modified the deployment script to be able to deploy the old and new versions of the frontend side by side, because the AddPaymentInstrument API in the new version always puts data in the new schema, and our test automation needed the ability to create data in the old schema in order to test the migration code. This ability is not only useful in data migration projects; it’s generally useful for online services: it’s always good to know whether data created by older versions can be correctly handled by the new version.
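
To make items 2 and 3 concrete, here is a highly simplified, hypothetical sketch of the online migration path. All type, field and method names are made up for illustration; the real code was part of the Web frontend:

    // Sketch of the online migration (item 3), using the explicit migration-state
    // field (item 2) as the source-of-truth indicator and a lightweight lock.
    public enum MigrationState { NotMigrated, InProgress, Migrated }

    public class PaymentInstrument
    {
        public string Id;
        public string EncryptedCreditCardNumber;
        public string PaymentMethodId;                        // back-filled during migration
        public MigrationState PaymentMethodMigrationState;
    }

    public interface IPaymentsStore                           // new Payments database
    {
        string InsertPaymentMethod(string encryptedCardNumber);
    }

    public interface ISubscriptionsStore                      // old Subscriptions database
    {
        void SetMigrationState(string paymentInstrumentId, MigrationState state);
        void BackFillPaymentMethodId(string paymentInstrumentId, string paymentMethodId);
    }

    public class OnlineMigrator
    {
        private readonly IPaymentsStore paymentsDb;
        private readonly ISubscriptionsStore subscriptionsDb;

        public OnlineMigrator(IPaymentsStore payments, ISubscriptionsStore subscriptions)
        {
            paymentsDb = payments;
            subscriptionsDb = subscriptions;
        }

        // Called from mutation APIs (UpdatePaymentInstrument, PurchaseOffering, ...)
        // before they do their real work.
        public void EnsureMigrated(PaymentInstrument pi)
        {
            if (pi.PaymentMethodMigrationState == MigrationState.Migrated)
            {
                return; // the new payment_methods table is already the source of truth
            }

            // Block concurrent updates while the copy is in flight.
            subscriptionsDb.SetMigrationState(pi.Id, MigrationState.InProgress);

            // Copy the encrypted card number into the new payment_methods table.
            string paymentMethodId = paymentsDb.InsertPaymentMethod(pi.EncryptedCreditCardNumber);

            // Back-fill the proxy column on the old row and flip the state flag,
            // making the new table authoritative from now on.
            subscriptionsDb.BackFillPaymentMethodId(pi.Id, paymentMethodId);
            subscriptionsDb.SetMigrationState(pi.Id, MigrationState.Migrated);
        }
    }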

The above 7 things will be applicable to future data migration projects that I do. #6 (preventing data pages from being rearranged) may be specific to SQL Server, but its spirit is widely applicable: understand the underlying implementation of the database system well, to minimize the performance hit when migrating a non-trivial amount of data or touching a lot of rows.

Besides, two more takeaways of mine are:

  1. Have the right expectations. Data migration will be hard. After spending all the time on design, implementation and testing, the actual migration will also take a lot of time. In our project, we ran into weird data patterns in production that we never thought would be possible. They turned out to be the result of some old code that is now gone (either retired, or fixed as a bug). In production, we also discovered quite a few bugs in our migration code that were hard to find in the test environment. It took many iterations to discover one, fix it, test the fix, roll out the new bits, resume the migration and discover the next issue. It would be helpful if you could get a snapshot of full production data to test your migration code offline. But in some cases, due to security/privacy/compliance, the data to be migrated must not leave the production data center, and sanitizing it would defeat the purpose.
  2. Do not migrate the frontend and the database at the same time. If you must abandon both the old frontend (e.g. REST API, Web UI, etc.) and the old database, do it in two steps. First, do the data migration: keep the frontend unchanged for customers, and only change the frontend code under the hood to work with the new database. Second, build a new frontend on top of the new database. The two-step way surely sounds more costly, but in my experience (I have done it both ways in different projects), it counter-intuitively ends up more cost-efficient, less risky, more predictable and more under control.

Better off Financially Working at Startups or Large Companies?

A recent TechCrunch article said:

“Mathematically, talented individuals are certainly better off financially going into a profession or working at a large tech company, where pay is higher and more secure.”

I used to believe the same. However, in the last couple of years, after seeing real examples of people I know in person, I have become not so sure about that. Those examples include (with made-up names and genders):

  • Helen used to work at Microsoft. She joined a not-too-hot startup a couple of years ago. The worth of her equity doubled from under $500K to nearly $1M in less than a year.
  • Frank recently took an offer from a local startup in Seattle which offers a base salary more than 10% higher than what he earns at Microsoft, let alone the stock options.
  • Bob recently told me that he has an offer from a near-IPO company in the Puget Sound area with a $200K base salary, which equals the sum of the base salary and annual bonus he can get at Microsoft.

Financially, they all seem to be better off working for a startup than staying at Microsoft. So, is the TechCrunch article wrong (at least on the “higher” part)? To me, TechCrunch has pretty good credibility on tech startup matters. The author and editors must have a lot more data, visibility and network resources than I do. So they must have a fuller picture, and maybe my data set is too small.

How can I find the truth[1]? What about the Glassdoor model? I am not sure. Glassdoor comes reasonably close when it comes to finding the median of Microsoft salaries in the Seattle area. But unlike base salary, which is well defined and comparable across the board, the financial return of working at startups is far more complex.


[1] The reason I am seeking the truth about whether, mathematically, one is better off financially working at a large tech company than at a startup is just curiosity (“There are those who seek knowledge for the sake of knowledge; that is Curiosity.” Saint Bernard of Clairvaux). It’s not going to make me either more or less inclined toward a startup job. In fact, I had an offer from a late-stage startup not long ago. I didn’t take it, though.
[2] I found two posts interesting: Startup employees don’t earn more and Big company vs. startup work and pay

Acknowledge Our Lack of Empathy

I had a woman employee a few years ago. She wasn’t always available in the office. She told me it was because of her children: the boy was sick, the girl had to stay at home, she needed to pick them up because their dad couldn’t that day, she needed to prepare meals, etc. At that time, I didn’t have kids. I was married, though. I told her “I can understand”. But later, when I had my own child, I realized that I hadn’t understood her situation at all. People who don’t have children just don’t get the kids thing, no matter how sympathetic they are.

I had a woman manager reporting to me. She wasn’t married and had no children. She had an employee who got pregnant. I was having a chat with the manager about how to support a pregnant employee. She said “I can understand”. I told her “No, you don’t understand”. I knew how hard it was, because my wife had gone through a pregnancy just a year earlier (side note: it later turned out that pregnancy is a piece of cake compared to the first six months after the birth). The manager was a very nice person. It’s just that there is no way to understand what it is like to be pregnant unless you have been there.

I have been having some lower back problems in the last couple of weeks. It’s painful. It takes a lot of effort and time to put on socks and shoes or get into and out of my car, and I even hesitate to walk from my office to someone else’s just down the hallway. People in the office see that I am in pain and wish me a quick recovery. When I dropped a marker pen, they picked it up for me. I really appreciate their kindness and understanding. Now I think I really understand what it is like to have a lower back problem, and I will be truly empathetic in the future when a team member has one too.

Conclusion:

We must have the self-awareness that we don’t really understand a difficulty unless we have been there ourselves. In that case, maybe it’s better to acknowledge our lack of empathy. Rather than saying “I can understand”, we could say “I have never had a lower back problem myself, so I can’t feel your pain. But I am willing to help. Let me know what I should and should not do”.

Bowling Is the Worst Team Morale Idea

Over the years, I have taken my teams to, participated in, or heard of various kinds of team morale events, including: bowling, whirlyball, movies, scavenger hunts, boat chartering, go-kart, curling, indoor skiing, day of caring, iron chef, family fun center, pool/billiards, laser tag, karaoke, …

But not all of them make good team morale events. Here is how I define “good”:

1. Easier to blend people

A team morale event is a great opportunity for team members to get to know others they haven’t had the chance to work closely with before. However, people naturally tend to hang out with people they are familiar with, since that’s their comfort zone, and stay there. It’s more of a concern for software companies, because many engineers are introverted and socially passive. So a good morale event must make it easier and more natural for engineers to switch groups.

In bowling, each lane goes at a different pace. There is hardly ever a good moment when two lanes have both finished at the same time so people can swap. Plus, someone may hesitate to join the big boss’s lane, as they don’t want to be seen as an ass-kisser. Scavenger hunts aren’t good either: we get split into teams and stay with our own team throughout the whole hunt.

In whirlyball and curling, switching groups is less awkward and less likely to be over-interpreted. People can switch sides before/after a game.

In the day of caring event, introverted people can switch groups easily and naturally, too. People in the upstairs room shouted “we need someone to help move the furniture” and a big guy put down his yard work and came to help.

2. Not something that I can do myself

When I choose a morale event idea, I prefer things that I can’t do myself, for reasons like affordability, a required minimum number of people, etc.

For example, I can’t do these myself, and hence they are better choices for team morale events:

  • Curling: it takes 8 people (4v4)
  • Whirlyball: it’s usually 5v5 or 6v6
  • Scavenger Hunt: it’s best for 20-30 people
  • Go Kart: the more people the more fun. With 10 or 12 karts on the track, there is a lot of bumping, passing and laughing.
  • Boat chartering: a family’s budget can only afford a much smaller boat.

Bowling, movies and the family fun center are things I can do with my family on weekends. So they are less preferable as team morale events.

3. Doesn’t need a lot of practice

Pool and bowling aren’t good for a morale event because they take quite some practice to build the skills needed to truly enjoy the game. It’s no fun if my bowling ball always ends up in the gutter. Plus, between a newbie and someone who has played pool/bowling a lot, the skill gap is hard to close in an hour or two. Pool and bowling are only enjoyable when the players’ skills are nearly level.

Curling is better. Although perfecting curling also takes a lot of practice, few of us in software companies have played it much before: most of us are at beginner level. Same with go-kart: few of us are pro racers, and everybody can push the gas, brake and steer.

Day of Caring is even better. I learned useful skills in the volunteer work. Last time, my team was helping an assisted living facility. We cleaned up the yard and also the interior. I painted the walls, which I had never done before. I learned some tips from others and now I feel more confident about painting my own place (maybe starting with the garage).

4. Sense of achievement and shared memory

There was a very strong sense of achievement in the day of caring event. Before we came, the place was in a poor state: the paint had peeled, the weeds were tall, the walkway was muddy. We painted the walls, cut the weeds and bushes and paved the walkway with gravel. When we left, the place looked much nicer. Being able to make a dramatic change in a day feels good. Plus, it was something that “we did together”. Such memories create a good bond between team members.

5. Doesn’t make anyone feel intellectually inferior

Most of the morale event types that involve competition, including whirlyball, curling, go-karting, bowling, etc., are physical competitions.

But a scavenger hunt is more of an intellectual competition: you need to solve puzzles, think of ways to work around roadblocks, etc. Losing teams feel they are less smart and less good at problem solving. That hurts our egos, because we software engineers are intellectual workers. We are proud of our intellectual horsepower and problem-solving skills, which are what it took to pass the interview and get our jobs.

So I would avoid any morale event type that is an intellectual competition.

6. No individual ranking 

In events like go-kart, everybody gets a score. Your score is higher than my score. Mine is higher than his. Haven’t we had enough of that in the office already: everyone gets a number each year; your number is better than my number, so you get a bigger bonus than I do and you get a promo while I don’t. We are all sick of ranking in the office. Better not to have another ranking when we are having fun.

Conclusion

Based on the above six criteria, here is how these morale event ideas score (if ‘x’ is 0, ‘v’ is 1 and ‘+’ is 2). Not surprisingly, bowling is the worst.

[Image: how each morale event idea scores against the six criteria]

Make Breaking Changes Without API Versioning

Lots of REST APIs are versioned: the Microsoft Azure Service Management API, AWS EC2 API, Facebook Graph API, Twitter REST API, Stripe, etc. The motivation behind API versioning is obvious: to make it easier to introduce breaking changes without disrupting existing client code. As Facebook’s API doc says:

“The goal for having versioning is for developers building apps to be able to understand in advance when an API or SDK might change. They help with web development, but are critical with mobile development because a person using your app on their phone may take a long time to upgrade (or may never upgrade).”

However, API versioning can sometimes incur substantial engineering cost. Take the Microsoft Azure Service Management API as an example. Currently it has more than 20 versions, among which the first one, 2009-10-01, was released six years ago and is still in use. Customers’ expectation is that as long as they stick with the same version (e.g. 2009-10-01), their client code won’t ever need to change. Assume there are 1,000 test cases for the Service Management API; to deliver on that expectation, every time Azure upgrades the Service Management API, it theoretically has to run the 1,000 test cases 20+ times, once for each version from 2009-10-01 all the way to 2015-04-01! Having a retirement policy like Facebook’s helps a bit, but even in Facebook’s case, they still have to run all the test cases 5 times in theory (from v2.0 to v2.4).

The engineering cost problem doesn’t change much whether the multiple versions are served by the same piece of code/binary or by separate pieces:

  • You may implement the API so that only the outer layer is versioned and most of the core/backend code is version-less, hoping to reduce testing needs. But that actually increases the chance of accidentally changing the behavior of older API versions, since they share the same core/backend code.
  • Making the core/backend code versioned as well gives better isolation between versions, but there are a lot more code paths that need to be covered in testing.
  • Forking a maintenance branch for each version may sound appealing, since it mostly eliminates the need to run full-blown testing for v2.0, v2.1, v2.2 and v2.3 when you release v2.4, as those versions live in their own branches. On the other hand, applying a bug fix across multiple maintenance branches may not be as trivial as it sounds. Plus, when multiple versions of binaries run side by side, data compatibility and corruption problems become real.

I was a proponent of API versioning until I saw the cost. Then I went back to where we started: what are the other ways to make it easier to introduce breaking changes without requiring all client code to upgrade?

A policy-driven approach may not be a bad idea. Let’s try it on a couple of real examples of breaking changes (a code sketch follows the list):

  1. The Azure Service Management API introduced a breaking change in version 2013-03-01: “The AdminUsername element is now required”. Instead of introducing a new API version, we could add a policy “AdminUsername element must not be empty” and start with “Enforced = false” for every subscription. Once enforced, any API call with an empty AdminUsername will fail. Individual subscriptions can turn on the enforcement themselves in the management portal, or it will be forced on after two years (equivalent to Facebook’s two-year version retirement policy). Once it’s turned on, it can’t be turned off. During the two years, there might be other new features that require the “AdminUsername element must not be empty” policy to be turned on. It’s up to each subscription whether to delay the work of changing the code around AdminUsername at the cost of delaying the adoption of other new features, or to pay the cost of changing code now to get access to other new features sooner.
  2. The Facebook Graph API deprecated “GET /v2.4/{id}/links” in v2.4. As an alternative to introducing a new API version, we could add a policy “Deprecate /{id}/links”. It would be a per-App-ID policy that can be viewed/set in the dashboard. App owners would receive reminders as the deadline for turning on the policy approaches: 12 months, 6 months, 3 months, 4 weeks, 1 week, …
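
To make the idea concrete, here is a minimal, hypothetical sketch of what the first example could look like on the service side. All type and member names are made up for illustration; this is not how Azure actually implements it:

    using System;

    // Per-subscription policy flags, persisted with the subscription's metadata.
    public class SubscriptionPolicies
    {
        public bool EnforceAdminUsernameRequired { get; set; } // off by default
    }

    // Hypothetical request type, just enough to keep the sketch self-contained.
    public class DeploymentRequest
    {
        public string AdminUsername { get; set; }
    }

    public static class DeploymentRequestValidator
    {
        // Called on every API request: the old, lenient behavior stays available
        // until the subscription turns the policy on (or the grace period forces it on).
        public static void Validate(DeploymentRequest request, SubscriptionPolicies policies)
        {
            if (policies.EnforceAdminUsernameRequired &&
                string.IsNullOrEmpty(request.AdminUsername))
            {
                throw new InvalidOperationException(
                    "AdminUsername must not be empty: policy 'AdminUsernameRequired' " +
                    "is enforced for this subscription.");
            }
        }
    }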

The policy-driven approach has different characteristics than API versions when it comes to discovering breaking changes. In the API version way, when the client switches to v2.4 in its development environment, errors pop up right away in the integration tests if any hidden client code is still consuming APIs deprecated in v2.4. In the policy-driven way, the developer would need to use a different App ID, one in development mode, to turn the policy on and run the integration tests. I don’t see a fundamental difference between these two ways. They each have pros and cons. Advancing a single number (the API version) may be less work than flipping a number of policies, but policies give the client more granular control and flexibility.

At the end of the day, API versions may still be needed, but only for a major refactoring/redesign of the API, which would only happen once every couple of years. Maybe it’s more appropriate to call those “generations” rather than “versions”. Within the same generation, we can use the policy-driven approach for smaller breaking changes.

The HipChat Stress

Email stress has been widely acknowledged for some years. Email is a major source of pressure in the workplace, as people feel the obligation to respond quickly. While we are still searching for ways to cope with email stress, unfortunately, a newer and even worse source of pressure has emerged: the HipChat stress.

Here is a real story. I was having an in-person conversation with a lady recently, on a weekday afternoon. She works for another software company where HipChat is used as much as email, if not more. Our conversation was about an important matter, so I put aside all my electronic gadgets to focus on the conversation, as well as to show respect. The lady had her laptop open next to her while we were talking, as she said she wanted to stay online. Our conversation was interrupted several times because she noticed that someone was “at-ing” her on HipChat. Since it wasn’t an emergency (e.g. a live site incident), I asked her why she felt the obligation to respond right away. She said it was because it was on a group channel.

Later that day, I couldn’t stop wondering what the difference is between (a) a group channel on HipChat or Slack and (b) an email thread that has the whole team on it. It appeared to me that our brains equate a group channel in HipChat with a real team meeting that has everybody in the same room (by the way, such an illusion is a testament to HipChat and Slack successfully bringing the team closer together). In a real team meeting, of course we feel obligated to respond when our names are called. Hence we feel the same when being at-ed in a group channel in HipChat or Slack.

As instant messaging services like HipChat and Slack gain popularity at an unprecedented pace, I suspect the HipChat stress I observed in that lady will soon become very common and probably dwarf the email stress. A hundred years after Henry Ford installed the first moving assembly line in the world, HipChat and Slack are becoming the new assembly line for software engineers.

The Self-Link Nonsense in Azure DocumentDB

Lately I have been playing with Azure DocumentDB. It’s really good: it’s more truly a Database-as-a-Service than other hosted/managed NoSQL offerings; it supports many good features that not everyone supports, like the distinction between replace and update; and its price seems substantially lower than the others’ (MongoLab, RavenHQ, MongoHQ).

However, I am not very happy with DocumentDB’s client-side programming experience. Overall, the implementation of the data access layer on top of DocumentDB is probably the most bloated among all the document store databases I’ve used, including MongoDB, RavenDB and RethinkDB. In particular, there is one thing that is really annoying in Azure DocumentDB: the self-link. Due to the need for self-links in various places, the data access layer code for DocumentDB is fairly awkward.

What I mean by “bloated” is that to achieve the same thing (e.g. to implement the DeleteOrderByNumber method in the example below), I only need 1 line of code on MongoDB but a lot more code on DocumentDB:

For MongoDB:

    collection.DeleteOneAsync(o => o.OrderNumber == orderNumber).Wait();

For DocumentDB:

    Order order = client.CreateDocumentQuery<Order>(collection.DocumentsLink)
        .Where(x => x.OrderNumber == orderNumber)
        .AsEnumerable().FirstOrDefault();
    Document doc = client.CreateDocumentQuery<Document>(collection.DocumentsLink)
        .Where(x => x.Id == order.Id).AsEnumerable().FirstOrDefault();
    client.DeleteDocumentAsync(doc.SelfLink);


Let’s go through the full example. I have an Order document, in which the Id is a GUID generated by the database during insert and the OrderNumber is a user-friendly string, such as “558-4094307-8688964”.

    public class Order
    {
        public string Id; 
        public string OrderNumber; 
        public string ShippingAddress;
    }

The next thing I want to do is implement the data access layer to add, get, update and delete an Order document in a document store database. In particular, I need a Get method and a Delete method that take OrderNumber as a parameter, because the client will also need to call the REST API using the order number. So basically I need to implement the methods below (I’m using C#):

    void AddOrder(Order order);
    Order GetOrder(string id);
    Order GetOrderByNumber(string orderNumber);
    void UpdateOrder(Order order);
    void DeleteOrder(string id);
    void DeleteOrderByNumber(string orderNumber);

For each method, I compared the implementation on each database. The comparison is mainly focused on the amount of code and how straightforward and intuitive it is to code.

[Table: my ratings of the AddOrder, GetOrder, GetOrderByNumber, UpdateOrder, DeleteOrder and DeleteOrderByNumber implementations on MongoDB, RethinkDB, RavenDB and DocumentDB]

Here is the code and why I gave these ratings:

1. AddOrder

/* MongoDB */
void AddOrder(Order order)
{
    collection.InsertOneAsync(order).Wait();
}

/* RethinkDB */
void AddOrder(Order order)
{
    order.Id = conn.Run(tblOrders.Insert(order)).GeneratedKeys[0];
}

/* RavenDB */
void AddOrder(Order order)
{
    using (IDocumentSession session = store.OpenSession())
    {
        session.Store(order);
        session.SaveChanges();
    }
}

/* DocumentDB */
void AddOrder(Order order)
{
    Document doc = client
        .CreateDocumentAsync(collection.SelfLink, order).Result;
    order.Id = doc.Id;
}

Not too bad. Extra credit to MongoDB and RavenDB: their client libraries can automatically back-fill the DB-generated value of Id into my original document object. On DocumentDB and RethinkDB, I need to write my own code to do the back-fill. Note: I’m using the latest official client driver for MongoDB, RavenDB and DocumentDB. RethinkDB doesn’t have an official .NET driver, so I am using a community-supported .NET driver for it.

2. GetOrder

/* MongoDB */
Order GetOrder(string id)
{
    return collection.Find(o => o.Id == id).FirstOrDefaultAsync().Result;
}

/* RethinkDB */
Order GetOrder(string id)
{
    return conn.Run(tblOrders.Get(id));
}

/* RavenDB */
Order GetOrder(string id)
{
    using (IDocumentSession session = store.OpenSession())
    {
        return session.Load<Order>(id);
    }
}

/* DocumentDB */
Order GetOrder(string id)
{
    return client.CreateDocumentQuery<Order>(collection.DocumentsLink)
        .Where(x => x.Id == id).AsEnumerable().FirstOrDefault(); 
}

One up for RethinkDB and RavenDB. When I use Id to query, I shouldn’t need to write a search condition like x => x.Id == id.

3. GetOrderByNumber

/* MongoDB */
Order GetOrderByNumber(string orderNumber)
{
    return collection.Find(o => o.OrderNumber == orderNumber)
        .FirstOrDefaultAsync().Result;
}

/* RethinkDB */
Order GetOrderByNumber(string orderNumber)
{
    return conn.Run(tblOrders.Filter(o => o.OrderNumber == orderNumber))
            .FirstOrDefault();
}

/* RavenDB */
Order GetOrderByNumber(string orderNumber)
{
    using (IDocumentSession session = store.OpenSession())
    {
        return session.Query<Order>()
            .Where(x => x.OrderNumber == orderNumber).FirstOrDefault();
    }
}

/* DocumentDB */
Order GetOrderByNumber(string orderNumber)
{
    return client.CreateDocumentQuery<Order>(collection.DocumentsLink)
        .Where(x => x.OrderNumber == orderNumber)
        .AsEnumerable().FirstOrDefault();
}

4. UpdateOrder

/* MongoDB */
void UpdateOrder(Order order)
{
    collection.ReplaceOneAsync(o => o.Id == order.Id, order).Wait();
}

/* RethinkDB */
void UpdateOrder(Order order)
{
    conn.Run(tblOrders.Get(order.Id.ToString()).Replace(order));
}

/* RavenDB */
void UpdateOrder(Order order)
{
    using (IDocumentSession session = store.OpenSession())
    {
        session.Store(order);
        session.SaveChanges();
    }
}

/* DocumentDB */
void UpdateOrder(Order order)
{
    Document doc = client.CreateDocumentQuery<Document>(collection.DocumentsLink)
        .Where(x => x.Id == order.Id).AsEnumerable().FirstOrDefault();
    client.ReplaceDocumentAsync(doc.SelfLink, order).Wait();
}

DocumentDB needs an extra step! I have to do a separate query by Id first to get back a Document object, then use the SelfLink value on that Document object to call ReplaceDocumentAsync. I don’t understand why the syntax has to be like that.

5. DeleteOrder

/* MongoDB */
void DeleteOrder(string id)
{
    collection.DeleteOneAsync(o => o.Id == id).Wait();
}

/* RethinkDB */
void DeleteOrder(string id)
{
    conn.Run(tblOrders.Get(id).Delete());
}

/* RavenDB */
void DeleteOrder(string id)
{
    using (IDocumentSession session = store.OpenSession())
    {
        session.Delete(id);
        session.SaveChanges();
    }
}

/* DocumentDB */
void DeleteOrder(string id)
{
    Document doc = client.CreateDocumentQuery<Document>(collection.DocumentsLink)
        .Where(x => x.Id == id).AsEnumerable().FirstOrDefault();
    client.DeleteDocumentAsync(doc.SelfLink);
}

Same as in GetOrder, extra point for RethinkDB and RavenDB for not needing a search condition x => x.Id == id when Id is used.

6. DeleteOrderByNumber

/* MongoDB */
void DeleteOrderByNumber(string orderNumber)
{
    collection.DeleteOneAsync(o => o.OrderNumber == orderNumber).Wait();
}

/* RethinkDB */
void DeleteOrderByNumber(string orderNumber)
{
    conn.Run(tblOrders.Filter(o => o.OrderNumber == orderNumber).Delete());
}

/* RavenDB */
void DeleteOrderByNumber(string orderNumber)
{
    using (IDocumentSession session = store.OpenSession())
    {
        var order = session.Query<Order>()
            .Where(x => x.OrderNumber == orderNumber).FirstOrDefault();
        session.Delete(order);
        session.SaveChanges();
    }
}

/* DocumentDB */
void DeleteOrderByNumber(string orderNumber)
{
    Order order = client.CreateDocumentQuery<Order>(collection.DocumentsLink)
        .Where(x => x.OrderNumber == orderNumber)
        .AsEnumerable().FirstOrDefault();
    Document doc = client.CreateDocumentQuery<Document>(collection.DocumentsLink)
        .Where(x => x.Id == order.Id).AsEnumerable().FirstOrDefault();
    client.DeleteDocumentAsync(doc.SelfLink);
}

MongoDB and RethinkDB are the best for DeleteOrderByNumber: they each need only 1 call. RavenDB needs 2 calls: it first queries by OrderNumber, then does the Delete (which presumably uses the Id). DocumentDB is the worst, as I need to make 3 calls! Before I can call DeleteDocumentAsync, I first need to query by OrderNumber to get the Id, then use the Id to query again to get the self-link of the Order document! DocumentDB’s client driver seems to have only one method for delete, DeleteDocumentAsync, which takes only a SelfLink string.

I don’t understand why there isn’t an overload of DeleteDocumentAsync that takes an Id. It doesn’t seem to be just me: there are 300 votes on feedback.azure.com asking for support for deleting a document by id.

Summary

Overall, implementing the data access layer on DocumentDB is a somewhat inferior experience compared to the other three. I hope the DocumentDB team can improve it in the near future.


Footnote 1:

I was advised that if my Order class extends the Microsoft.Azure.Documents.Resource type, it will already have the SelfLink property on it, and I will not need the extra query in UpdateOrder and the Delete methods.
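
In other words, a sketch of that advice would look like this:

    using Microsoft.Azure.Documents;

    // Extend the driver's Resource type so SelfLink comes for free,
    // at the cost of coupling the domain model to DocumentDB.
    public class Order : Resource
    {
        // Id and SelfLink are inherited from Resource.
        public string OrderNumber;
        public string ShippingAddress;
    }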

It works, but it is not acceptable to me. Having the Order class extend Resource pollutes my domain model. Usually we want our domain model objects to be free of dependencies, so that they work best for interoperability across different layers and stacks.

Strictly speaking, the Order object isn’t 100% pure on MongoDB either: I needed to put a [BsonId] attribute on the Id property. But an attribute is much better than the additional member fields introduced by extending a type from a specific DB’s client driver. For example, one major difference is that in JSON serialization, attributes won’t show up but member fields will.

Footnote 2:

The Order object was defined slightly differently on each DB. For completeness, here are the exact definitions:

    /* MongoDB */
    [Serializable]
    public class Order
    {
        [BsonId(IdGenerator = typeof(StringObjectIdGenerator))]
        public string Id;
        public string OrderNumber;
        public string ShippingAddress;
    }

    /* RethinkDB */
    [DataContract]
    public class Order
    {
        [DataMember(Name = "id", EmitDefaultValue = false)]
        public string Id;
        [DataMember]
        public string OrderNumber;
        [DataMember]
        public string ShippingAddress;
    }

    /* RavenDB */
    public class Order
    {
        public string Id = string.Empty;
        public string OrderNumber;
        public string ShippingAddress;
    }

    /* DocumentDB */
    public class Order
    {
        [JsonProperty(PropertyName = "id")]
        public string Id;
        public string OrderNumber;
        public string ShippingAddress;
    }


[Update] In August 2015, Azure DocumentDB announced that they have made improvements around the self-link issue.

WhatDidIDo.com

In 2008, I wrote a small tool, WhatDidIDo, for myself, because I wanted to know where I spent my time. In those days my stress level was a bit high: I was spending a lot of hours in front of my PC and still felt that I didn’t have enough time to get all the work done.

The WhatDidIDo program was simple. It was written in Visual C++. It ran in the background, using SetWindowsHookEx() to capture the events fired when a window was activated, and GetWindowThreadProcessId() and GetModuleBaseName() to find out which application the activated window belonged to. Then the program simply wrote the data into a comma-delimited CSV file.
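
For the curious, here is a rough sketch of the same idea in C#. It is not the original code: WhatDidIDo was Visual C++ and used SetWindowsHookEx(), whereas this sketch hooks foreground-window changes with SetWinEventHook() and uses Process.ProcessName instead of GetModuleBaseName(); the output file name is made up too:

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Runtime.InteropServices;
    using System.Windows.Forms;

    class ForegroundLogger
    {
        delegate void WinEventDelegate(IntPtr hWinEventHook, uint eventType, IntPtr hwnd,
            int idObject, int idChild, uint dwEventThread, uint dwmsEventTime);

        const uint EVENT_SYSTEM_FOREGROUND = 0x0003;
        const uint WINEVENT_OUTOFCONTEXT = 0x0000;

        [DllImport("user32.dll")]
        static extern IntPtr SetWinEventHook(uint eventMin, uint eventMax, IntPtr hmodWinEventProc,
            WinEventDelegate lpfnWinEventProc, uint idProcess, uint idThread, uint dwFlags);

        [DllImport("user32.dll")]
        static extern uint GetWindowThreadProcessId(IntPtr hWnd, out uint lpdwProcessId);

        // Keep a reference to the delegate so the garbage collector doesn't free it.
        static readonly WinEventDelegate callback = OnForegroundChanged;

        static void OnForegroundChanged(IntPtr hook, uint eventType, IntPtr hwnd,
            int idObject, int idChild, uint thread, uint time)
        {
            GetWindowThreadProcessId(hwnd, out uint pid);
            string app = "unknown";
            try { app = Process.GetProcessById((int)pid).ProcessName; } catch { }

            // Append "timestamp,application" to a CSV file, like the original tool did.
            File.AppendAllText("whatdidido.csv", $"{DateTime.Now:o},{app}\r\n");
        }

        [STAThread]
        static void Main()
        {
            SetWinEventHook(EVENT_SYSTEM_FOREGROUND, EVENT_SYSTEM_FOREGROUND,
                IntPtr.Zero, callback, 0, 0, WINEVENT_OUTOFCONTEXT);
            Application.Run(); // keep a message loop running so the hook keeps firing
        }
    }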

The tool worked well and was later used by a few other team members too, as they also wanted to understand where they spent (or wasted) their time. Here is the data from one of them, who was among the best developers we had at that time:

[Image: a WhatDidIDo report showing how one developer’s time was split across applications]

A few days ago I saw someone recommending the RescueTime app. It instantly reminded me of WhatDidIDo. Isn’t it the same idea? How come I didn’t even think of turning WhatDidIDo into a startup company?

On the other hand, of course, turning WhatDidIDo into a startup company wouldn’t guarantee success. Actually there are quite a few similar apps out there, like ManicTime, which are not as hot as RescueTime. There would be a long way to go for WhatDidIDo.com to become today’s RescueTime. 

When LINQ-to-SQL Meets Partitioned View‏

When SQL Server (including SQL Azure) is the database for my projects, I like to use LINQ-to-SQL. Its attribute-based mapping is pretty neat. I also like to use partitioned views, which make it easy and fast to purge old data: just drop tables rather than run DELETE commands.

Recently, in a new project where I used LINQ-to-SQL and a partitioned view together, I ran into this error:

System.Data.SqlClient.SqlException: The OUTPUT clause cannot be specified because the target view "FooBar" is a partitioned view.

I wasn’t able to find a good answer on Bing/Google/StackOverflow. It seemed I might have to dig into the source code of System.Data.Linq to find out what exact SQL command LINQ-to-SQL generated and why there was an OUTPUT clause in it. When I was about to start that laborious source-code-reading journey, I happened to look at my entity class again and suddenly realized: “wait a second, could the problem be the IsDbGenerated = true and AutoSync = AutoSync.OnInsert flags?”:

[Column(Name = "guid_row_id", CanBeNull = false, IsPrimaryKey = true, 
        IsDbGenerated = true, AutoSync = AutoSync.OnInsert)]
public Guid RowId;

“Yeah, that would make sense”, I thought, because if I were to implement LINQ-to-SQL myself, I would probably have used the OUTPUT clause too to implement the IsDbGenerated and AutoSync flags. So I removed IsDbGenerated and AutoSync:

[Column(Name = "guid_row_id", CanBeNull = false, IsPrimaryKey = true)]
public Guid RowId;

Voilà, the error was gone!

It turned out that the culprit was IsDbGenerated. So as a workaround, I changed my code to generate new row IDs with Guid.NewGuid() in the application code. That’s fine for my project since it’s just a Guid. I guess this issue, that IsDbGenerated in LINQ-to-SQL doesn’t work with partitioned views, would be more troublesome for someone who wants to use another DB-generated value, like GetUTCDate() (which can be quite useful for avoiding time drift on the client side) or an auto-increment integer.
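
For completeness, here is a minimal sketch of the workaround. The DataContext and entity names (FooBarDataContext, FooBarRow) are hypothetical:

    // Generate the row ID in application code instead of relying on IsDbGenerated,
    // so LINQ-to-SQL does not need an OUTPUT clause against the partitioned view.
    using (var db = new FooBarDataContext(connectionString))
    {
        var row = new FooBarRow
        {
            RowId = Guid.NewGuid()   // client-generated, no database round-trip
            // ... other columns ...
        };
        db.FooBarRows.InsertOnSubmit(row);
        db.SubmitChanges();
    }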