Bowling Is the Worst Team Morale Idea

Over the years, I have taken my teams to, participated in, or heard of various kinds of team morale events, including: bowling, whirlyball, movies, scavenger hunts, boat chartering, go-karting, curling, indoor skiing, day of caring, iron chef, family fun center, pool/billiards, laser tag, karaoke, …

But not all of them make good team morale events. Here is how I define “good”:

1. Easier to blend people

A team morale event is a great place for team members to get to know others they haven’t had the chance to work closely with before. However, people naturally tend to hang out with people they are already familiar with, since that’s their comfort zone, and stay there. It’s more of a concern for software companies, because many engineers are introverted and socially passive. So a good morale event must make it easier and more natural for engineers to switch groups.

In bowling, each lane moves at a different pace. There is hardly ever a good moment when two lanes have both just finished and people can swap. Plus, someone may hesitate to join the big boss’s lane, as they don’t want to be seen as an ass-kisser. A scavenger hunt isn’t good either: we get split into teams and stay with our own team throughout the whole hunt.

In whirlyball and curling, switching groups is less awkward and less likely to be over-interpreted. People can switch sides before or after a game.

In the day of caring event, introverted people can switch groups easily and naturally, too. People in the upstairs room shouted “we need someone to help move the furniture” and a big guy put down his yard work and came to help.

2. Not something that I can do myself

When I choose a morale event idea, I prefer things that I can’t do on my own, for reasons like affordability, a minimum number of people required, etc.

For example, I can’t do these on my own, which makes them better choices for team morale events:

  • Curling: it takes 8 people (4v4)
  • Whirlyball: it’s usually 5v5 or 6v6
  • Scavenger Hunt: it’s best for 20-30 people
  • Go-kart: the more people, the more fun. With 10 or 12 karts on the track, there is a lot of bumping, passing and laughing.
  • Boat chartering: a family’s budget can only get a much smaller boat.

Bowling, movies and the family fun center are things I can do with my family on weekends, so they are less preferable as team morale events.

3. Doesn’t need a lot of practice

Pool and bowling aren’t good morale events because it takes quite some practice to build the skills needed to truly enjoy the game. It’s no fun if my bowling ball always ends up in the gutter. Plus, between a newbie and someone who has played pool or bowled a lot, the skill gap is hard to close in an hour or two. Pool and bowling are only enjoyable when the players’ skills are nearly level.

Curling is better. Although curling also takes a lot of practice to master, few of us in software companies have played it much before; most of us are at beginner level. The same goes for go-karting: few of us are pro racers, and everybody can hit the gas, brake and steer.

Day of Caring is even better: I learned useful skills in the volunteer work. Last time, my team was helping an assisted living facility. We were cleaning up the yard as well as the interior. I painted the walls, which I had never done before. I learned some tips from others and now I feel more confident about painting my own place (maybe starting with the garage).

4. Sense of achievement and shared memory

There was a very strong sense of achievement in the day of caring event: before we came, the place was in a poor state. The paint had peeled, the weeds were tall, and the walkway was muddy. We painted the walls, cut the weeds and bushes, and paved the walkway with gravel. When we left, the place looked much nicer. Being able to make a dramatic change in a single day feels good. Plus, it was something that “we did together”. Such a memory creates a good bond between team members.

5. Doesn’t make anyone feel intellectually inferior

Most of the morale event types that involve competition, including whirlyball, curling, go-karting, bowling, etc., are physical competitions.

But a scavenger hunt is more of an intellectual competition: you need to solve puzzles, think of ways to work around roadblocks, and so on. The losing teams feel they are less smart and less good at problem solving. That hurts our egos, because we software engineers are intellectual workers. We are proud of our intellectual horsepower and problem-solving skills, which are what it took to pass the interview and get our jobs.

So I would avoid any morale event type that is an intellectual competition.

6. No individual ranking 

In events like go-karting, everybody gets a score. Your score is higher than my score; mine is higher than his. Haven’t we had enough of that at the office already: everyone gets a number each year; your number is better than my number, so you get a bigger bonus than I do, and you get the promo while I don’t. We are all sick of ranking at the office. Better not to have another ranking when we are having fun.

Conclusion

Based on the above six criteria, here is how these morale event ideas score (where ‘x’ is 0, ‘v’ is 1 and ‘+’ is 2). Not surprisingly, bowling is the worst.

[Scoring table not shown: each morale event idea rated against the six criteria]

Make Breaking Changes Without API Versioning

Lots of REST APIs are versioned: Microsoft Azure Service Management API, AWS EC2 API, Facebook Graph API, Twitter REST API, Stripe, etc. The motivation behind API versioning is obvious: to make it easier to introduce breaking changes without disrupting existing client code. As Facebook’s API doc says:

“The goal for having versioning is for developers building apps to be able to understand in advance when an API or SDK might change. They help with web development, but are critical with mobile development because a person using your app on their phone may take a long time to upgrade (or may never upgrade).”

However, API versioning can sometimes incur substantial engineering cost. Take the Microsoft Azure Service Management API as an example. It currently has more than 20 versions, among which the first one, 2009-10-01, was released six years ago and is still in use. The customers’ expectation is that as long as they stick with the same version (e.g. 2009-10-01), their client code won’t ever need to change. Assume there are 1,000 test cases for the Service Management API; to deliver on that expectation, every time Azure upgrades the Service Management API it theoretically has to run those 1,000 test cases 20+ times, once for each version from 2009-10-01 all the way to 2015-04-01! Having a retirement policy like Facebook’s will help a bit, but even in Facebook’s case, they still have to run all the test cases five times in theory (from v2.0 to v2.4).

The engineering cost problem doesn’t change much whether the multiple versions are served by the same piece of code/binary or by separate pieces:

  • You may implement the API so that only the outer layer is versioned and most of the core/backend code is version-less, hoping to save some testing. But that actually increases the chance of accidentally changing the behavior of older API versions, since they share the same core/backend code.
  • Making the core/backend code versioned as well gives better isolation between versions, but there are a lot more code paths that need to be covered in testing.
  • Forking a maintenance branch for each version may sound appealing, since it mostly eliminates the need to run full-blown testing for v2.0, v2.1, v2.2 and v2.3 when you release v2.4, given that v2.0 through v2.3 live in their own branches. But on the other hand, applying a bug fix across multiple maintenance branches may not be as trivial as it sounds. Plus, when multiple versions of the binaries run side by side, data compatibility and corruption problems become real.

I was a proponent of API versioning until I saw the cost. Then I went back to where we started: what are the other ways to make it easier to introduce breaking changes without requiring all client code to upgrade?

A policy-driven approach may not be a bad idea. Let’s try it on a couple of real examples of breaking changes:

  1. The Azure Service Management API introduced a breaking change in version 2013-03-01: “The AdminUsername element is now required”. Instead of introducing a new API version, we could add a policy “AdminUsername element must not be empty” and start with “Enforced = false” for every subscription. Once enforced, any API call with an empty AdminUsername will fail. An individual subscription can turn on the enforcement itself in the management portal, or it will be forced on after two years (equivalent to Facebook’s two-year version retirement policy). Once it’s turned on, it can’t be turned off. During those two years, there might be other new features that require the “AdminUsername element must not be empty” policy to be turned on. It’s up to each subscription whether to delay the work of changing the code around AdminUsername at the cost of delaying the adoption of those other new features, or to pay the cost of changing the code now and get access to the other new features sooner.
  2. The Facebook Graph API deprecated “GET /v2.4/{id}/links” in v2.4. As an alternative to introducing a new API version, we could add a policy “Deprecate /{id}/links”. It would be a per-App-ID policy that can be viewed/set in the dashboard. App owners would receive reminders as the deadline for turning on the policy approaches: 12 months, 6 months, 3 months, 4 weeks, 1 week, … (A rough sketch of what such policy enforcement could look like in code follows this list.)
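
Here is a minimal sketch, in C#, of what the enforcement side of such a policy could look like. This is not Azure’s or Facebook’s actual implementation; the names (PolicyStore, RequireAdminUsername, DeploymentRequestValidator) are all hypothetical:

using System;

public class PolicyStore
{
    // Returns true if the subscription has opted in to the policy,
    // or if the enforcement deadline has passed (hypothetical lookup).
    public bool IsEnforced(Guid subscriptionId, string policyId)
    {
        // In a real system this would hit a per-subscription policy table.
        return false;
    }
}

public class DeploymentRequestValidator
{
    private readonly PolicyStore policies;

    public DeploymentRequestValidator(PolicyStore policies)
    {
        this.policies = policies;
    }

    // Rejects the call only when the breaking-change policy is enforced
    // for this subscription; otherwise the old behavior is preserved.
    public void Validate(Guid subscriptionId, string adminUsername)
    {
        if (policies.IsEnforced(subscriptionId, "RequireAdminUsername")
            && string.IsNullOrEmpty(adminUsername))
        {
            throw new ArgumentException("AdminUsername element must not be empty.");
        }
    }
}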

The policy-driven approach has different characteristics than API versions when it comes to discovering breaking changes. In the API version way, when the client switches to v2.4 in its development environment, errors will pop up right away in the integration test if any hidden client code is still consuming APIs deprecated in v2.4. In the policy-driven way, the developer would need to use a different App ID, one in development mode, to turn the policy on and run the integration test. I don’t see a fundamental difference between these two ways; they each have pros and cons. Advancing a single number (the API version) may be less work than flipping a number of policies, but policies give the client more granular control and flexibility.

At the end of the day, API versions may still be needed, but only for a major refactoring/redesign of the API, which would only happen once every couple of years. Maybe it’s more appropriate to call them “generations” rather than “versions”. Within the same generation, we can use the policy-driven approach for smaller breaking changes.

The HipChat Stress

Email stress has been widely acknowledged for some years. Email is a major source of pressure in the workplace, as people feel the obligation to respond quickly. While we are still searching for ways to cope with email stress, a newer and even worse source of pressure has unfortunately emerged: the HipChat stress.

Here is a real story. Recently, on a weekday afternoon, I was having an in-person conversation with a lady who works for another software company where HipChat is used as much as email, if not more. Our conversation was about an important matter, so I put aside all my electronic gadgets to focus on the conversation, as well as to pay respect. The lady had her laptop open next to her while we were talking, as she said she wanted to stay online. Our conversation was interrupted several times, because she noticed that someone was “at-ing” her on HipChat. Since it wasn’t an emergency (e.g. a live site incident), I asked her why she felt the obligation to respond right away. She said it was because it was on a group channel.

Later that day, I couldn’t stop wondering what the difference is between (a) a group channel on HipChat or Slack and (b) an email thread that has the whole team on it. It appears to me that our brains equate a group channel in HipChat to a real team meeting with everybody in the same room (by the way, such an illusion is a testament to how successfully HipChat and Slack bring the team closer together). In a real team meeting, of course we feel obligated to respond when our names are called. Hence we feel the same when being at-ed in a group channel in HipChat or Slack.

As instant messaging services like HipChat and Slack gain popularity at an unprecedented pace, I guess the HipChat stress that I observed in that lady will soon become very common and probably dwarf the email stress. A hundred years after Henry Ford installed the first moving assembly line in the world, HipChat and Slack are becoming the new assembly line for software engineers.

The Self-Link Nonsense in Azure DocumentDB

Lately I have been playing with Azure DocumentDB. It’s really good: it’s more of a true Database-as-a-Service than other hosted/managed NoSQL offerings; it supports many nice features that not everyone supports, like the distinction between replace and update; and its price seems substantially lower than the others’ (MongoLab, RavenHQ, MongoHQ).

However, I am not very happy with DocumentDB’s client-side programming experience. Overall, the implementation of a data access layer on top of DocumentDB is probably the most bloated among all the document store databases that I’ve used, including MongoDB, RavenDB and RethinkDB. In particular, there is one thing that is really annoying in Azure DocumentDB: the self-link. Because self-links are needed in various places, the data access layer code for DocumentDB is fairly awkward.

What I mean by “bloated” is that to achieve the same thing (e.g. to implement the DeleteOrderByNumber method in the example below), I only need one line of code on MongoDB but a lot more on DocumentDB:

For MongoDB:

    collection.DeleteOneAsync(o => o.OrderNumber == orderNumber).Wait();

For DocumentDB:

    Order order = client.CreateDocumentQuery<Order>(collection.DocumentsLink)
        .Where(x => x.OrderNumber == orderNumber)
        .AsEnumerable().FirstOrDefault();
    Document doc = client.CreateDocumentQuery(collection.DocumentsLink)
        .Where(x => x.Id == order.Id).AsEnumerable().FirstOrDefault();
    client.DeleteDocumentAsync(doc.SelfLink);


Let’s go through the full example. I have an Order document, in which the Id is a GUID generated by the database during insert and the OrderNumber is a user-friendly string, such as “558-4094307-8688964”.

    public class Order
    {
        public string Id; 
        public string OrderNumber; 
        public string ShippingAddress;
    }

The next thing I want to do is implement the data access layer to add, get, update and delete an Order document in a document store database. In particular, I need a Get method and a Delete method that take OrderNumber as a parameter, because the client will also need to call the REST API using the order number. So basically I need to implement the methods below (I’m using C#):

    void AddOrder(Order order);
    Order GetOrder(string id);
    Order GetOrderByNumber(string orderNumber);
    void UpdateOrder(Order order);
    void DeleteOrder(string id);
    void DeleteOrderByNumber(string orderNumber);

For each method, I compared the implementation on each database. The comparison is mainly focused on the amount of code and how straightforward and intuitive it is to code.

[Rating table not shown: AddOrder, GetOrder, GetOrderByNumber, UpdateOrder, DeleteOrder and DeleteOrderByNumber, each rated for MongoDB, RethinkDB, RavenDB and DocumentDB. The ratings are explained below.]

Here is the code and why I gave these ratings:

1. AddOrder

/* MongoDB */
void AddOrder(Order order)
{
    collection.InsertOneAsync(order).Wait();
}

/* RethinkDB */
void AddOrder(Order order)
{
    order.Id = conn.Run(tblOrders.Insert(order)).GeneratedKeys[0];
}

/* RavenDB */
void AddOrder(Order order)
{
    using (IDocumentSession session = store.OpenSession())
    {
        session.Store(order);
        session.SaveChanges();
    }
}

/* DocumentDB */
void AddOrder(Order order)
{
    Document doc = client
        .CreateDocumentAsync(collection.SelfLink, order).Result;
    order.Id = doc.Id;
}

Not too bad. Extra credit to MongoDB and RavenDB: their client libs can automatically back-fill the DB-generated value of Id into my original document object. On DocumentDB and RethinkDB, I need to write my own code to do the back-fill. Note: I’m using the latest official client driver for MongoDB, RavenDB and DocumentDB. RethinkDB doesn’t have an official .NET driver, so I am using a community-supported .NET driver for it.

2. GetOrder

/* MongoDB */
Order GetOrder(string id)
{
    return collection.Find(o => o.Id == id).FirstOrDefaultAsync().Result;
}

/* RethinkDB */
Order GetOrder(string id)
{
    return conn.Run(tblOrders.Get(id));
}

/* RavenDB */
Order GetOrder(string id)
{
    using (IDocumentSession session = store.OpenSession())
    {
        return session.Load<Order>(id);
    }
}

/* DocumentDB */
Order GetOrder(string id)
{
    return client.CreateDocumentQuery<Order>(collection.DocumentsLink)
        .Where(x => x.Id == id).AsEnumerable().FirstOrDefault(); 
}

One up for RethinkDB and RavenDB. When I query by Id, I shouldn’t need to write a search condition like x => x.Id == id.

3. GetOrderByNumber

/* MongoDB */
Order GetOrderByNumber(string orderNumber)
{
    return collection.Find(o => o.OrderNumber == orderNumber)
        .FirstOrDefaultAsync().Result;
}

/* RethinkDB */
Order GetOrderByNumber(string orderNumber)
{
    return conn.Run(tblOrders.Filter(o => o.OrderNumber == orderNumber))
            .FirstOrDefault();
}

/* RavenDB */
Order GetOrderByNumber(string orderNumber)
{
    using (IDocumentSession session = store.OpenSession())
    {
        return session.Query<Order>()
            .Where(x => x.OrderNumber == orderNumber).FirstOrDefault();
    }
}

/* DocumentDB */
Order GetOrderByNumber(string orderNumber)
{
    return client.CreateDocumentQuery<Order>(collection.DocumentsLink)
        .Where(x => x.OrderNumber == orderNumber)
        .AsEnumerable().FirstOrDefault();
}

4. UpdateOrder

/* MongoDB */
void UpdateOrder(Order order)
{
    collection.ReplaceOneAsync(o => o.Id == order.Id, order).Wait();
}

/* RethinkDB */
void UpdateOrder(Order order)
{
    conn.Run(tblOrders.Get(order.Id.ToString()).Replace(order));
}

/* RavenDB */
void UpdateOrder(Order order)
{
    using (IDocumentSession session = store.OpenSession())
    {
        session.Store(order);
        session.SaveChanges();
    }
}

/* DocumentDB */
void UpdateOrder(Order order)
{
    Document doc = client.CreateDocumentQuery(collection.DocumentsLink)
        .Where(x => x.Id == order.Id).AsEnumerable().FirstOrDefault();
    client.ReplaceDocumentAsync(doc.SelfLink, order).Wait();
}

DocumentDB needs an extra step! I have to do a separate query by Id first, to get back a Document object, then use the SelfLink value on the Document object to call ReplaceDocumentAsync. I don’t understand why the syntax has to be like that.

5. DeleteOrder

/* MongoDB */
void DeleteOrder(string id)
{
    collection.DeleteOneAsync(o => o.Id == id).Wait();
}

/* RethinkDB */
void DeleteOrder(string id)
{
    conn.Run(tblOrders.Get(id).Delete());
}

/* RavenDB */
void DeleteOrder(string id)
{
    using (IDocumentSession session = store.OpenSession())
    {
        session.Delete(id);
        session.SaveChanges();
    }
}

/* DocumentDB */
void DeleteOrder(string id)
{
    Document doc = client.CreateDocumentQuery(collection.DocumentsLink)
        .Where(x => x.Id == id).AsEnumerable().FirstOrDefault();
    client.DeleteDocumentAsync(doc.SelfLink);
}

Same as in GetOrder, extra point for RethinkDB and RavenDB for not needing a search condition x => x.Id == id when Id is used.

6. DeleteOrderByNumber

/* MongoDB */
void DeleteOrderByNumber(string orderNumber)
{
    collection.DeleteOneAsync(o => o.OrderNumber == orderNumber).Wait();
}

/* RethinkDB */
void DeleteOrderByNumber(string orderNumber)
{
    conn.Run(tblOrders.Filter(o => o.OrderNumber == orderNumber).Delete());
}

/* RavenDB */
void DeleteOrderByNumber(string orderNumber)
{
    using (IDocumentSession session = store.OpenSession())
    {
        var order = session.Query<Order>()
            .Where(x => x.OrderNumber == orderNumber).FirstOrDefault();
        session.Delete(order);
        session.SaveChanges();
    }
}

/* DocumentDB */
void DeleteOrderByNumber(string orderNumber)
{
    Order order = client.CreateDocumentQuery<Order>(collection.DocumentsLink)
        .Where(x => x.OrderNumber == orderNumber)
        .AsEnumerable().FirstOrDefault();
    Document doc = client.CreateDocumentQuery(collection.DocumentsLink)
        .Where(x => x.Id == order.Id).AsEnumerable().FirstOrDefault();
    client.DeleteDocumentAsync(doc.SelfLink);
}

MongoDB and RethinkDB are the best for DeleteOrderByNumber: they both need only one call. RavenDB needs two calls: it first queries by OrderNumber, then does the Delete (which presumably uses the Id). DocumentDB is the worst, as I need to make three calls! Before I can call DeleteDocumentAsync, I first need to query by OrderNumber to get the Id, then use the Id to query again to get the self-link of the Order document. DocumentDB’s client driver seems to have only one method for delete, DeleteDocumentAsync, and it only takes a self-link string.

I don’t understand why there isn’t an overload of DeleteDocumentAsync that takes an Id. It doesn’t seem to be just me: there are 300 votes on feedback.azure.com asking for support for deleting a document by id.

Summary

Overall, implementing the data access layer on DocumentDB is a somewhat inferior experience compared to the other three. I hope the DocumentDB team can improve it in the near future.


Foot Note 1:

I was advised that if my Order class derives from the Microsoft.Azure.Documents.Resource type, it will already have the SelfLink property on it, and I won’t need the extra query in UpdateOrder and DeleteOrder.
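
For illustration, that advice would look roughly like this (a sketch assuming the DocumentDB .NET SDK of that time, where the Resource base class carries Id, SelfLink, ETag, etc.):

    public class Order : Microsoft.Azure.Documents.Resource
    {
        public string OrderNumber;
        public string ShippingAddress;
    }

    // UpdateOrder/DeleteOrder would then no longer need the extra lookup:
    // client.ReplaceDocumentAsync(order.SelfLink, order).Wait();
    // client.DeleteDocumentAsync(order.SelfLink);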

It works, but it’s not acceptable to me. Having the Order class derive from Resource pollutes my domain model. Usually we want our domain model objects to be free of dependencies, so that they work best for interoperability across different layers and stacks.

Although, strictly speaking, the Order object isn’t 100% pure on MongoDB either: I needed to put a [BsonId] attribute on the Id property. But an attribute is much better than the additional member fields introduced by deriving from a type in a specific DB’s client driver. For example, one major difference is that in JSON serialization, attributes won’t show up but member fields will.

Foot Note 2:

The Order class was defined slightly differently for each DB. For completeness, here are the exact definitions:

    /* MongoDB */
    [Serializable]
    public class Order
    {
        [BsonId(IdGenerator = typeof(StringObjectIdGenerator))]
        public string Id;
        public string OrderNumber;
        public string ShippingAddress;
    }

    /* RethinkDB */
    [DataContract]
    public class Order
    {
        [DataMember(Name = "id", EmitDefaultValue = false)]
        public string Id;
        [DataMember]
        public string OrderNumber;
        [DataMember]
        public string ShippingAddress;
    }

    /* RavenDB */
    public class Order
    {
        public string Id = string.Empty;
        public string OrderNumber;
        public string ShippingAddress;
    }

    /* DocumentDB */
    public class Order
    {
        [JsonProperty(PropertyName = "id")]
        public string Id;
        public string OrderNumber;
        public string ShippingAddress;
    }


[Update] In August 2015, Azure DocumentDB announced that they have made improvement on the Self-Link thing.

WhatDidIDo.com

In 2008, I wrote a small tool, WhatDidIDo, for myself, because I wanted to know where I spent my time. In those days my stress level was a bit high: I was spending a lot of hours in front of my PC while still feeling that I didn’t have enough time to get all the work done.

The WhatDidIDo program was simple. It was written in Visual C++. It ran in the background, using SetWindowsHookEx() to capture the event fired when a window was activated, and GetWindowThreadProcessId() and GetModuleBaseName() to find out which application the activated window belonged to. Then the program wrote the data into a comma-delimited CSV file.
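
For illustration, here is a much-simplified sketch of the same idea in C#. It polls the foreground window instead of installing a hook with SetWindowsHookEx() as the original Visual C++ tool did, and it uses Process.ProcessName instead of GetModuleBaseName():

using System;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using System.Threading;

class WhatDidIDoSketch
{
    [DllImport("user32.dll")]
    static extern IntPtr GetForegroundWindow();

    [DllImport("user32.dll")]
    static extern uint GetWindowThreadProcessId(IntPtr hWnd, out uint processId);

    static void Main()
    {
        string lastApp = null;
        while (true)
        {
            uint pid;
            GetWindowThreadProcessId(GetForegroundWindow(), out pid);

            string app = "unknown";
            try { app = Process.GetProcessById((int)pid).ProcessName; } catch { }

            if (app != lastApp)   // log only when the active application changes
            {
                File.AppendAllText("whatdidido.csv",
                    string.Format("{0:u},{1}\r\n", DateTime.Now, app));
                lastApp = app;
            }
            Thread.Sleep(1000);
        }
    }
}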

The tool worked well and was later used by a few other team members too, as they also wanted to understand where they spent (or wasted) their time. Here is the data from one of them, who was among the best developers we had at that time:

[Chart: WhatDidIDo data from one of the team members]

A few days ago I saw someone recommending the RescueTime app. It instantly reminded me of WhatDidIDo. Isn’t it the same idea? How come I didn’t even think of turning WhatDidIDo into a startup company?

On the other hand, of course, turning WhatDidIDo into a startup company wouldn’t have guaranteed success. Actually, there are quite a few similar apps out there, like ManicTime, which are not as hot as RescueTime. There would be a long way to go for WhatDidIDo.com to become today’s RescueTime.

When LINQ-to-SQL Meets Partitioned View‏

When SQL Server (including SQL Azure) is the database for my projects, I like to use LINQ-to-SQL. Its attribute-based mapping is pretty neat. I also like to use Partitioned View, which makes it easy and fast to purge old data: just drop tables rather than run DELETE commands.

Recently, in a new project where I used LINQ-to-SQL and a Partitioned View together, I ran into this error:

System.Data.SqlClient.SqlException: The OUTPUT clause cannot be specified because the target view "FooBar" is a partitioned view.

I wasn’t able to find a good answer on Bing/Google/StackOverflow. It seemed I might have to look into the source code of System.Data.Linq to find out the exact SQL command that LINQ-to-SQL generated and why there was an OUTPUT clause in it. When I was about to start this laborious source-code-reading journey, I happened to look at my entity class again and suddenly realized: “wait a second, could the problem be the IsDbGenerated = true and AutoSync = AutoSync.OnInsert flags?”:

[Column(Name = "guid_row_id", CanBeNull = false, IsPrimaryKey = true, 
        IsDbGenerated = true, AutoSync = AutoSync.OnInsert)]
public Guid RowId;

“Yeah, that would make sense”, I thought, because if I were writing LINQ-to-SQL myself, I too would probably have used the OUTPUT clause to implement the IsDbGenerated and AutoSync flags. So I removed IsDbGenerated and AutoSync:

[Column(Name = "guid_row_id", CanBeNull = false, IsPrimaryKey = true)]
public Guid RowId;

Voilà, the error was gone!

Eventually it turned out that the culprit was IsDbGenerated. So as a workaround, I changed my code to generate new row IDs with Guid.NewGuid() in the application code. That’s fine for my project, since it’s just a Guid. I guess this issue, that IsDbGenerated in LINQ-to-SQL doesn’t work with partitioned views, would be more troublesome for someone who wants to use other DB-generated values, like GetUTCDate() (which can be quite useful for avoiding time drift on the client side) or an auto-increment integer.
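
For illustration, the workaround is just this kind of thing (a sketch; the FooBar entity and FooBarDataContext names are made up):

// Generate the row ID in application code, since IsDbGenerated (and the
// OUTPUT clause it produces) can't be used against a partitioned view.
var row = new FooBar
{
    RowId = Guid.NewGuid()   // instead of letting SQL Server generate it
    // ... other columns ...
};

using (var db = new FooBarDataContext(connectionString))
{
    db.FooBars.InsertOnSubmit(row);
    db.SubmitChanges();      // no OUTPUT clause in the generated INSERT now
}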

“Whatever It Takes”

My eyebrows knit every time I hear a manager telling the team “we will do whatever it takes” (e.g. to meet the project deadline).

When someone says “whatever it takes”, it’s likely s/he has little clue of what exactly it takes, even though most of the time it is a figurative phrase. The U.S. President may say “we will do whatever it takes to win the war against terrorism”. That’s OK, because indeed no one knows what will happen in the war; the President merely uses the phrase to express his conviction. But when it comes to meeting an approaching project deadline, there is no need for the manager to express his conviction to the team, since that’s not the best way to mobilize them, and there should be no unknown in what exactly it takes: the problem is simply that there is more remaining work than time left. Assuming the time left won’t change and the team is not becoming more productive overnight, what it takes to meet the deadline is to 1) decide how much work needs to be cut and which work to cut, then 2) cut it and manage the consequences. It’s not going to be easy, for sure. But that’s the manager’s job.

When the manager says “we will do whatever it takes”, it passes a negative micro-message to the team. It reminds me of those distressed homeowners who vowed “I will do whatever it takes to keep my home”. We all know what happened to them: most of them eventually still lost their homes. The need for a manager to say “do whatever it takes” indicates that things are already in a very bad situation. “We will do whatever it takes” paints a dismal picture for the team, rather than showing them hope.

When a manager says “we will do whatever it takes”, s/he has lost the cool. The team can tell that the manager is in panic mode, that s/he is desperate. How can a country keep calm and carry on when its prime minister has lost the cool? How can a team keep calm and carry on when the manager has lost the cool? If a team has lost its calm, how can it remain productive and effective? And if a team is unable to remain productive and effective, how can it save an already-late project?

When a manager says “we will do whatever it takes”, s/he is willing to sacrifice the team’s long-term future. It’s like a homeowner withdrawing from the 401K account to keep up with the monthly mortgage. Most financial advisors will advise against that; they will suggest filing for bankruptcy and keeping the 401K intact. A manager who says “do whatever it takes” is willing to quench a thirst with poison. S/he will have no hesitation in burning the team out if that may increase the chance of getting the project done in time. That’s scary.

My advice? If you are a manager, never say “we will do whatever it takes”. If you are on a team where such a “we will do whatever it takes” moment has happened more than once, you may want to think about whether that’s a good place to stay, because the team seems unable to learn from its own mistakes.

Choosing Between Loggly, Logentries and Elasticsearch

Lately I have been looking for a log management service for my team’s new project, an engineering tool running as a website + REST API in Azure Websites, interacting with other engineering systems in my group and backed by SQL Azure and MongoDB. The need is basic: have one single place to store all the logs, traces and events from the different pieces of the application, so that my team and I can search the logs and use them for troubleshooting. Down the road, we may also set up some simple alerts on top of the logs. For various reasons, I chose not to use the internal systems, but to try something outside.

I tried and compared Loggly, Logentries and Elasticsearch and eventually picked Elasticsearch:

                         Loggly    Logentries    Elasticsearch
Hosting                  Hosted    Hosted        Self-Hosted
Setup                    Easy      Easy          OK
Web UI                   Good      OK            Good
.NET Support             Good      Good          OK
Official Documentation   Good      OK            Good
Community & Ecosystem    OK        OK            Good
Cost                     OK        OK            OK

Hosting

Both Loggly and Logentries are hosted; they are SaaS. Elasticsearch is open source software: you have to host it on your own machines. In my case, I put Elasticsearch + Kibana on a Linux VM in Azure. On the other hand, nearly all popular open source software has hosting providers. Just as there is MongoHQ for MongoDB, RavenHQ for RavenDB and GrapheneDB for Neo4j, there are also hosting providers for Elasticsearch, such as qbox.io and compose.io (formerly MongoHQ). I didn’t try them, but it seems qbox.io is pretty decent and the price is reasonable (basically the underlying hosting cost in various public clouds, plus a premium).

Setup

Since Loggly and Logentries are hosted, the setup is really simple: just create an account, fill in a form and you are good to go. Setting up Elasticsearch and Kibana for the first time on my own Linux VM took me about 30 minutes, carefully following a third-party instruction step by step. Later, when I did the setup again, the time was halved. By the way, that instruction is really high quality.

Web UI

Loggly and Elasticsearch (Kibana) tied. Loggly’s UI is more like the iPhone: it just works. It’s quite polished and easy to use for people who don’t want to spend a lot of time learning the tool itself (rather than using the tool to conduct business). Elasticsearch/Kibana is like Android: it’s very powerful and you can get a lot out of it if you know how to configure it and tweak your application’s logging. The analogy is not surprising: both Android and Elasticsearch/Kibana are open source, while the iPhone and Loggly are closed source.

[Screenshot: Loggly UI]

[Screenshot: Elasticsearch/Kibana UI]

Logentries’ UI is less satisfactory, and that was quite clear to me after a very brief use of 10 minutes or so. The design is relatively less fine-tuned. There seem to be some glitches in the client-side scripts, so that sometimes some UI elements were not very responsive or didn’t behave in the expected way. In particular, there are three downers in Logentries’ UI:

  1. Rows don’t expand inline. Both Loggly and Kibana support this, which is sometimes pretty convenient.
  2. The results don’t support sorting; they are always sorted by event time, ascending. It’s quite painful to press Page Down or drag the mouse many times to get to the latest rows every single time. In contrast, both Kibana and Loggly support sorting by time in either direction, and by default they both show the latest rows on top.
  3. The “X DAYS left in trial” reminder keeps popping up in the Logentries UI. It’s intrusive and annoying. For a startup like them, they should understand that a greater conversion rate should come organically from building a greater product.

[Screenshot: Logentries UI]

.NET Support

Loggly and Logentries tied. They both provide official log4net appenders, which are also available as NuGet packages. Their official websites both provide clear app/web.config examples of how to configure their appenders. Their appenders both work in asynchronous mode, so they can be used directly without noticeable performance overhead. A simple test shows that with their appenders enabled, calling logger.Info() 100 times in a row takes less than 100ms, i.e. <1ms per call.
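
The “simple test” was essentially something like this (a sketch; logger is assumed to be a log4net ILog wired up to the vendor’s asynchronous appender):

var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 100; i++)
{
    logger.Info("appender timing test message " + i);
}
sw.Stop();
Console.WriteLine("100 Info() calls took {0} ms", sw.ElapsedMilliseconds);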

Elasticsearch doesn’t provide an official log4net appender, nor an appender for Logstash. That’s a bit disappointing. There are a couple of choices on GitHub though, among which log4net.ElasticSearch is the most well-developed. In my project, I used log4stash, which was forked from log4net.ElasticSearch. But I had to do some work on log4stash before I could use it in my project, because log4stash doesn’t support SSL, and my Elasticsearch is exposed on the Internet so that my application running in Azure Websites can write logs into it (note: it seems Azure Websites recently started to support Virtual Network, which may eliminate the need to expose my Elasticsearch on the Internet). It wasn’t too hard to add SSL support to log4stash, though. I did it in my fork, it worked well in my project, and I created a pull request (which hasn’t been accepted yet). Anyone who needs a log4net appender for Elasticsearch with SSL support can grab it from my repo.

Official Documentation

Both Loggly’s and Elasticsearch’s official documentation is pretty good. No confusion.

Logentries has some room to improve. Take .NET support for example: there is a section on their official website and there is also documentation on GitHub. The doc on their official website uses the older setting names (LOGENTRIES_TOKEN and LOGENTRIES_ACCOUNT_KEY), while the doc on GitHub uses the newer setting names (Logentries.Token and Logentries.AccountKey).

Community & Ecosystem

Elasticsearch is the clear winner, although the three were born around the same time: Elasticsearch in 2010 (although its root, Lucene, has been around for 16 years); Loggly in 2009; Logentries in 2010.

Search for them on StackOverflow and you will get: [result counts not shown]

Searching for them on GitHub: [result counts not shown]

It’s not surprising that Elasticsearch has a much bigger and more active community: Elasticsearch is open source and self-hosted, while Loggly and Logentries are closed-source SaaS.

A plus for Logentries is that it seems to provide better out-of-the-box integration with other services like Slack, HipChat, PagerDuty, etc. Loggly seems to have out-of-the-box integration with PagerDuty, but not HipChat or Slack. My quick search didn’t find any out-of-the-box integration of Elasticsearch with Slack, HipChat, etc., though I’m sure there is something ready to use in the community.

Cost

None of the three options is free, although Loggly and Logentries both offer a 30-day free trial. After that, their entry-level prices apply. [Pricing comparison not shown.]

Purely from a cost-saving perspective, if I were doing a side project, I would probably go for Logentries. In my current project, since Microsoft employees can use Azure for free (note: the charge goes to our department), a Linux VM running Elasticsearch + Kibana is free to me.

Other Options

As mentioned in the recent article “Picking a cloud log management service”, there are a couple of other choices for a SaaS log management service, such as Splunk, Sumo Logic and Papertrail. I agree with that article that Splunk seems overkill for small projects and Sumo Logic doesn’t seem to fit. Papertrail looks a lot like Loggly and Logentries. I will give it a try when I get a chance, though I don’t expect Papertrail to be too different from Loggly and Logentries.

Last but not least, none of the three big public cloud providers provides a comprehensive SaaS log management service the way Loggly and Logentries do.

  • Amazon: AWS has Amazon CloudWatch. But from what I read, and as confirmed by the “Picking a cloud log management service” article (written in Jan 2015), Amazon CloudWatch is only for EC2 instances.
  • Google: The recently announced Google Cloud Logging looks like a SaaS log management service, but a relatively primitive one compared to Loggly, Logentries and Elasticsearch/Kibana. Plus, it seems to only support sending logs from applications in Google App Engine and VMs in Google Compute Engine.
  • Microsoft: Azure doesn’t seem to offer a log management service, although as part of the recent announcement of the new Azure App Service (which is kind of a v2 of Azure Websites), it does provide log collection, viewing and streaming.

It seems to be a common theme that Amazon’s, Google’s and Microsoft’s log management capabilities in their public cloud offerings are only for the applications and VMs running in their own public clouds[1]. That lack of openness is a bit disappointing.


[1] Update 04/12/2015: The logging in Azure Application Insights should work for any .NET application. It provides a log4net appender as well as a listener for System.Diagnostics.Trace.

Finding the Compatible SQLCMD.exe

I wrote this post because I hope it can save other people some time. When I ran into this issue this week, I searched on Bing/StackOverflow/etc. and couldn’t find a direct answer. So I spent some time doing my own troubleshooting, trying different solutions, and figured out a workable one. This post captures the issue and my solution, so that hopefully when other people run into the same issue in the future, they will find this post by searching on Bing/Google.


The Issue

In my unit test’s TestInitialize code, it runs a sqlcmd.exe command like this:

sqlcmd.exe -S (LocalDB)\UnitTest -E -d Jobs_DBTNXXVKQ3K6 -i "..\src\SQL Database\Jobs\Tables\Jobs.sql"

It works fine on my laptop, but it fails with the error below when running in a build in Visual Studio Online:

HResult 0xFFFFFFFF, Level 16, State 1
SQL Server Network Interfaces: Error Locating Server/Instance Specified [xFFFFFFFF].
Sqlcmd: Error: Microsoft SQL Server Native Client 10.0 : A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online..
Sqlcmd: Error: Microsoft SQL Server Native Client 10.0 : Login timeout expired.

That’s because this sqlcmd.exe was v10.0 (SQL Server 2008), which is incompatible with LocalDB; LocalDB was introduced in SQL Server 2012 (v11.0). Only the sqlcmd from SQL Server 2012 or later works with LocalDB.

The Solution

The solution is to find a later version of sqlcmd.exe on the build host of Visual Studio Online and point to it in my TestInitialize code. This page and this page list what’s installed on the build host, but for obvious reasons, in my TestInitialize I must do a search instead of using a hard-coded path.

A little surprise was that to do the file search on the build host in Visual Studio Online, I couldn’t use the DirectoryInfo.GetFiles() method with the SearchOption.AllDirectories parameter. It throws an exception when it is denied access to a folder, and there doesn’t seem to be a way to make DirectoryInfo.GetFiles() simply skip any directory it can’t access.

So I ended up writing a traversal myself rather than using DirectoryInfo.GetFiles(). The traversal is kind of time-consuming: it takes about 15-30 seconds on my laptop (probably because I’ve installed too much stuff under the Program Files folder). So I added a shortcut: first look for it in a few known possible places; if found, the time-consuming traversal can be skipped.

Here is the full code for finding the compatible SQLCMD.exe:

/* Only v11.0 and later version sqlcmd.exe is compatible with SQL Server 
 * LocalDB. Unfortunately, at this moment, the default sqlcmd.exe in VSO
 * is v10.0. This method is to find a compatible sqlcmd.exe. It doesn't 
 * have to be the latest.
 */
private static FileInfo FindCompatibleSqlcmd()
{
  /* Try a few possible known places first. If found, it saves time in 
   * doing the full blown search. 
   */
  string[] knownPossiblePlaces = new string[]{
     @"C:\Program Files\Microsoft SQL Server"
     + @"\Client SDK\ODBC\110\Tools\Binn\SQLCMD.EXE"
    ,@"C:\Program Files\Microsoft SQL Server" 
     + @"\110\Tools\Binn\SQLCMD.EXE"
  };
  foreach (var file in knownPossiblePlaces)
  {
    if (File.Exists(file))
    {
      logger.LogInfo("Got a match in known places: {0}", file);
      return new FileInfo(file);
    }
  }

  /* Now do a full blown search, using code sample from
   * https://msdn.microsoft.com/en-us/library/bb513869.aspx
   */
  FileInfo answer = null;
  string[] paths = new string[]{
    Environment.GetFolderPath(Environment.SpecialFolder.ProgramFiles),
    Environment.GetEnvironmentVariable("ProgramW6432"),
    Environment.GetEnvironmentVariable("ProgramFiles")
  };
  foreach (var path in paths.Where(p => !string.IsNullOrEmpty(p)).Distinct())
  {
    Stack<string> stack = new Stack<string>();
    stack.Push(path);

    while (stack.Count > 0)
    {
      string currentDir = stack.Pop();
      string[] subDirs;
      try
      {
        subDirs = Directory.GetDirectories(currentDir);
      }
      catch (UnauthorizedAccessException)
      {
        logger.LogInfo("Access denied to folder: {0}" , currentDir);
        continue;
      }
      catch (DirectoryNotFoundException)
      {
        logger.LogInfo("Access denied to folder: {0}" , currentDir);
        continue;
      }

      string[] files = null;
      try
      {
        files = Directory.GetFiles(currentDir);
      }
      catch (UnauthorizedAccessException)
      {
        logger.LogInfo("Access denied to folder: {0}" , currentDir);
        continue;
      }
      catch (DirectoryNotFoundException)
      {
        logger.LogInfo("Directory not found: {0}" , currentDir);
        continue;
      }

      foreach (string file in files)
      {
        try
        {
          FileInfo fi = new FileInfo(file);
          if (fi.Name.Equals("sqlcmd.exe"
                            , StringComparison.OrdinalIgnoreCase))
          {
            logger.LogInfo("Found: {0}, created on {1}"
                           , fi.FullName
                           , fi.CreationTime);
            if (null == answer || answer.CreationTime < fi.CreationTime)
            {
              answer = fi;
              logger.LogInfo("Update answer to: {0}", fi.FullName);
            }
          }
        }
        catch (FileNotFoundException)
        {
          logger.LogInfo("File was just deleted: {0}", file);
          continue;
        }
      }

      foreach (string str in subDirs)
        stack.Push(str);
    }
  }
  logger.LogInfo("FindCompatibleSqlcmd result: {0}" 
         , answer == null ? "null" : answer.FullName);
  return answer;
}

I hope this will be helpful to somebody some day.

What Should Swiss Watchmakers Do About Smart Watches?

I had been wearing an Omega Seamaster daily for several years, but a few months ago I switched to the Pebble Steel. Since then, the Omega has been sitting in the winder all the time, because the Pebble gives me one really important thing: I am not missing any calls or meeting reminders any more.

In the past, I missed a lot of calls and meeting reminders because the phone was in vibration mode, or because I didn’t hear it ringing over other noises. For example:

  • My phone was charging in my office while I was talking to someone next door. Unfortunately, in the previous meeting I had turned it to vibration mode. I was so focused on the discussion that I lost track of time; I missed the reminder for the next meeting and only realized it when the meeting was already halfway through. I felt awful for being so late. Sometimes the meeting had to be rescheduled due to my no-show, and it was me who wasted others’ time.
  • My phone was charging in the family room while I was cooking dinner in the kitchen. The range hood was running right in front of me and I couldn’t hear anything else. My wife called me on her way home to ask whether I wanted her to pick up any groceries or baked goods. I missed her call (and the first thing she said when she came in was “why didn’t you answer my call”).
  • My phone was in vibration mode and unfortunately sitting on an ottoman. The soft cushion of the ottoman absorbed all the vibration when a call came in, and I missed the call.
  • My wife and I were skiing, with our phones in the pockets of our ski jackets. When the ski school called to tell us to pick up our boy earlier, we both missed the call. We couldn’t hear the ring because we had helmets on, and we didn’t feel the vibration because we were in motion and had layers under our ski jackets.

None of that happens any more since I got the Pebble. When there is an incoming call or meeting reminder, it vibrates, and I feel it because it’s strapped to my wrist and touches my skin directly. But I am now in a constant struggle: when it comes to craftsmanship, the Pebble can’t compare to my Omega. The Omega Seamaster feels much better and looks far more beautiful.

[Photo: Omega Seamaster]

But I now end up wearing the Pebble every day because it solves a big problem for me. So when it comes to “How should luxury watch brands like Rolex strategically respond to the launch of the Apple Watch?”, I hope they won’t try to wedge crappy, half-baked smartwatch features into their beautiful watches. Instead, I hope they will add two things to their traditional watches:

  1. Sync up with my phone and vibrate when there is a notification. Vibration would be enough; it doesn’t need to display anything. As long as I feel the vibration, I know I should check my phone.
  2. Have health sensors and send data back to my phone, the way Pebble, Fitbit, Jawbone and soon the Apple Watch do. I love seeing how many steps I walked each day, how long I slept last night and how many hours of that were deep sleep.

Well, adding these two things to my Omega Seamaster means adding a battery to it, and it would need to be charged every few days. That’s OK with me. Even if I forget to charge the “smart Omega” and the battery runs out, it’s still a really nice Omega, just like it is today. If the Pebble’s or the Apple Watch’s battery is dead, it becomes a piece of useless metal, or at best a beautiful bracelet.

Bottom line: as Michael Wolfe said, they should keep Rolex watches Rolex-y.

The Versioning Hell of Microservices

Recently in his blog post “Microservices. The good, the bad and the ugly”, Sander Hoogendoorn warned us about the versioning hell of microservices. He wrote:

“Another that worries me is versioning. If it is already hard enough to version a few collaborating applications, what about a hundred applications and components each relying on a bunch of other ones for delivering the required services? Of course, your services are all independent, as the microservices paradigm promises. But it’s only together that your services become the system. How do you avoid to turn your pretty microservices architecture into versioning hell – which seems a direct descendant of DLL hell?”

I wrote almost the same thing in an internal discussion within my team late last year. In my internal memo “Transient States Testing Challenge”, I warned that this problem is emerging: as a component gets split into a few smaller pieces (aka microservices), the testing cost may increase substantially, and we must understand and prepare for it. Here is the full text of the memo (with sensitive details removed):


Before

When it was just one piece, say a service X, there would be no transient state. During an upgrade, we put the new binaries of X onto a dormant secondary. Only when the secondary is fully upgraded do we promote it to primary (by switching DNS, like the VIP Swap deployment on Azure) and then demote the original primary to secondary. That promote/demote is considered instant and atomic.

Looking inside the box, X consists of four pieces: say A, B, C and D. When each of the four teams is developing its own v2, they only need to make sure their v2 code works with the v2 code of the others. For example, team A only needs to test that A2 works with B2, C2 and D2, which is the final state {A2, B2, C2, D2}.

Team A doesn’t need to do integration tests of A2 with B1, C1 and D1, because that combination will never happen.

After

As we split X into smaller pieces, each piece becomes independently deployable. The starting state and the final state remain unchanged, but because there is no way to strictly synchronize the upgrades, the whole service will go through various transient states on the path from the starting state to the final state.

In an overly simplified way (for the convenience of discussing the problem), there are two choices in front of us:

Choice #1: Deploy the four of them only in a fixed order, and only one at a time. For example, A -> C -> D -> B, and the transition path will be:

{A1, B1, C1, D1} -> {A2, B1, C1, D1} -> {A2, B1, C2, D1} -> {A2, B1, C2, D2} -> {A2, B2, C2, D2}

Therefore, in testing, not only do we need to make sure {A2, B2, C2, D2} can work together, we also need to test three additional states:

  1. {A2, B1, C1, D1}
  2. {A2, B1, C2, D1}
  3. {A2, B1, C2, D2}

The number of additional states to test equals N-1 (where N is the number of pieces). The caveat of this approach is that we lose flexibility in the order of deployments. If the C2 deployment is blocked, D2 and B2 are blocked, too. That works against agility.

Choice #2: Put no restriction on the order; any piece can go at any time. That gives us flexibility and helps agility, at the cost of having to do a lot more integration testing to cover more transient states:

  1. {A2, B1, C1, D1}
  2. {A1, B2, C1, D1}
  3. {A1, B1, C2, D1}
  4. {A1, B1, C1, D2}
  5. {A2, B2, C1, D1}
  6. {A2, B1, C2, D1}
  7. {A2, B1, C1, D2}
  8. {A1, B2, C2, D1}
  9. {A1, B2, C1, D2}
  10. {A1, B1, C2, D2}
  11. {A1, B2, C2, D2}
  12. {A2, B1, C2, D2}
  13. {A2, B2, C1, D2}
  14. {A2, B2, C2, D1}

The number of additional states to test equals 2^N - 2 (where N is the number of pieces): N=3 -> 6; N=4 -> 14; N=5 -> 30; …. That gets very costly.
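
To make the math concrete, here is a small illustrative C# sketch that enumerates those transient states. Each of the N pieces is either on v1 or v2, so there are 2^N combinations; excluding the all-v1 starting state and the all-v2 final state leaves 2^N - 2:

using System;
using System.Linq;

class TransientStates
{
    static void Main()
    {
        int n = 4;   // number of independently deployable pieces: A, B, C, D
        var states = Enumerable.Range(0, 1 << n)                // every bitmask is one state
            .Where(mask => mask != 0 && mask != (1 << n) - 1)   // drop all-v1 and all-v2
            .Select(mask => "{" + string.Join(", ",
                Enumerable.Range(0, n).Select(i =>
                    (char)('A' + i) + (((mask >> i) & 1) == 1 ? "2" : "1"))) + "}")
            .ToList();

        states.ForEach(Console.WriteLine);   // e.g. {A2, B1, C1, D1}
        Console.WriteLine(states.Count);     // 14 when n = 4
    }
}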

Possible Optimizations?

We could apply some optimizations. For example:

  • Among the four pieces (A, B, C and D), make some of them orthogonal to each other, to eliminate the need to test some of the transient states.
  • Find a middle ground between following a fixed order and allowing any order. For example, we could say A and B can go in any order between themselves, and C and D can go in any order between themselves, but C and D must not start their upgrades until both A and B are finished. That reduces the number of possible transient states.

But these optimizations only make the explosion of permutations less bad. They don’t change the fundamentals of the challenge: the need to test numerous transient states.

Backward Compatibility Testing?

Another possible way to tackle this is to invest in coding-against-contract and backward compatibility tests of A, B, C and D, so that we can fully eliminate the need to test the transient states. That’s true, but it brings its own costs and risks:

1. Cost

By investing in backward compatibility testing to avoid testing the numerous transient states, we are converting one costly problem into another costly problem. As a big piece splits into smaller ones, the sum of the backward compatibility test costs of all the smaller pieces is going to be significantly more than the original backward compatibility test cost of the single piece.

That’s just plain math. In backward compatibility testing, you are trying to make sure the circle is friendly with its surroundings. When a big circle splits into multiple smaller circles while keeping the total area unchanged, the sum of the circumferences of the smaller circles is going to be much bigger than the circumference of the original big circle.
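
To put a number on it: a circle of area A has circumference 2*sqrt(pi*A), so splitting it into k equal circles with the same total area gives

    k * 2*sqrt(pi*A/k) = sqrt(k) * 2*sqrt(pi*A)

i.e. the total circumference (the surface that has to be kept compatible) grows by a factor of sqrt(k).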

2. Risk

Validating only against contracts can cause us to miss some very bad bugs, mainly because it’s hard to use contracts to capture some small-but-critical details, especially behavioral details.

Testing in Production?

One may suggest that we do neither: don’t test all the permutations of transient states, and don’t do that much backward compatibility testing between microservices. Instead, why don’t we ship into production frequently with controlled, small exposure to customers (aka flighting, first slice, etc.) and do the integration testing there? True, but still, we are converting one costly problem into another costly problem, since testing in production also requires a lot of work (in design and in testing), plus an engineering culture shift.

What’s my recommendation?

No doubt that we should continue to move away from the one big monolithic piece approach. However, when we make that move, we need to keep in mind the transient states testing challenge discussed above and look for a balanced approach and the sweet spot in the new paradigm.

The Great Value of Knowing the Achievability

I have heard people say “Don’t tell me this is achievable. Tell me how to achieve it.” So is it useless to know that something is achievable while not knowing how? Not really. Knowing how will definitely make things more straightforward, but even when you don’t know how, there is still great value in knowing the achievability.

Jared Diamond, in his book “Guns, Germs, and Steel: The Fates of Human Societies”, said that historically there were two ways a society learned skills (e.g. growing wheat, taming cows) from another society in a nearby region:

  1. They learned the skills directly.
  2. They didn’t learn the skills directly, but they learned that certain things were achievable, and then figured out the exact method by themselves.

For example, a society learned from a nearby society that cows are tamable and make good milk. They probably didn’t get to learn the exact method of domesticating cows from the nearby society, due to the language barrier, the nearby society’s unwillingness to share knowledge in order to keep its competitive advantage, or whatever other reason. But having seen that the nearby society had successfully domesticated cows gave them the faith that cows are tamable and the conviction to search for a way to do it. Also, this society wouldn’t waste time trying to tame zebras or antelopes. Jared pointed out that knowing which paths were dead ends and which could go through helped a lot of societies significantly shorten the time it took to advance their development.

Knowing the achievability has great value in software engineering as well.

Speaking from my own experience: in my current group, a full test pass has about 4,000 test cases in total. When I joined, it took more than 24 hours to run and had only an 80%-90% passing rate during most of a release cycle. People were not happy with it, but most of them seemed to think that’s just how it was supposed to be. I told them no, it definitely can be much better. I told them that when I joined my prior group, things were in bad shape, too: it took >24 hours to run 12,000 test cases and similarly had only an 80%-90% passing rate, but later we fixed it: the test duration shortened to sub-day (somewhere around 16-18 hours) and the failure rate dropped to a low single-digit percentage. I kept repeating this to everybody, telling them that since it was achieved in my prior group, we must be able to achieve it here as well. I also told them to have the right expectation of time: in my prior group, it took more than a year to fix, so we should anticipate a similar amount of time here.

Knowing the achievability helped. Knowing approximately how long it would take to get there also helped. My team committed to improving test efficiency and agility and made small and big investments one after another. After about 15 months, we were able to finish the same full test pass in just 7.5 hours. About two years after we started the endeavor, a 98-99% passing rate has become the norm. Had we not known the achievability, my team probably would have hesitated to make the investment, or pulled some resources out in the middle of the endeavor, for not having (yet) seen the light at the end of the tunnel.