In Part 1, I mentioned that there are three monitoring approaches: a) invariants, b) metrics/logs, c) synthetic transactions. Here I'd like to explain them.
Before I do, I want to put them in a broader context: what live site monitoring is about. In my opinion, live site monitoring is about answering two questions:
“Is everything working?” & “Is everything healthy?”
There is a subtle but important difference between working vs. healthy: working = producing expected results (in terms of correctness, latency, …); healthy = being in a good condition (so that it remains working in the future).
A few real-life examples may help illustrate the difference between working vs. healthy:
- A car is driving (working), but water temperature is running high (not healthy, it may break down soon);
- A team is shipping product on time (working) but people are burned out (not healthy), or, people have good work/life balance (healthy) but are not delivering (not working);
- An ATM machine can still spit money (working), but it’s low on cash (not healthy);
- A Web server is serving requests fine (working), but its CPU is at 70% on average (not healthy), or, it’s returning “Internal Server Error” pages (not working) though its CPU is below 10% (healthy), or it’s running on high CPU (not healthy) and not responding to requests (not working).
The three approaches (invariants, metrics/logs, and synthetic transactions) are three different ways to find out the answer to the first question, “is everything working”:
a) Invariants. These are the laws of physics of the system: evaluations that should always be true. When an evaluation is not true, something is wrong somewhere. For example:
- It should always be true that the current balance equals the previous balance minus all spending since then plus all income since then. If those numbers don’t add up, something is wrong or missing.
- If 200 is the hard ceiling on the number of virtual machines each account can have, and some report says an account has more than 200 virtual machines, something must be wrong.
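Both examples above boil down to a check that should never fail. A minimal sketch in Python (the record formats and function names here are hypothetical, for illustration only):

```python
def check_balance_invariant(prev_balance, spending, income, current_balance):
    """Invariant: current balance = previous balance - all spending + all income."""
    expected = prev_balance - sum(spending) + sum(income)
    return abs(expected - current_balance) < 0.005  # tolerate cent rounding

def check_vm_ceiling(vm_counts_by_account, ceiling=200):
    """Invariant: no account may exceed the hard VM ceiling.
    Returns the list of accounts that violate it."""
    return [acct for acct, n in vm_counts_by_account.items() if n > ceiling]

# Usage: 100 - (30 + 20) + 10 == 60, so the invariant holds.
assert check_balance_invariant(100.0, [30.0, 20.0], [10.0], 60.0)
# An account reported with 250 VMs violates the 200 ceiling.
assert check_vm_ceiling({"acct-1": 12, "acct-2": 250}) == ["acct-2"]
```

The key property is that an invariant check needs no historical baseline: a single violating record is enough to know something is wrong.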
b) Metrics & Logs. Not just the trace and event logs, but also, more importantly, the aggregated data. The log of every transaction can be aggregated at various levels along different dimensions (per region, per OS type, per minute/hour/day/week/month/quarter, …). Then we can analyze the aggregates to catch anomalies and outliers. For example:
- The API foo usually has a 0.05% failure rate, but in the last 5 minutes its failure rate was above 1%.
- In the last several weeks, the mean time to provision a new virtual machine was 4 minutes and the 99th percentile was 15 minutes. But in the last several hours, the mean time to provision an Extra Large Windows Server 2012 R2 virtual machine jumped to more than 20 minutes, although the overall mean time remained unchanged.
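The second example is worth spelling out: a regression in one slice can hide behind an unchanged overall mean, so the aggregation has to keep the per-dimension slices. A sketch, assuming a hypothetical record format of (size, OS type, minutes):

```python
from collections import defaultdict
from statistics import mean, quantiles

def provision_time_stats(records):
    """records: iterable of (vm_size, os_type, minutes) tuples.
    Aggregate per (size, OS) slice and overall, so a regression in one
    slice stays visible even when the overall numbers look fine."""
    by_dim = defaultdict(list)
    for size, os_type, minutes in records:
        by_dim[(size, os_type)].append(minutes)
        by_dim[("*", "*")].append(minutes)  # the overall aggregate
    stats = {}
    for dim, times in by_dim.items():
        # quantiles() needs at least two data points
        p99 = quantiles(times, n=100)[98] if len(times) >= 2 else times[0]
        stats[dim] = {"mean": mean(times), "p99": p99}
    return stats
```

An alerting rule would then compare each slice's current mean/p99 against that slice's own historical baseline, not just the global one.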
The approach of detecting anomalies and outliers based on aggregated logs has some limitations:
- It won’t help much when there isn’t enough data. Low data volume can have various causes: in a new instance that hasn’t been opened up to all customers, traffic is low and there isn’t much historical data to establish a benchmark; some APIs are low volume by nature, with only a few hundred calls per day; etc.
- It may not tell the full picture of how the system is perceived from outside. For example, when API calls are blocked/dropped before they reach our service, we won’t have any log of those calls. (Note, however, that a sudden drop in API call volume is an anomaly that can be caught through log analysis.)
c) Synthetic Transactions. This is the most straightforward one: if the bank wants to know whether an ATM machine is working, it can just send someone to withdraw $20 from that machine.
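In software terms, a synthetic transaction is just a real end-to-end operation issued by the monitoring system itself, with the result checked and failures reported. A minimal driver might look like this (the probe names and failure handler are hypothetical):

```python
def run_probes(probes, on_failure):
    """Run each named synthetic transaction once and report failures.
    probes: {name: zero-argument callable returning True on success}.
    on_failure: callable invoked with the name of each failed probe."""
    results = {}
    for name, probe in probes.items():
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # an exception in the probe counts as a failure
        results[name] = ok
        if not ok:
            on_failure(name)
    return results

# Usage: each probe performs a real transaction, e.g. a withdrawal or an API call.
failed = []
run_probes({"atm-1": lambda: True, "atm-2": lambda: False}, failed.append)
# failed is now ["atm-2"]
```

In practice such a driver would run on a schedule and from multiple locations, which is exactly where the limitations below start to bite.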
The synthetic transactions approach has some limitations, too:
- Representativeness. Even if the bank employee can withdraw $20 from that ATM machine, it doesn’t guarantee that every customer will be able to.
- Cost. If the bank uses this approach to monitor whether all of its ATM machines (possibly thousands) are working, it has to send people to withdraw $20 from every ATM machine every hour. That would be a huge amount of labor cost, plus millions of dollars withdrawn by bank employees that must then be returned to the bank.
- Blind spots, as explained in Part 1.
In the case of monitoring ATM machines, the approach of detecting anomalies/outliers based on aggregated logs will be much more effective. As long as the log shows that in the last couple of hours customers have been able to successfully withdraw money from an ATM machine, we know the machine is working. On the other hand, for a busy ATM machine (e.g. one on the first floor of a busy mall), if the log shows no money withdrawn between 2pm and 4pm on a Saturday afternoon, the bank has reason to believe the machine may not be working and had better send somebody over to take a look.
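The “silence during a busy window” heuristic can be sketched as a simple rule. This is an assumption-laden illustration: it presumes we can derive each ATM's expected transaction rate for a given hour of the week from past logs.

```python
def silence_is_suspicious(minutes_since_last_txn, expected_txns_per_hour, factor=5):
    """Flag an ATM when the quiet gap is far longer than its historical
    traffic makes plausible. expected_txns_per_hour is the rate observed
    in the same hour-of-week in past logs (hypothetical input)."""
    if expected_txns_per_hour <= 0:
        return False  # no history to judge against
    expected_gap_minutes = 60.0 / expected_txns_per_hour
    return minutes_since_last_txn > factor * expected_gap_minutes
```

For the busy mall ATM (say, 12 withdrawals per hour historically, i.e. one every 5 minutes), two hours of silence is far beyond 5x the expected gap and triggers the alert; for a quiet rural ATM averaging one withdrawal every two hours, the same two-hour silence is perfectly normal.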
To be continued … (Part 3)