Monitoring the four golden signals
Have you ever driven a car that had no fuel gauge? Such cars were actually quite common about 30 years ago. There was less monitoring equipment in our treasured driving machines in those days, and we used to check the fuel, oil and water levels before setting out. Drivers also found it useful to keep an extra gallon of fuel in the car. Nowadays we have access to all kinds of assistance if we find ourselves in a critical situation.
Business applications behave in a similar way to cars. If we ignore the indications of problems, the applications stop doing their work. Modern IT services slow down when too many errors have to be processed, and the slowness can even make them completely unavailable to customers. When something isn’t working properly, we tend to try superficial solutions such as restarting the machines, services or processes. These habits can become a daily routine that sees operational teams jumping from one restart to the next.
The consequence of such a short-lived fix-by-restart approach is that skilled engineers end up spending most of their time on dull routine tasks. You lose valuable time like this and have no chance to use your brains for the more challenging optimizations and improvements that you and the business will find more rewarding. And as you might expect, such problematic services will only get worse when your business grows. Then you’ll need even more engineers just for maintenance. This is a real scalability drama. Your operational costs will eat up a big proportion of your revenue as a result.
How to overcome such nightmares
Contemporary cars offer early indicators for all kinds of potential problems, along with safety features such as the anti-lock braking system. It can save our lives if we don’t drive carefully enough or slam on the brakes too hard. Imagine driving without all those safety features. I’m convinced that driving would be much more dangerous and we would have a lot more fatal accidents.
But how does all this talk about cars relate to software and business applications? We’re a lot more careless with those all-important services for one thing. I believe this “devil-may-care” attitude is due to the fact that software problems, including slowdowns and crashes, don’t impact our lives immediately, and so we often operate business applications when they’re in a critical state.
If a serious issue occurs in our IT services, our customers will know about it first. They’ll send us their complaints, and depending on the type of service, we’ll see an increase in abandonment rates as well as revenue losses. Once we start losing our clients, it’ll be very difficult to return to the previous business levels.
When a certain issue continues to impact us, we’ll eventually start to do something about it. At first though, we tend to ignore certain kinds of problems because we don’t want to face them. This means we’re likely to restart the problematic IT application instead of investigating the error conditions and addressing the causes. The regular workarounds needed in this situation block scalability and are also very time-consuming. Unless you identify and resolve the root causes, your business service will soon be on the waiting list for retirement.
The best approach is to detect issues as early as possible and, rather than settling for short-term fixes, to work on implementing a permanent solution. Take effective mitigation measures to prevent the issue from recurring.
With this proactive mindset in place and once your team is fully aligned, you can monitor for the golden signals (outlined below) in all your applications. As long as these signals are within the agreed boundaries, your engines will be running well.
Monitoring can be an overwhelming task because thousands of metrics are available nowadays and you can easily focus on the wrong set. These four golden signals in your business applications are the indicators that need the closest attention:
Latency, also known as the response time, refers to the time it takes to service a request. This is the most powerful performance metric, I believe, because it shows us right away whether our applications are in good shape to meet or exceed our users’ expectations. So long as the latency is within the agreed boundaries, no major action is required.
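As a rough illustration of checking latency against an agreed boundary, the following Python sketch computes a nearest-rank percentile over a list of response times. The function names, the sample values and the 300 ms boundary are assumptions made for this example and are not taken from any particular system.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_within_boundary(samples_ms, boundary_ms, pct=95):
    """True if the chosen percentile of the response times stays within the boundary."""
    return percentile(samples_ms, pct) <= boundary_ms


# Hypothetical response times in milliseconds, with one slow outlier.
response_times_ms = [120, 135, 110, 480, 125, 140, 118, 131, 127, 122]

median_ms = percentile(response_times_ms, 50)
p95_ms = percentile(response_times_ms, 95)
ok = latency_within_boundary(response_times_ms, boundary_ms=300)
```

A high percentile is used here instead of the average because a single slow request barely moves the mean, yet it is exactly what an unlucky user experiences.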
Throughput is the demand being placed on your system. You’ll find it helpful to know the expected number of requests your IT services receive. Don’t focus only on the throughput for an average business day though. Depending on the nature of your business, your system could encounter peak conditions at any time. These peaks could be as much as 10 times higher than usual. What’s more, extreme events can even bring 100 times more transactions to a business’s IT services. Consider how certain weather conditions impact airline operations, for example. A snowstorm can lead to airport closures, hundreds of canceled flights and thousands of rescheduled flights within a very short time.
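To make the difference between average and peak throughput concrete, here is a small Python sketch that buckets request timestamps into one-minute intervals and reports how far the busiest interval sits above the average. The helper names and the timestamps are hypothetical, and the average is taken over intervals that saw at least one request.

```python
from collections import Counter


def requests_per_interval(timestamps_s, interval_s=60):
    """Count requests per fixed-size time interval (keyed by interval index)."""
    return dict(Counter(int(ts // interval_s) for ts in timestamps_s))


def peak_to_average(counts):
    """Ratio of the busiest interval to the average over all non-empty intervals."""
    values = list(counts.values())
    return max(values) / (sum(values) / len(values))


# Hypothetical request arrival times in seconds since the start of measurement.
timestamps = [0, 10, 20, 65, 70, 75, 80, 90, 100, 110, 130, 140]

counts = requests_per_interval(timestamps)
ratio = peak_to_average(counts)
```

In this toy data the busiest minute carries 1.75 times the average load. The same ratio computed over real traffic shows how much headroom a system sized for an average day would need.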
The rate of failed requests is known as errors. Although latency and throughput may be within the agreed boundaries, errors can still occur, and your customers won’t be happy with the quality of your services. A customer might add products to the basket and then see all the items disappear during the checkout process. The chances are that, due to a backend error, you’ll have lost a paying client. Erroneous services are clearly a serious problem that requires immediate action.
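A minimal sketch of an error-rate calculation follows. It assumes the service speaks HTTP and that any 5xx status code counts as a failed request. Both are assumptions for illustration, since what counts as a failure depends on the service.

```python
def error_rate(status_codes):
    """Fraction of requests that failed, treating HTTP 5xx as failures (an assumption)."""
    if not status_codes:
        return 0.0
    failed = sum(1 for code in status_codes if code >= 500)
    return failed / len(status_codes)


# Hypothetical status codes from ten checkout requests.
checkout_statuses = [200, 200, 500, 200, 503, 200, 200, 200, 200, 200]

rate = error_rate(checkout_statuses)
```

Here two of ten requests failed, giving an error rate of 20 percent even though the successful requests may have been fast.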
Saturation measures how full your service is. System resource utilization such as CPU usage, memory usage, disk space and IOPS (input/output operations per second) should never be ignored. Even though your response times may be excellent, if your servers reach saturation point, the latency can spike immediately. In the worst-case scenario, the entire application can become unavailable. To avoid such an unwelcome event, you need to set realistic warning indicators and keep an eye on the utilization of all servers. Your capacity management tasks should include regular reviews of this metric.
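Warning indicators of this kind could be sketched as follows. The 70 percent warning level and the 90 percent critical level are placeholder values chosen for the example, and the resource names are hypothetical. Realistic levels have to come from your own capacity reviews.

```python
def classify_utilization(percent, warning=70.0, critical=90.0):
    """Map a utilization percentage to 'ok', 'warning' or 'critical'."""
    if percent >= critical:
        return "critical"
    if percent >= warning:
        return "warning"
    return "ok"


def resources_needing_attention(utilization, warning=70.0, critical=90.0):
    """Return only the resources whose utilization is at warning level or above."""
    result = {}
    for name, percent in utilization.items():
        level = classify_utilization(percent, warning, critical)
        if level != "ok":
            result[name] = level
    return result


# Hypothetical utilization snapshot for one server, in percent.
server = {"cpu": 45.0, "memory": 78.5, "disk": 93.0, "iops": 60.0}

attention = resources_needing_attention(server)
```

Separating a warning level from a critical level leaves time to act before the saturation point is reached.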
Collecting information from the four golden signals is an excellent starting point, but who should use those metrics and for what purpose?
Chart the four golden signals in diagrams for 3-month and 6-month periods, and then compare them to find out what has recently changed. Is anything rising or falling? Are there any correlations? What conclusions can you draw?
If the latency and CPU utilization have both risen within the last 6 months, for instance, you should take some preventive measures such as code tuning or hardware upgrades.
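One simple way to spot such a combined rise is to compare the mean of a recent period with the mean of the period before it. The sketch below does this for latency and CPU utilization. The 10 percent minimum change and all sample values are assumptions for illustration only.

```python
from statistics import mean


def relative_change(previous, recent):
    """Relative change of the mean from the previous period to the recent one."""
    return (mean(recent) - mean(previous)) / mean(previous)


def both_rising(latency_prev, latency_recent, cpu_prev, cpu_recent, min_change=0.10):
    """True if latency and CPU utilization both rose by at least min_change."""
    return (
        relative_change(latency_prev, latency_recent) >= min_change
        and relative_change(cpu_prev, cpu_recent) >= min_change
    )


# Hypothetical monthly averages: previous quarter versus most recent quarter.
latency_prev_ms, latency_recent_ms = [200, 210, 190], [240, 250, 230]
cpu_prev_pct, cpu_recent_pct = [40, 42, 38], [50, 52, 48]

latency_change = relative_change(latency_prev_ms, latency_recent_ms)
cpu_change = relative_change(cpu_prev_pct, cpu_recent_pct)
needs_action = both_rising(latency_prev_ms, latency_recent_ms, cpu_prev_pct, cpu_recent_pct)
```

A comparison of means is deliberately crude. It only flags that something has changed, and the charts are still needed to judge why.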
Spotting emerging issues before your customers are impacted is key. Agree on meaningful thresholds and service-level objectives, and then use them for sending out problem notifications to your engineers.
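How agreed thresholds might turn into notifications can be sketched in a few lines of Python. The signal names and limits below are hypothetical, and the messages are only collected in a list. Delivering them to engineers by e-mail, chat or pager is left out.

```python
def find_breaches(current, limits):
    """Return one message per signal whose current value exceeds its agreed limit."""
    messages = []
    for signal, limit in sorted(limits.items()):
        value = current.get(signal)
        # Signals without a current reading are skipped rather than reported.
        if value is not None and value > limit:
            messages.append(f"{signal}: {value} exceeds limit {limit}")
    return messages


# Hypothetical agreed limits and a current reading for the four signals.
limits = {"latency_ms_p95": 300, "error_rate": 0.01, "cpu_percent": 85, "requests_per_s": 500}
current = {"latency_ms_p95": 340, "error_rate": 0.004, "cpu_percent": 91, "requests_per_s": 220}

breaches = find_breaches(current, limits)
```

With these values, two notifications would go out: one for CPU utilization and one for latency.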
Do you know if your applications are faster than the industry average? What is the impact of a hardware change on your end-to-end latency? Such questions are easy to answer when you monitor the four golden signals.
Collecting metrics is still not sufficient. You should share the good news and bring it to the attention of all your employees and customers. Dashboards act as excellent information radiators. Present the current and trending charts for your golden signals to your staff, and help them understand the valuable services your business applications deliver and how they deliver them.
It’s equally important to derive the steps required from the results of your continuous monitoring and analysis. Also remember that metrics mean nothing if you don’t act on them. Contact me or my team to learn how a forward-thinking monitoring strategy reduces operational risks and brings back the freedom you deserve.