Have you ever driven a car that had no low-fuel warning? Such vehicles were quite common about 30 years ago. Our treasured driving machines carried far less monitoring equipment in those days, so we checked the fuel, oil, and water levels before setting out, and many drivers found it helpful to keep an extra gallon of fuel in the car. Nowadays, we can access all kinds of assistance if we find ourselves in a critical situation.
Business applications behave much like cars. If we ignore the warning signs, they stop doing their work. Modern IT services slow down when they have to process too many errors, and that slowness can make them entirely unavailable to customers. When something isn't working correctly, we reach for superficial fixes, such as restarting the machines, services, or processes. These habits can become a daily routine that sees operational teams jumping from one restart to the next.
The consequence of such a short-lived fix-by-restart approach is that skilled engineers spend most of their time on dull, tedious, routine tasks. You lose valuable time this way and cannot apply your best minds to the more challenging optimizations and improvements that you and the business would find far more rewarding. As you might expect, such problematic services only get worse as your business grows, and you'll then need even more engineers just for maintenance. This is a genuine scalability trap: your operational costs will eat up a significant proportion of your revenue as a result.
How do you overcome such nightmares?
Contemporary cars offer early indicators for all kinds of potential problems. The anti-lock braking system, for instance, is one such safety feature. It can save our lives if we don’t drive carefully enough or slam down the brakes too hard. Imagine going without all those safety features. I’m convinced that driving would be much more dangerous, and we would have a lot more deadly accidents.
But how does all this talk about cars relate to software and business applications? For one thing, we’re a lot more careless with those all-important services. I believe this “devil-may-care” attitude is because software problems, including slowdowns and crashes, don’t impact our lives immediately, so we often operate business applications when they’re in a critical state.
If a severe issue occurs in our IT services, our customers will know about it first. They'll send us their complaints, and depending on the type of service, we'll see rising abandonment rates and revenue losses. Once we start losing our clients, returning to previous business levels will be very difficult.
When a particular issue keeps impacting us, we'll eventually start to do something about it. At first, though, we tend to ignore specific problems because we don't want to face them, so we treat the symptoms of the problematic IT application instead of investigating the error conditions and their causes. The routine workarounds that result are time-consuming and block scalability. Unless you identify and resolve the root causes, your business service will soon be on the waiting list for retirement.
The best approach is to detect issues as early as possible. Don't settle for short-term fixes; work on implementing a permanent solution, and take decisive mitigation measures to prevent the issue from recurring.
With this proactive mindset, once your team is fully aligned, you can monitor all your applications for the golden signals (outlined below). Your engines will run well if these signals are within the agreed boundaries.
Monitoring can be overwhelming because thousands of metrics are available nowadays, and you can easily focus on the wrong set of metrics.
These four golden signals are the indicators that need the closest attention:
Latency, also known as response time, is the time it takes to service a request. It is the most potent performance metric because it shows whether our applications are in good enough shape to meet or exceed our users' expectations. As long as latency stays within the agreed boundaries, no significant action is required.
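Because a handful of slow requests can hide behind a healthy average, latency is usually judged by a high percentile rather than the mean. The sketch below checks a 95th-percentile latency against an agreed boundary; the 500 ms threshold and the sample durations are hypothetical, and real systems would read these values from a monitoring backend.

```python
import math

# Hypothetical agreed boundary for p95 latency, in milliseconds.
LATENCY_SLO_MS = 500

def p95_latency(durations_ms):
    """Return the 95th-percentile latency using the nearest-rank method."""
    ordered = sorted(durations_ms)
    rank = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank index
    return ordered[rank]

# Illustrative request durations collected over a measurement window.
samples = [120, 95, 340, 210, 480, 150, 610, 175, 220, 130]
p95 = p95_latency(samples)
print(p95, p95 <= LATENCY_SLO_MS)  # 610 False -> the boundary is breached
```

The single 610 ms outlier pushes the p95 over the boundary even though most requests are fast, which is exactly why percentiles beat averages for this signal.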
Traffic, or throughput, is the demand being placed on your system. You'll find it helpful to know the number of requests your IT services are expected to handle. Don't focus only on the throughput of an average business day, though. Depending on the nature of your business, your system could encounter peak conditions at any time, and those peaks could be as much as ten times higher than usual.
Moreover, extreme events can even bring 100 times more transactions to a business’s IT services. Consider how specific weather conditions impact airline operations, for example. A snowstorm, for instance, can lead to airport closures, hundreds of canceled flights, and thousands of rescheduled flights within a very short time.
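A quick way to reason about those peaks is to derive the average request rate from a daily total and multiply it by a peak factor. This is a minimal sketch; the daily volume and the 10x factor are illustrative assumptions, not benchmarks.

```python
SECONDS_PER_DAY = 86_400

def required_capacity(requests_per_day, peak_factor=10):
    """Return (average RPS, RPS to provision for at the given peak factor)."""
    avg_rps = requests_per_day / SECONDS_PER_DAY
    return avg_rps, avg_rps * peak_factor

# Hypothetical service handling ~8.6 million requests per day.
avg, peak = required_capacity(8_640_000)
print(avg, peak)  # 100.0 average RPS, 1000.0 RPS at a 10x peak
```

For the extreme events described above, the same calculation with a peak factor of 100 shows why capacity planning against the average alone is dangerous.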
Errors are the rate of failed requests. Even when latency and throughput are within the agreed boundaries, errors can still occur, and your customers won't be happy with the quality of service. A customer might add products to the basket and then see all the items disappear during checkout; the chances are that you've lost a paying client to a backend error. Erroneous services are a severe problem that requires immediate action.
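The error signal is typically tracked as the share of failed requests over a window and compared against an error budget. The sketch below assumes a hypothetical 1% budget; the counts are illustrative.

```python
# Hypothetical error budget: at most 1% of requests may fail.
ERROR_BUDGET = 0.01

def error_rate(failed, total):
    """Return the fraction of failed requests (0.0 when there is no traffic)."""
    return failed / total if total else 0.0

rate = error_rate(failed=37, total=10_000)
print(rate, rate <= ERROR_BUDGET)  # 0.0037 True -> within budget
```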
Saturation measures how full your service is. System resource utilization, such as CPU usage, memory usage, disk space, and IOPS (input/output operations per second), should never be ignored. Even if your response times are currently excellent, latency can spike the moment your servers reach saturation point; in the worst case, the entire application becomes unavailable. To avoid such an unwelcome event, set realistic warning thresholds and keep an eye on the utilization of all servers. Your capacity management tasks should include regular reviews of this metric.
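A warning threshold for saturation can be as simple as flagging every resource whose utilization crosses an agreed level. In this sketch, the 80% threshold and the per-server readings are hypothetical examples.

```python
# Hypothetical warning threshold: flag resources at or above 80% utilization.
WARN_THRESHOLD = 0.80

def saturated_resources(utilization):
    """Return the resources whose utilization is at or above the threshold."""
    return {name: value for name, value in utilization.items()
            if value >= WARN_THRESHOLD}

# Illustrative utilization readings for one server (fractions of capacity).
server = {"cpu": 0.55, "memory": 0.91, "disk": 0.40, "iops": 0.83}
print(saturated_resources(server))  # {'memory': 0.91, 'iops': 0.83}
```

Memory and IOPS would trigger a warning here long before users notice anything, which is the whole point of watching saturation.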
Collecting information from the four golden signals is an excellent starting point, but who should use those metrics, and for what purpose?
Chart the four golden signals in diagrams for 3-month and 6-month periods, then compare them to find out what has recently changed. Is anything rising or falling? Are there any correlations? What conclusions can you draw?
For instance, if the latency and CPU utilization have risen within the last six months, you should take preventive measures such as code tuning or hardware upgrades.
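The comparison between periods boils down to a relative change per signal. This minimal sketch assumes you already have the period averages; the 180 ms and 225 ms figures are made-up examples.

```python
def trend(older_avg, recent_avg):
    """Return the relative change between two period averages."""
    return (recent_avg - older_avg) / older_avg

# Hypothetical p95 latency averages for the older and the recent period (ms).
latency_change = trend(older_avg=180.0, recent_avg=225.0)
print(f"{latency_change:+.0%}")  # +25% -> latency is rising
```

A sustained rise like this, especially when it correlates with CPU utilization, is the cue for preventive code tuning or hardware upgrades before any threshold is breached.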
Spotting emerging issues before your customers are impacted is critical. Agree on meaningful thresholds and service-level objectives, then use them to send problem notifications to your engineers.
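Once thresholds are agreed, the alerting logic is a straightforward comparison of current measurements against them. Every threshold and reading below is a hypothetical example; in practice, these live in an alerting system such as your monitoring platform's rule engine.

```python
# Hypothetical agreed thresholds for the four golden signals.
THRESHOLDS = {
    "latency_ms_p95": 500,   # upper bound on p95 latency
    "error_rate": 0.01,      # upper bound on failed-request share
    "saturation": 0.80,      # upper bound on resource utilization
    "traffic_rps": 1_000,    # capacity ceiling for request rate
}

def breached_signals(measurements):
    """Return the signals whose measured value exceeds its threshold."""
    return [name for name, value in measurements.items()
            if value > THRESHOLDS[name]]

# Illustrative current readings.
current = {"latency_ms_p95": 620, "error_rate": 0.004,
           "saturation": 0.85, "traffic_rps": 740}
print(breached_signals(current))  # ['latency_ms_p95', 'saturation']
```

Each breached signal would then be routed to the engineers as a notification, ideally before the customer feels anything.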
Do you know whether your applications are faster than the industry average? What is the impact of a hardware change on your end-to-end latency? Such questions are easy to answer when you monitor the four golden signals.
Collecting metrics is still not sufficient. You should share the good news and bring it to the attention of all your employees and customers. Dashboards act as excellent information radiators. Present the current and trending charts for your golden signals to your staff, and let them understand the valuable services business applications deliver and how they deliver them.
It's equally essential to derive the required next steps from your continuous monitoring and analysis. Remember that metrics mean nothing if you don't act on them.
Contact me or my team to learn how a forward-thinking monitoring strategy reduces operational risks and brings back the freedom you deserve.
Happy Performance Engineering!