A few weeks ago, I read a blog post on StackOverflow that asked, "What is the minimum information we need for monitoring?" It's a question you do not hear every day, and one that often can't be answered immediately.
After studying this topic, I concluded that I would start by asking the following three questions:
What are the overall goals involved?
What is the technology and architecture of the system under monitoring?
Who is going to use the monitoring platform?
When it comes to the goal of monitoring, it's often about reducing MTTR, identifying issues before our customers are impacted, or ensuring that applications and infrastructure are healthy.
The question about the technology stack involved is critical. There are many monitoring solutions, but not all of them will help in your particular situation. For example, transaction tracing products are powerful but work only for specific bytecode-based systems.
Who will use the monitoring platform? Is it only for your internal teams, to support problem analysis? Do they also need reporting capabilities for business teams? Getting clarity about the future user community of your monitoring suite can save you a lot of trouble.
Minimum viable information when finding and fixing issues?
Monitoring, or "Observability" as we call it these days, is not about creating alarms but understanding and showing insights into how your systems behave and the bottlenecks involved. Ideally, we get these insights in real time and have self-repair capabilities built in.
From my perspective, we are no longer in a position to rely on logs or infrastructure monitoring alone. For instance, if we see an error message in a specific log record, we can't conclude whether end users will be impacted or whether remediation actions will be required. Of course, this depends on the three questions above, but for classic business applications, your best chance to find and fix issues is a holistic approach. Development teams should not simply focus on developing business features but should likewise produce observable applications. We need traces, logs, and metrics as a starting point, plus advanced analysis and reporting features.
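To make "observable applications" a little more concrete, here is a minimal sketch of a business function that emits all three signals at once. It uses only the Python standard library, and everything in it is an assumption for illustration: the `observed` helper, the in-memory `METRICS` and `SPANS` stores, and the `place_order` function are hypothetical stand-ins for whatever tracing and metrics backend you actually use.

```python
import json
import logging
import time
import uuid
from collections import defaultdict
from contextlib import contextmanager

logger = logging.getLogger("orders")

# Hypothetical in-memory stand-ins for a real metrics backend and trace exporter.
METRICS = defaultdict(list)
SPANS = []


@contextmanager
def observed(operation):
    """Record a trace span, two metrics, and a structured log for one operation."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    status = "ok"
    try:
        yield trace_id
    except Exception:
        status = "error"
        raise  # re-raise so the caller still sees the failure
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        record = {
            "trace_id": trace_id,
            "operation": operation,
            "status": status,
            "duration_ms": duration_ms,
        }
        SPANS.append(record)                                      # trace
        METRICS[f"{operation}.duration_ms"].append(duration_ms)  # metric: latency
        METRICS[f"{operation}.{status}"].append(1)               # metric: outcome count
        level = logging.INFO if status == "ok" else logging.ERROR
        logger.log(level, json.dumps(record))                    # structured log


def place_order(order_id):
    """Hypothetical business function wrapped in the observability helper."""
    with observed("place_order") as trace_id:
        if order_id < 0:
            raise ValueError("invalid order id")
        return trace_id
```

Because the log record, the span, and the metrics share the same operation name and trace ID, an error message in the logs can be tied back to a specific request and to the error rate users actually experienced, which is the kind of correlation a single signal cannot give you.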
Why do these questions matter?
Implementing a proactive monitoring approach is not easy. Nevertheless, observability is in high demand in every organization, because no organization can provide high-quality services without proactive problem resolution. Invest too much in the wrong solutions, and profitability goes down. Invest too little in observability, and you increase the chances of a bad user experience, a loss of reputation, and a decline in sales.
As with any solution you build, keeping quality, time, and costs balanced is essential.
So please don't make the mistake of ignoring one of them. Even the most cost-attractive monitoring suite can end up with a high price tag due to overspending on implementation, integration, and maintenance, and due to production issues that go undetected.
You can certainly rely on homegrown or open-source products for some basic monitoring needs. The more advanced your goals become, the more you will realize that reinventing the wheel does not make sense and that you could save a lot of money by using specialized products.
Keep up the great work! Happy Performance Engineering!
Link to the post on StackOverflow