Online banking, trading, and core banking platforms are the heart of financial institutions, and they must fulfill the most stringent reliability requirements. Regulatory agencies such as MAS or FINMA closely monitor how financial institutions build, test, and operate their business-critical services. In some jurisdictions, the regulator must be informed no later than one hour after a critical service goes down.
A state-of-the-art monitoring stack helps you detect and escalate critical outages automatically. The availability of your entire application lives and dies with the reliability of every single component.
To cut a long story short, the significant elements of a successful monitoring approach are:
# Active monitoring
Active monitoring mimics the actions of a real user and lets you see your entire application through the eyes of your customers. You select a set of top business processes and automate them with a synthetic monitoring platform. The resulting monitoring scripts are executed on machines close to your users' locations, and an execution schedule manages the testing frequency. If an error occurs, a notification is sent to your support teams so they have enough time to take corrective measures. The benefit is that you are in the driver's seat and able to detect outages even when no customer is using your applications.
Some standard active monitoring suites are:
- UserView, BrowserView, and ServerView from Dotcom-Monitor
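The core of such a synthetic check can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a feature of any of the suites above: the probed URL and the idea of a single HTTP probe are stand-ins, while real platforms script complete user journeys such as a login followed by a payment.

```python
# Minimal sketch of one active (synthetic) probe. A scheduler on
# execution machines near the users' locations would call this
# periodically and notify support on failure.
import time
import urllib.request
import urllib.error

def run_synthetic_check(url: str, timeout: float = 5.0) -> dict:
    """Perform one synthetic probe and return status plus response time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except (urllib.error.URLError, OSError):
        ok = False  # unreachable, DNS failure, or timeout counts as down
    elapsed_ms = (time.monotonic() - start) * 1000
    return {"url": url, "ok": ok, "response_ms": round(elapsed_ms, 1)}

# Usage (hypothetical endpoint):
#   result = run_synthetic_check("https://example.com/login")
#   if not result["ok"]:
#       notify_support(result)   # alerting hook, illustrative only
```

Because the probe runs on a schedule rather than waiting for customers, an outage is detected even at 3 a.m. when nobody is logged in.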
# Passive monitoring
Simulating the most important use cases in production, so-called active monitoring, is good for checking the availability and accuracy of your core services. However, real users' permissions, data constraints, and the runtime behavior of a business application can still have a high impact on core applications. Passive monitoring helps you gain the required transparency.
The term "passive" monitoring already indicates that this approach acts as a listener; no robot-based simulation of user actions is involved. In contrast to active monitoring, passive monitoring depends on real customer requests and collects all user transactions, including input parameters, request details, and many other application-specific details. This type of monitoring requires a monitoring agent installed on your application servers that collects all relevant information.
Everything from application performance, transaction, user experience, and system resource monitoring to log file monitoring falls under the passive monitoring umbrella. Such passively collected metrics are helpful for error analysis, continuous optimization, and capacity planning because they tell you exactly how many users were affected and which component was the root cause of a specific issue.
Some standard passive monitoring tools are:
- APM and UEM: Dynatrace, New Relic, Datadog
- Infrastructure: Nagios, Tivoli
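To make the listener idea concrete, here is a hedged sketch of the kind of per-transaction record a passive agent builds up. All names are illustrative; commercial agents such as those listed above hook into the runtime automatically instead of requiring explicit wrapping.

```python
# Sketch of a passive collector: it only listens to real user activity,
# recording timing and outcome of each business call.
import time
from dataclasses import dataclass, field

@dataclass
class Transaction:
    user: str
    action: str
    duration_ms: float
    failed: bool

@dataclass
class PassiveCollector:
    transactions: list = field(default_factory=list)

    def record(self, user, action, fn, *args, **kwargs):
        """Wrap a business call and capture timing plus success/failure."""
        start = time.monotonic()
        failed = False
        try:
            return fn(*args, **kwargs)
        except Exception:
            failed = True
            raise
        finally:
            self.transactions.append(Transaction(
                user, action, (time.monotonic() - start) * 1000, failed))

    def affected_users(self):
        """Exactly which users were hit by a failing component."""
        return {t.user for t in self.transactions if t.failed}
```

Aggregating these records is what lets passive monitoring answer the questions active probes cannot: how many real users were affected, and by which action.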
# Visualization of key performance metrics
Automated incident alerting is an excellent and valuable feature, but in many cases it's still not sufficient. A visualization of the current health status is recommended as well. Many teams use monitoring screens that chart the core performance and health figures of their critical services. The benefit of this approach is that your teams can feel safe as long as the important traffic lights are green. Charted key performance indicators can also be shared with business and management teams, and it's easy to re-use those dashboards for internal reporting purposes.
As a good practice, the following KPI visualizations could be used:
- Create combined or dedicated performance cockpits
- Add traffic lights to visualize current and historic health metrics
- Add real user stats such as action time, requests executed, and failure rate
- Add active monitoring statuses such as availability and accuracy
- Add host health statuses such as CPU, memory, network, and disk utilization
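The traffic-light idea from the list above boils down to mapping each KPI onto thresholds. The sketch below shows one way to do that; the threshold values and metric names are assumptions for illustration, and a real cockpit would also chart the values over time.

```python
# Hedged sketch: mapping raw KPI values to dashboard traffic lights.
def traffic_light(value: float, warn: float, crit: float) -> str:
    """Return green/yellow/red for a 'lower is better' metric."""
    if value >= crit:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"

def cockpit_row(metrics: dict) -> dict:
    """Build one dashboard row from raw KPI values (units assumed)."""
    return {
        "action_time": traffic_light(metrics["action_time_ms"], 500, 2000),
        "failure_rate": traffic_light(metrics["failure_rate_pct"], 1.0, 5.0),
        "cpu": traffic_light(metrics["cpu_pct"], 70, 90),
    }
```

A row of green lights is exactly the at-a-glance reassurance the monitoring screens are meant to provide.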
# Problem Alerting
Software errors, hardware defects, or human failures can impact the performance or uptime of your business services. Unfortunately, you can't avoid such problem spots altogether. Full-stack monitoring helps you detect such issues early and escalate them to the teams in charge: if pre-configured thresholds are breached, an e-mail, SMS, or trouble ticket is sent out. This alerting practice is beneficial because your teams can't watch their monitoring screens 24/7.
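The threshold check behind such alerts can be sketched in a few lines. The levels, limits, and the notification callback here are assumptions for illustration; in practice the `notify` hook would fan out to e-mail, SMS, or a ticketing system.

```python
# Sketch of threshold-based alerting: compare one metric sample against
# pre-configured limits and notify on every breached level.
def evaluate_thresholds(metric_name, value, thresholds, notify):
    """Return the breached levels and send one alert per breach."""
    breached = [level for level, limit in thresholds.items()
                if value >= limit]
    for level in breached:
        notify(f"[{level.upper()}] {metric_name}={value} "
               f"breached limit {thresholds[level]}")
    return breached

# Usage (illustrative thresholds):
#   evaluate_thresholds("error_rate_pct", 7.5,
#                       {"warning": 1.0, "critical": 5.0},
#                       send_email)
```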
Why should you bother your engineers again and again with similar issues? Wouldn't it be much better if these problems were resolved in an automated fashion? The good news is that the latest monitoring and observability platforms come with such self-healing capabilities.
How does self-healing work? Four steps lead from problem detection to resolution:
- A problem is detected
- An alert is sent to the automation engine
- The automation engine performs a corrective action
- The problem alert is closed
In more detail:
- Active or passive monitoring detects a deviation from the expected state
- An automated alert is sent out
- The problem spot is highlighted on a monitoring cockpit
- A self-healing action is triggered
- The problem ticket is closed automatically
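The steps above can be sketched as a small routing function inside the automation engine. The remediation registry and the ticket store are illustrative assumptions, not the API of any particular product.

```python
# Sketch of the self-healing loop: detect -> alert -> remediate -> close.
def self_heal(problem: dict, remediations: dict, tickets: dict) -> str:
    """Route a detected problem to its automated corrective action."""
    # Steps 1-2: the alert arrives at the automation engine as `problem`.
    action = remediations.get(problem["type"])
    if action is None:
        return "escalated"  # no automation known: hand over to humans
    action(problem)                              # step 3: corrective action
    tickets[problem["ticket_id"]] = "closed"     # step 4: close the alert
    return "resolved"

# Example registry (illustrative): restart a service reported as down.
restarted = []
remediations = {"service_down": lambda p: restarted.append(p["service"])}
```

A sensible design choice here is the `"escalated"` fallback: self-healing should only handle problem types with a known safe action and leave everything else to the on-call team.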
We are here not only to share such stories but also to provide consulting services and guide you on your monitoring modernization journey.
Keep up the great work! Happy monitoring!