Full Stack Monitoring @ Fin Tech

Regulatory agencies such as MAS or FINMA closely monitor how financial institutions build, test and operate their business-critical services. When a critical service crashes, the agencies have to be informed within one hour. In this post, I will give you insights into how to build a reliable monitoring stack that helps you detect and escalate critical outages automatically.

The major elements of full stack monitoring are:

1. Active monitoring
2. Passive monitoring
3. Visualization
4. Alarming

# Active monitoring

Active monitoring mimics the clicks of a real user. It helps you see your entire application through the eyes of your end users. You select a set of top business processes and automate them with a synthetic monitoring platform. The resulting monitoring scripts run on execution machines close to your users’ locations, and an execution schedule controls the testing frequency. If an error occurs, a notification is sent to your support teams to give them enough time for corrective measures.
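To make the idea more concrete, here is a minimal sketch of a scripted check in Python. It is not how any particular synthetic monitoring platform works internally. The names `run_synthetic_check`, `run_on_schedule` and `StepResult` are made up for illustration, and each step is assumed to be a plain callable that raises an exception when the business step fails.

```python
import time
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    detail: str = ""


def run_synthetic_check(steps, notify):
    """Run each (name, action) step in order; stop and notify on the first failure."""
    results = []
    for name, action in steps:
        started = time.perf_counter()
        try:
            action()  # hypothetical: would drive a browser or call an API
            results.append(StepResult(name, True, time.perf_counter() - started))
        except Exception as exc:  # any exception counts as a failed business step
            elapsed = time.perf_counter() - started
            results.append(StepResult(name, False, elapsed, str(exc)))
            notify(f"Synthetic check failed at step '{name}': {exc}")
            break
    return results


def run_on_schedule(steps, notify, interval_seconds, runs, sleep=time.sleep):
    """Repeat the check `runs` times, waiting `interval_seconds` between runs."""
    history = []
    for i in range(runs):
        history.append(run_synthetic_check(steps, notify))
        if i < runs - 1:
            sleep(interval_seconds)
    return history
```

In a real setup, each action would replay one step of a business process, such as login or a payment, and `notify` would hand the message to your alerting channel. The per-step timing is what later feeds the availability and performance figures.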

Some familiar active monitoring solutions are:

- Silk Performance Manager from Micro Focus
- UserView, BrowserView, ServerView from dotcom-monitor

# Passive monitoring

Simulating the most important use cases in production, the so-called active monitoring, is a good way to check the availability and accuracy of your core services. However, real users’ permissions, data constraints and the runtime behavior of a business application can still have a high impact on core applications. Passive monitoring helps you get the required transparency.

The term “passive” indicates that this type of monitoring is more of a listener. There is no robot-based simulation of user actions involved. In contrast to active monitoring, passive monitoring depends on real customer requests and collects all user transactions, including input parameters, request details and many more application-specific details. This type of monitoring requires a monitoring agent that is installed on your application servers and collects all relevant details.

Application performance, transaction, user experience, system resource and logfile monitoring all fall under the passive monitoring umbrella. Such passively collected metrics are useful for error analysis, continuous optimization and capacity planning because they tell you exactly how many users were affected and which component caused a specific issue.
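As a rough illustration of how collected transactions answer the questions "how many users were affected" and "which component was the cause", consider the following Python sketch. The `Transaction` record and its fields are assumptions for this example only. Real APM agents capture far richer data in their own formats.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    user_id: str
    component: str      # hypothetical: the service that handled the request
    duration_ms: float
    failed: bool


def summarize(transactions):
    """Return per-component request count, failure rate and affected users."""
    requests = defaultdict(int)
    failures = defaultdict(int)
    affected = defaultdict(set)  # distinct users who saw at least one failure
    for t in transactions:
        requests[t.component] += 1
        if t.failed:
            failures[t.component] += 1
            affected[t.component].add(t.user_id)
    return {
        component: {
            "requests": requests[component],
            "failure_rate": failures[component] / requests[component],
            "affected_users": len(affected[component]),
        }
        for component in requests
    }
```

Counting distinct user IDs rather than failed requests matters here: one user who retries a broken payment ten times is still only one affected user.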

Some passive monitoring tools are:

- APM and UEM: Dynatrace, New Relic, AppDynamics
- Infrastructure: Nagios, Tivoli
- Logfile: Splunk

# Visualization

Automated incident alerting is an excellent and useful feature, but in many cases it is still not sufficient. A visualization of the current health status is recommended. Many teams use monitoring screens to chart the core performance and health figures of their critical services. The benefit of this approach is that your teams can feel safe as long as all traffic lights are green. Charted key performance indicators can also be shared with business and management teams, and the dashboards are easy to reuse for internal reporting.

As a good practice, the following KPI visualizations could be used:

- Create combined or dedicated performance cockpits
- Add traffic lights to visualize actual and historical health metrics
- Add real user stats such as action time, requests executed, failure rate
- Add active monitoring status such as availability and accuracy
- Add host health status such as CPU, Memory, Network and Disk utilization stats
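The traffic light idea can be sketched in a few lines of Python. This is only an illustration: the metric names and threshold values are invented, and every metric is assumed to be "higher is worse", so availability would have to be expressed as unavailability.

```python
def traffic_light(value, warn_at, critical_at):
    """Map a 'higher is worse' metric to green, yellow or red."""
    if value >= critical_at:
        return "red"
    if value >= warn_at:
        return "yellow"
    return "green"


def cockpit_status(metrics, thresholds):
    """Return one light per metric plus the worst light as the overall status."""
    severity = {"green": 0, "yellow": 1, "red": 2}
    lights = {
        name: traffic_light(value, *thresholds[name])
        for name, value in metrics.items()
    }
    overall = max(lights.values(), key=severity.__getitem__, default="green")
    return lights, overall


# Hypothetical readings and (warn_at, critical_at) thresholds.
lights, overall = cockpit_status(
    {"cpu_percent": 72.0, "failure_rate_percent": 0.4, "action_time_ms": 2600.0},
    {
        "cpu_percent": (70.0, 90.0),
        "failure_rate_percent": (1.0, 5.0),
        "action_time_ms": (1500.0, 2500.0),
    },
)
```

Taking the worst individual light as the overall status is one possible design choice. It keeps a combined cockpit honest, because a single red metric cannot be hidden behind many green ones.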

# Alarming and Incident Notification

Software errors, hardware defects or human error can impact the performance or uptime of your business services. You can’t avoid such problems completely. Full stack monitoring helps you detect such issues early and escalate them to the teams in charge. If preconfigured thresholds are reached, an e-mail, SMS or trouble ticket is sent out. This alerting practice is extremely useful because your teams can’t watch their monitoring screens 24/7.
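Threshold-based alerting can be sketched as follows. This is a simplified Python example under stated assumptions: `AlertRule`, `evaluate_rules` and the channel names are hypothetical, and each sender is just a callable standing in for a real e-mail, SMS or ticketing integration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertRule:
    metric: str
    threshold: float
    channels: List[str]  # hypothetical channel names, e.g. "email", "sms", "ticket"


def evaluate_rules(metrics, rules, senders):
    """Send one message per channel for every rule whose threshold is reached."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None or value < rule.threshold:
            continue  # metric missing or still below its threshold
        message = f"{rule.metric}={value} reached threshold {rule.threshold}"
        for channel in rule.channels:
            senders[channel](message)
        fired.append(rule.metric)
    return fired
```

A production setup would usually add deduplication and escalation on top of this, so that a metric which stays above its threshold does not trigger a new message on every evaluation.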

From problem detection to resolution