How to keep your complex systems from failing in complex ways

As technology has advanced, the complexity of our business applications has increased dramatically. Virtualization, microservices, and artificial intelligence are about to dominate our IT landscape.

Expensive downtimes

In a project several years ago, I was called in for a firefighting exercise. Customers were unable to configure their mobile device contracts, sales staff started to open new contracts on paper, and back-office teams were overloaded with paper requests. The reason for this nightmare was a performance bottleneck in the new contract management system. In response, IT teams analyzed log files, added more hardware, and restarted their business applications several times a day. After some days, it became apparent that neither more infrastructure nor log file analysis would fix the issue.

Our biggest concern was the lack of insight. The system resource metrics and log files indicated no issue at all, yet customers were extremely frustrated with the reliability of this new business application. A closer look at the log files, which was a challenge in itself due to the distributed architecture, showed that all service response times slowed down after a while. After this initial analysis, we agreed that it would take ages to pin down the problem with the limited monitoring information available.
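To illustrate how such a gradual slowdown can be made visible from log data, here is a minimal sketch. It assumes the response times have already been extracted from the distributed log files as (epoch seconds, milliseconds) pairs; the function names and the one-hour window are hypothetical choices for illustration, not what we used in the project.

```python
from statistics import median


def median_by_window(samples, window_s=3600):
    """Group (epoch_seconds, response_time_ms) samples into fixed time
    windows and return [(window_start, median_ms), ...] in time order."""
    buckets = {}
    for ts, ms in samples:
        start = int(ts // window_s) * window_s
        buckets.setdefault(start, []).append(ms)
    return [(start, median(values)) for start, values in sorted(buckets.items())]


def slowdown_factor(samples, window_s=3600):
    """Ratio of the last window's median to the first window's median.
    Returns None when there are fewer than two windows to compare."""
    windows = median_by_window(samples, window_s)
    if len(windows) < 2 or windows[0][1] == 0:
        return None
    return windows[-1][1] / windows[0][1]
```

A factor that keeps growing between restarts points to a resource that degrades over time, which raw resource metrics alone may not reveal.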

Modernization of the monitoring stack

Due to the high pressure and the loss of revenue, we decided to close the visibility gap in the application monitoring chain. Priority one was insight into user transactions and response times, plus continuous service response time monitoring. We decided to integrate an application performance monitoring (APM) solution for this purpose. Priority two was 24x7 health monitoring with automatic alerting when thresholds were exceeded. To that end, we automated the most important end-user interactions and executed them in the availability monitoring solution on a 5-minute schedule.
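The idea behind scheduled availability checks with threshold alerting can be sketched in a few lines. This is a simplified illustration and not the commercial monitoring solution we used: the names run_check and monitor, the alert callback, and the threshold values are all assumptions, and each "transaction" stands in for a scripted end-user interaction.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    ok: bool
    elapsed_s: float
    reason: str = ""


def run_check(name: str, transaction: Callable[[], None], threshold_s: float,
              clock: Callable[[], float] = time.perf_counter) -> CheckResult:
    """Run one scripted transaction and compare its duration to a threshold."""
    start = clock()
    try:
        transaction()
    except Exception as exc:  # any failure counts as unavailable
        return CheckResult(name, False, clock() - start, f"failed: {exc}")
    elapsed = clock() - start
    if elapsed > threshold_s:
        return CheckResult(name, False, elapsed,
                           f"{elapsed:.2f}s exceeded threshold {threshold_s:.2f}s")
    return CheckResult(name, True, elapsed)


def monitor(checks, alert, interval_s=300, rounds=None, sleep=time.sleep):
    """Run all checks every interval_s seconds (300 s = 5 minutes).

    checks: list of (name, transaction, threshold_s) tuples.
    alert:  callback invoked with the CheckResult of every failed check.
    rounds: number of rounds to run; None means run forever.
    """
    completed = 0
    while rounds is None or completed < rounds:
        for name, transaction, threshold_s in checks:
            result = run_check(name, transaction, threshold_s)
            if not result.ok:
                alert(result)
        completed += 1
        if rounds is None or completed < rounds:
            sleep(interval_s)
```

In practice the alert callback would notify the support team, and the transactions would drive the real user interface or API instead of a local function.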

Quick error detection and remediation

Thanks to the improved monitoring stack, we were able to identify the causes of that nasty issue. The integrated APM platform exposed details such as insufficient threads, inappropriately sized JDBC pools, deadlocks, chatty application behavior, and long garbage collection pause times. We removed those hotspots over several tuning cycles, and the overall health reached the expected level. Support teams set up watch machines that permanently displayed health and performance metrics on their monitoring dashboards.

Make monitoring part of your development and testing activities and share your metrics across the organization.

This continuous insight will help eliminate hotspots proactively and keep you from costly outages.

Happy Performance Engineering!

#ComplexSystems #Monitoring #Performance