Uncover the True Cause of Performance Incidents

Josef Mayrhofer
Jul 28, 2018

Updated: Apr 15, 2022

We build beautiful applications, but they break in unpredictable ways. Uncovering the real cause of such failures can be time-consuming and often leads to dead ends. Disasters such as the Equifax breach or outages of major online banking sites are all attributed to human error. Yet in none of those cases was a single person at fault; rather, there were no processes and measures in place to prevent such things from happening. In this post, I will outline a strategy for identifying the real cause of highly complex performance hotspots.

Detection

The processes and tools in place largely determine how long it takes to detect bottlenecks. Performance hotspots tend to appear slowly: they impact end users more and more and, after a while, block your business applications completely. According to research, only one in nine users reports a performance slowdown, which makes the detection of performance hotspots extremely difficult.

Takeaway: Proactivity is critical. If proper processes and tools are available, support teams may detect critical hotspots before they become visible to your users. Educate your teams on the detection of typical performance anti-patterns.
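As an illustration of proactive detection, here is a minimal Python sketch that flags a gradual slowdown by comparing the median of recent response times with a known-good baseline. The function name `detect_slowdown`, the `tolerance` factor of 1.5, and the sample values are assumptions made for this example, not part of any specific monitoring tool.

```python
from statistics import median


def detect_slowdown(baseline_ms, recent_ms, tolerance=1.5):
    """Return True when the recent median response time exceeds
    the baseline median by more than the tolerance factor.

    The tolerance of 1.5 is an illustrative assumption; tune it
    to the variability of your own application.
    """
    if not baseline_ms or not recent_ms:
        # Without data on both sides there is nothing to compare.
        return False
    return median(recent_ms) > tolerance * median(baseline_ms)


# Hypothetical response times in milliseconds.
baseline = [120, 130, 125, 118, 122]
recent = [190, 210, 205, 198, 220]

print(detect_slowdown(baseline, recent))
```

A median is used here because it is less sensitive to single outliers than a mean; in practice a high percentile such as the 95th may be a better fit for user-facing latency.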

Notification

Manual incident detection and notification is error-prone. It can take hours to involve the teams responsible for implementing the remediation actions. Quick problem detection and escalation help organizations reduce their mean time to repair. This can be achieved through automation, use of each team's preferred communication channel, and assignment of problem tickets to the correct teams in charge.

Takeaway: Inform relevant parties automatically about incidents and use their preferred communication channel to minimize the mean time to repair (MTTR).
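Automatic routing of this kind can be sketched in a few lines of Python. The ownership map, team names, service names, and channels below are hypothetical placeholders; a real setup would load them from your own configuration data store and hand the result to your alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    name: str
    channel: str  # preferred channel, e.g. "chat", "email", "pager"


# Hypothetical mapping of services to the teams in charge.
OWNERS = {
    "checkout-service": Team("payments", "pager"),
    "search-service": Team("discovery", "chat"),
}

# Assumed fallback so that no incident goes unassigned.
FALLBACK = Team("operations", "email")


def route_incident(service, summary):
    """Pick the responsible team and its preferred channel for an incident."""
    team = OWNERS.get(service, FALLBACK)
    return {
        "team": team.name,
        "channel": team.channel,
        "message": f"[{service}] {summary}",
    }


notification = route_incident("checkout-service", "p95 response time above 2 s")
print(notification)
```

The fallback team is a deliberate design choice: an incident for an unknown service still reaches someone instead of being silently dropped.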

Assessment

Not every incident is critical. Minor issues such as small spikes in system resource utilization or a temporary hiccup of a shared service have no immediate impact on end-to-end availability. Complete outages of interfaces, services, or machines, on the other hand, can drive your business teams crazy and require quick remediation. Assessment of performance-related incidents should be automated in the first place, based on metadata stored in your configuration data store.

Takeaway: Understanding how many end users are affected by a particular incident helps you judge its criticality even better. Involve more senior engineers if more users are suffering from the identified issue, to increase the chance of fixing such problems in less time. Robust monitoring and a well-structured incident process make your incident assessment more efficient.

Resolution

The range and complexity of our incidents are enormous. Simple system resource escalations are typically a no-brainer, while multiple deadlocks require more time and experience before they can be nailed down and an appropriate fix implemented. A fix for complex problems often requires some trial and error. Detailed information about a specific failure, such as screenshots of error messages, the affected user, the data used, and the causing service, simplifies the implementation of a fix or workaround.

Takeaway: Incident reproduction can be time-consuming. Capturing all end-user requests 24/7, combined with probing of essential use cases, reduces the time to fix significantly.

The human factor

We turn to automation in the hope of reducing the chance of human error, while we forget that humans are involved in all our processes. Important considerations such as usability, adaptability, and experience are often neglected, which results in a lack of acceptance. Review all steps in your incident handling process from time to time, arrange feedback sessions, and continuously improve the way your teams identify and escalate critical hotspots such as performance issues. Finger-pointing and endless war room sessions are not useful at all. They kill out-of-the-box thinking and drive your teams back behind walls. Practical experience is another aspect that should not be neglected. Shadowing senior engineers is a good way for your juniors to build knowledge of the whole incident analysis and resolution process.

Takeaway: We have been hoping for decades to find a silver bullet that solves all performance issues. Artificial intelligence is a small step in this direction, but there is still a human factor involved because people develop and use AI-based problem detection solutions.

Luckily, the creativity of humans can’t be outweighed by machines or algorithms. Consider these rules in all your performance incident identification, escalation, analysis and resolution processes.

Contact me if you are interested in how to uncover performance hotspots in your applications before they reach your happy customers!