Updated: Feb 11
Our service portfolio keeps growing, and frustrated users can quickly move on to more reliable sites. Besides this commercial aspect, downtime can also cause serious problems with regulatory agencies. Here are some simple steps you can use right away to conduct a high availability analysis. A follow-up post will cover how to identify single point of failure risks in your critical services.
A highly available system continues its business function even if a sub-service crashes. No single point of failure (SPOF) is acceptable in such a system. Therefore, you should consider the redundancy of critical components and the fault tolerance of your core system. A high availability analysis usually starts with a comparison of expected and actual availability, followed by the identification of single points of failure.
The expected availability is often part of service specifications or contracts with service providers. Alternatively, you can use this formula to calculate it:
$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
Service Time: the hours per day during which the service is available to your users
MTTR: Mean time to repair
MTBF: Mean time between failures (simplified here to Service Time – MTTR)
Let’s assume that users work with our CRM application 21.5 hours a day, 20 days a month, and that the monthly downtime is 4 hours. What is the availability of our CRM application?
Service Time: 21.5 hours × 60 minutes × 20 days = 25800 minutes per month
MTTR: 240 minutes per month
MTBF: 25800 – 240 = 25560 minutes per month
Availability: 25560 / (25560 + 240) = 99.07%
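To double-check the arithmetic, here is a minimal Python sketch of the formula, using this post’s simplified definition MTBF = Service Time – MTTR. The function and variable names are illustrative only, not part of any library.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)


# CRM example: used 21.5 hours a day, 20 days a month, with 4 hours of downtime per month
service_time = 21.5 * 60 * 20  # 25800 minutes per month
mttr = 4 * 60                  # 240 minutes per month
mtbf = service_time - mttr     # 25560 minutes per month (simplified definition)

print(f"Availability: {availability(mtbf, mttr):.2%}")  # prints "Availability: 99.07%"
```

Because MTBF is defined as Service Time – MTTR here, the formula reduces to (Service Time – MTTR) / Service Time, which is simply the share of service time without downtime.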
Our applications consist of many sub-services such as databases, middleware, load balancers, networks, and external service providers. An outage of one load-balanced server will probably not affect availability, because the remaining servers take over, while a crash of a single database will often result in downtime. The former is called availability in parallel and the latter availability in series.
Availability in series (every component must be up):
$$A = A_x \cdot A_y$$
Availability in parallel (two redundant components with the same availability $A_x$; the system fails only if both fail):
$$A = 1 - (1 - A_x)^2$$
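The two rules translate directly into code. The following minimal Python sketch writes the parallel rule in its general form, 1 – (1 – Ax)(1 – Ay)…, which reduces to 1 – (1 – Ax)² for two identical components. The 99% and 99.5% figures are hypothetical.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Series: every component must be up, so A = Ax * Ay * ..."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Parallel: down only if all redundant components are down, so A = 1 - (1 - Ax)(1 - Ay)..."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures: two load-balanced servers at 99% each,
# in series with a single database at 99.5%
servers = parallel(0.99, 0.99)   # 1 - (1 - 0.99)**2, about 0.9999
system = series(servers, 0.995)
print(f"Servers: {servers:.4%}, system: {system:.4%}")
```

Note how redundancy lifts two 99% servers to about 99.99% together, while the single database in series caps the whole system below its own 99.5%.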
Typically, the actual availability of a system is calculated in the following four steps:
1. Prepare a block diagram of the system
2. Develop a reliability model of the system
3. Calculate the availability of the sub-services
4. Determine the availability of the entire system
The block diagram helps you visualize the essential components. Once you have created this visual representation of your services, you derive the reliability model by identifying which components are arranged in series and which in parallel. Finally, you calculate the availability of your sub-services and of the entire system.
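Steps 2 to 4 can be sketched in Python by encoding the reliability model as nested series and parallel blocks and folding it into a single number. This is a minimal sketch; the class and function names are hypothetical, chosen for this illustration, and the component figures are made up.

```python
from dataclasses import dataclass
from math import prod
from typing import Union


@dataclass
class Component:
    """A leaf block with a known availability between 0 and 1."""
    name: str
    availability: float


@dataclass
class Series:
    """Up only if all child blocks are up."""
    blocks: list["Block"]


@dataclass
class Parallel:
    """Up as long as at least one child block is up."""
    blocks: list["Block"]


Block = Union[Component, Series, Parallel]


def evaluate(block: Block) -> float:
    """Fold a reliability model into a single availability figure."""
    if isinstance(block, Component):
        return block.availability
    if isinstance(block, Series):
        # Series rule: multiply the availabilities of all blocks in the chain
        return prod(evaluate(b) for b in block.blocks)
    # Parallel rule: the group is down only if every redundant block is down
    return 1 - prod(1 - evaluate(b) for b in block.blocks)


# Hypothetical model: load balancer -> two redundant app servers -> single database
model = Series([
    Component("load balancer", 0.999),
    Parallel([Component("app-1", 0.99), Component("app-2", 0.99)]),
    Component("database", 0.995),
])
print(f"System availability: {evaluate(model):.4%}")
```

Nesting keeps the model close to the block diagram: each box becomes a `Component`, and each group of boxes becomes a `Series` or `Parallel` block.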
Let’s continue with our sample CRM application, which consists of three sub-services, each with a service time of 21.5 hours a day. Sub-services 1 and 3 have an MTTR of 240 minutes per month, and Sub-service 2 has an MTTR of 480 minutes per month.
What is the actual availability of this CRM application?
Step 1: Block diagram. Our CRM application consists of three sub-services.
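The block diagram itself is not reproduced here, so the following Python sketch covers only the arithmetic of steps 3 and 4, under two stated assumptions: the 20 days per month from the earlier example also apply here, and all three sub-services are connected in series, meaning each one is a single point of failure. The series layout is a hypothetical stand-in for the actual diagram.

```python
from math import prod

# ASSUMPTION: 21.5 hours a day, 20 days a month, as in the earlier example
SERVICE_TIME = 21.5 * 60 * 20  # 25800 minutes per month

# Monthly MTTR per sub-service in minutes, as given in the text
mttr = {"Sub-service 1": 240, "Sub-service 2": 480, "Sub-service 3": 240}

# Step 3: availability of each sub-service.
# With MTBF = Service Time - MTTR, MTBF / (MTBF + MTTR) simplifies to this ratio.
availability = {
    name: (SERVICE_TIME - m) / SERVICE_TIME
    for name, m in mttr.items()
}
for name, a in availability.items():
    print(f"{name}: {a:.2%}")

# Step 4: ASSUMPTION - all three sub-services in series (hypothetical layout),
# so the system is up only when every sub-service is up
system = prod(availability.values())
print(f"CRM application (series assumption): {system:.2%}")
```

Under the series assumption, the result is roughly 96.3%, well below the 99.07% of Sub-services 1 and 3 on their own. If the diagram showed redundant sub-services, the parallel rule would apply to those blocks instead.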
The example above demonstrates how to compare expected with actual availability. Keep in mind that some measures, such as proactive error detection and redundancy, improve availability, while other factors, such as complexity, have an adverse impact on system reliability.
In my next post, I will outline a single point of failure analysis and provide a high availability calculation sheet that you can use right away.