Troubleshooting performance issues can sometimes be a tough nut to crack!
Customers complain about slow response times, and C-level executives increase the pressure on their IT folks while we jump from one trial-and-error tuning attempt to the next!
Is the answer to fully automate the process of finding, escalating, and guiding engineers to the root cause of slowdowns? Many books about tooling promise that it is, but the opposite is often the case. It all comes down to having engineers with long experience in tuning applications and familiarity with the entire stack involved: full-stack performance analysis.
Don't start your analysis in the log files or the backend layer, because this approach usually leads to a dead end. You'll get very frustrated as you waste hours of valuable time and still won't identify the root cause of the performance issue.
In my experience, this bottom-up approach only works if you build quality into your product and validate your features as you develop them. High test coverage and high quality in the lower layers will continuously improve the quality of the product. For troubleshooting and problem analysis, always choose a top-down strategy. Put yourself in the shoes of an end user, run their use cases, and find out where the slowdown comes from by looking at the frontend, application, backend, container, and infrastructure layers.
I'll now give you a rundown of the core concepts behind this process that I call "full-stack performance analysis."
Frontend
Thick-client applications are steadily being replaced by modern browser-based applications. Unfortunately, we often implement these web applications the same way we implemented the old thick clients. Many frameworks give us the techniques needed to put more logic on the client side, but this approach has pros and cons. I prefer browser-based applications for several reasons, but that's a story for another time. Rich browser-based applications are, in fact, a common source of performance bottlenecks. That's why I always start with a web page design analysis: it checks frontend-design best practices and gives me quick insights I can share with the developers.
Problems: missing content compression, large pages, blocking scripts, and ineffective caching.
Tools: PageSpeed, YSlow, Lighthouse.
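To make the caching point concrete, here is a minimal sketch of a servlet filter that adds long-lived cache headers for fingerprinted static assets. It assumes a Java stack with the Jakarta Servlet API; the class name and header values are illustrative, not a recommendation, and content compression itself is usually switched on in the web server or servlet container rather than in application code.

```java
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;

// Hypothetical filter: lets browsers and CDNs cache fingerprinted assets
// (e.g. app.3f2a1c.js) for a year, so repeat visits skip those downloads.
public class StaticAssetCacheFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        response.setHeader("Cache-Control", "public, max-age=31536000, immutable");
        chain.doFilter(req, res);
    }
}
```

Lighthouse and PageSpeed typically flag exactly these issues in their reports, for example missing text compression or an inefficient cache policy, so the tool output tells you which fix is worth the effort.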
After a page is loaded, the real work on the client layer begins. Content is then updated dynamically while the user navigates through the application.
Problems: JavaScript parsing, compiling, and execution, as well as rendering issues.
Tools: Use user-experience monitoring (UEM) and browser developer tools, start a profiling session, and review the recorded actions. Then look at the methods and requests that take the most time.
Application
Once the frontend developers are working on those optimizations, proceed with the application-layer performance analysis. Start with an investigation of all the captured requests, then review the logical and physical architecture diagrams to understand the communication patterns and the technology stack involved. First, check the error-prone areas: connection pools, configuration settings, existing errors and exceptions, the log-file configuration, heap sizing, and threading. These initial investigations almost always uncover trouble spots. In the second phase, run API-level performance tests against your application services and check how the relevant performance metrics behave under normal and peak-load conditions.
Problems: Thread- and HTTP-pool sizing issues, a high number of errors and exceptions causing overhead, excessive logging, high garbage-collection times due to heap-sizing problems, and slow response times caused by these sizing issues.
Tools: Load-testing tools such as Gatling, JMeter, NeoLoad, LoadRunner, and APM tools such as JavaMelody, Dynatrace, or Datadog.
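To illustrate the pool-sizing point, here is a minimal sketch of an explicitly bounded worker pool in plain Java. The numbers are illustrative assumptions, not recommendations; derive the real sizes from how your services behave under normal and peak load.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of a bounded worker pool: an explicit maximum size and a bounded
// queue provide back-pressure instead of unbounded thread and memory growth.
public class WorkerPoolConfig {
    public static ExecutorService createWorkerPool() {
        int cores = Runtime.getRuntime().availableProcessors();
        return new ThreadPoolExecutor(
                cores,                      // core pool size
                cores * 4,                  // hard upper bound (illustrative)
                60, TimeUnit.SECONDS,       // idle threads above the core size are reclaimed
                new ArrayBlockingQueue<>(500),              // bounded request queue
                new ThreadPoolExecutor.CallerRunsPolicy()); // degrade gracefully when saturated
    }
}
```

An API-level load test then shows whether these limits hold up: if response times climb while the pool and queue are saturated, the sizing, not the code, is the first thing to revisit.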
Backend
A reliable car depends on a robust, high-quality engine. If this essential part causes problems, you'll never be happy with your vehicle. The same principle applies to our backend and third-party services: they need to be scalable and very fast. Here, we are talking about response times in milliseconds, not seconds. Many problem spots can result in buggy behavior. Errors and exceptions cause overhead, and a careless logging configuration can result in high CPU or I/O activity on your critical infrastructure. Also, remember to compare your test and production environments and identify the gaps between them.
Problems: Third-party services are unavailable or not comparable to production, the error rate is too high, logging is too extensive, the virtual-machine configuration is invalid, pool sizing is inappropriate, and there may be coding issues.
Tools: Transaction tracing and APM tools, memory-analysis tools, and network-monitoring tools.
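Two of the backend problem spots above, exception overhead and over-eager logging, are easy to show in code. This is a minimal sketch in plain Java with a hypothetical order lookup; the point is the logging guard and returning an empty result instead of throwing for a perfectly normal "not found" case.

```java
import java.util.Map;
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

public class OrderLookup {
    private static final Logger LOG = Logger.getLogger(OrderLookup.class.getName());
    private final Map<String, String> ordersById; // stand-in for a real data store

    public OrderLookup(Map<String, String> ordersById) {
        this.ordersById = ordersById;
    }

    public Optional<String> find(String id) {
        // Guard: the message is only built when FINE logging is actually enabled,
        // so debug-level statements cost almost nothing in production.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("Looking up order " + id);
        }
        // A missing order is a normal outcome; returning an empty Optional avoids
        // filling in a stack trace on every miss, which adds up under load.
        return Optional.ofNullable(ordersById.get(id));
    }
}
```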
Data
Data access is often a critical element of modern applications. Slowdowns in this area can quickly create a ripple effect across many business-critical systems. Recently, I saw one node of a customer's Oracle RAC cluster stop working during peak time. The slowdown in the database layer caused significant problems in customer-facing business cases.
Problems: The N+1 query problem is very prominent: a transaction executes a similar SQL statement once per record instead of reading or writing the data in a single round trip. Other frequent problems are incorrectly sized database-connection pools, outdated or missing indexes, and poorly designed SQL.
Tools: Transaction-tracing tools such as APM solutions are a good starting point. Most databases also ship with statistics checkers or tuning-advisor utilities.
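To show what the N+1 pattern looks like in practice, here is a minimal sketch in plain JDBC against hypothetical orders and order_items tables: the first method issues one query per order, the second fetches everything in a single joined round trip.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrderItemLoader {

    // N+1 pattern: one round trip per order, on top of the query that fetched the orders.
    static Map<Long, List<String>> loadItemsNPlusOne(Connection con, List<Long> orderIds)
            throws SQLException {
        Map<Long, List<String>> items = new HashMap<>();
        for (Long orderId : orderIds) {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT name FROM order_items WHERE order_id = ?")) {
                ps.setLong(1, orderId);
                try (ResultSet rs = ps.executeQuery()) {
                    List<String> names = new ArrayList<>();
                    while (rs.next()) {
                        names.add(rs.getString(1));
                    }
                    items.put(orderId, names);
                }
            }
        }
        return items;
    }

    // Single round trip: one joined query returns all items for all orders.
    static Map<Long, List<String>> loadItemsJoined(Connection con) throws SQLException {
        Map<Long, List<String>> items = new HashMap<>();
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT o.id, i.name FROM orders o JOIN order_items i ON i.order_id = o.id");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                items.computeIfAbsent(rs.getLong(1), k -> new ArrayList<>()).add(rs.getString(2));
            }
        }
        return items;
    }
}
```

A transaction trace makes the difference obvious: the first variant shows up as hundreds of short, identical statements, the second as a single, slightly longer one.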
Container
Microservices are a big trend. Although they offer many advantages regarding reliability, performance tuning can become even more complex. Monitoring tools often fail to provide enough insight, and it's easy to neglect this all-important layer.
Problems: Inappropriate container sizing. Sizing your containers is a must: consider the total capacity of the underlying hardware and the needs of your microservices.
Tools: Consider using APM tools or the container platform's built-in configuration and tuning facilities.
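A quick, practical check, assuming your microservice runs on a JVM: print what the runtime actually sees inside its container. If the processor count or the maximum heap doesn't match the limits you thought you had configured, the container is effectively mis-sized from the application's point of view.

```java
// Minimal container-sizing sanity check for a JVM-based service.
public class ContainerSizingCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.printf("Available processors: %d%n", rt.availableProcessors());
        System.out.printf("Max heap: %d MiB%n", rt.maxMemory() / (1024 * 1024));
    }
}
```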
Infrastructure
A few weeks ago, I read a whitepaper from IBM stating that the root cause of over 80 percent of all performance issues is in the application layer. I can verify this statement from my own experience with many performance-engineering projects. Only in rare cases is the hardware stack responsible for slowdowns or significant performance issues. Once the sizing activities are completed, we don't usually take the extra time to troubleshoot those lower layers. However, it's good to know the issues that may arise and which utilities should be in your toolbox to eliminate such problem spots.
Problems: First, the sizing of physical or virtual machines must be appropriate. Don't oversize machines, because that's simply a waste of money; size them so that, in the worst case, the availability requirements can still be met by the remaining 50 percent of your application-server capacity. Secondly, background processes can consume an uncontrolled amount of memory and destroy the performance of your business apps. Ensure a separation of concerns so that processes don't fight for system resources. Thirdly, cleaning up clutter such as old log files is essential for preventing disk-full errors; if a disk runs full, your application might crash. Lastly, network interfaces should allow high throughput, and storage channels should be carefully monitored. Minor slowdowns in these areas can wipe out your application's performance improvements.
Tools: Full-stack monitoring solutions such as Dynatrace, or OS-level utilities such as perfmon, top, and iostat, do an excellent job in this area. Of course, it all depends on which operating system your app is hosted on. You can write a shell script to collect the metrics at 10- or 30-second sampling intervals; choose the sampling interval carefully, because a frequency that's too high or too low is equally bad.
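In the same spirit as the shell-script approach just mentioned, here is a minimal Java sketch that samples a couple of OS-level metrics every 30 seconds. It assumes a HotSpot JVM on JDK 14 or newer, where the com.sun.management extension of OperatingSystemMXBean provides getCpuLoad() and getFreeMemorySize().

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import com.sun.management.OperatingSystemMXBean;

// Samples system CPU load and free physical memory at a fixed 30-second interval.
public class MetricsSampler {
    public static void main(String[] args) {
        OperatingSystemMXBean os = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> System.out.printf("%d cpu=%.2f freeMem=%d MiB%n",
                        System.currentTimeMillis(),
                        os.getCpuLoad(),                          // system-wide CPU load, 0.0 to 1.0
                        os.getFreeMemorySize() / (1024 * 1024)),  // free physical memory
                0, 30, TimeUnit.SECONDS);                         // 30-second sampling interval
    }
}
```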
Keeping all these things in mind is a good starting point for a full-stack performance engineer. Never limit yourself to load injection and end-to-end response times. Using all our experience as performance engineers can make a big difference in the outcome of our projects. Remember that performance and reliability are among the most essential features.
A new service needs to be usable, after all!
Happy performance engineering!