Full-stack performance analysis
Troubleshooting performance issues can sometimes be a tough nut to crack!
The scenario: Customers are complaining about slow response times, and C-Level executives are increasing the pressure on their IT folks, while we jump from one trial-and-error tuning to the next!
Is the answer to fully automate the process of finding and escalating slowdowns and guiding engineers to their root cause? Many books about tools promise that it is, but in practice the opposite is often the case. It all comes down to having engineers with long experience in tuning applications and familiarity with the entire stack involved in full-stack performance analysis.
Don’t start your analysis in the log files or the backend layer, because this approach usually leads to a dead end. You’ll get frustrated as you waste hours of valuable time and still won’t be able to identify the root cause of the performance issue.
Based on my experience, this kind of bottom-up approach can only work if you build quality into your product and validate your features. High test coverage and high quality on the lower layers will always improve the quality of the product. For troubleshooting and problem analysis, however, always choose a top-down strategy. Put yourself in the shoes of an end user, run their use cases, and find out where the slowdown is coming from by looking at it from the different perspectives of frontend, application, backend, container and infrastructure.
I’ll now give you a rundown of the core concepts behind this process that I call “full-stack performance analysis”.
Today, our thick-client applications are being replaced by modern browser-based applications. We often implement these web-based applications in the same way as fat-client ones. Although many frameworks give us all the required techniques for having more logic on the client side, there are pros and cons to this approach. I prefer browser-based applications for several reasons, but that’s another story. Rich browser-based applications are in fact a common source of performance bottlenecks. That’s why I always start with a web-page design analysis. It checks frontend-design best practices and gives me quick insights that I can share with the developers.
Problems: missing content compression, large pages, blocking scripts, missing or ineffective caching.
Tools: PageSpeed, YSlow, Lighthouse.
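Some of the checks listed above can be reproduced by hand when you only have a captured response in front of you. The following is a minimal sketch, not how any of the named tools work internally: the function name `audit_response_headers` and the 500 KB page budget are assumptions made for illustration, and a missing `Cache-Control` header is only a hint, because browsers may still cache heuristically.

```python
def audit_response_headers(headers, body_size_bytes, max_page_bytes=500_000):
    """Return a list of findings for one captured HTTP response.

    headers: mapping of header name to value.
    max_page_bytes: hypothetical size budget; tune it to your application.
    """
    # Header names are case-insensitive, so normalize before looking them up.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    findings = []

    encoding = normalized.get("content-encoding", "")
    if not any(token in encoding for token in ("gzip", "br", "deflate", "zstd")):
        findings.append("no content compression")

    cache_control = normalized.get("cache-control", "")
    if not cache_control or "no-store" in cache_control:
        findings.append("missing or disabled caching policy")

    if body_size_bytes > max_page_bytes:
        findings.append("large page")

    return findings


if __name__ == "__main__":
    captured = {"Content-Type": "text/html"}
    for finding in audit_response_headers(captured, body_size_bytes=800_000):
        print(finding)
```

Feeding it the headers exported from the browser's network panel is usually enough to decide whether a full audit with one of the tools above is worth the time.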
After a page is loaded, the real work on the client layer begins. Content is then updated dynamically while the user navigates through the application.
Tools: Use UEM and browser-developer tools, start a profiling session and review the recorded actions. Then take a look at the main method and those requests that are taking too much time.
Once the frontend developers are working on the optimizations, you should proceed with the application-layer performance analysis. Start with an investigation of all the captured requests. Then review the logical and physical architecture diagram to understand the communication pattern and technological stack involved. First, check the error-prone areas such as the connection-pool configurations, existing errors and exceptions, as well as the log-file configuration, heap configuration, and threading. These initial investigations always uncover trouble spots. Then, in a second phase, run the API-level performance tests on your application services and check how the mentioned performance metrics behave under actual regular- and peak-load conditions.
Problems: Thread and HTTP pool-sizing issues, a high number of errors and exceptions causing overhead, extensive logging, long garbage-collection times due to heap-sizing issues, and slow response times due to sizing issues.
Tools: Load-testing tools such as JMeter, NeoLoad, LoadRunner, SilkPerformer; and APM tools such as JavaMelody, Dynatrace or AppDynamics.
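For the pool-sizing issues mentioned above, a quick sanity check helps before any load test runs. One common rule of thumb, which is my addition and not something the tools above prescribe, is Little's Law: the number of requests in flight equals throughput multiplied by average response time. The sketch below applies it with integer arithmetic; the function name and the headroom percentage are assumptions for illustration.

```python
def required_pool_size(requests_per_second, avg_response_ms, headroom_percent=20):
    """Estimate a thread or connection pool size using Little's Law.

    in_flight = throughput * response_time
    All arguments are integers; headroom_percent is a hypothetical safety
    margin on top of the steady-state estimate.
    """
    numerator = requests_per_second * avg_response_ms * (100 + headroom_percent)
    # Ceiling division: 1000 converts ms to s, 100 converts the percentage.
    return -(-numerator // 100_000)


if __name__ == "__main__":
    # 200 requests/s at 50 ms each keep 10 requests in flight on average.
    print(required_pool_size(200, 50, headroom_percent=0))
    print(required_pool_size(200, 50))
```

Treat the result as a starting value only. The peak-load test from the second phase decides whether it holds, because response times tend to grow once a pool saturates.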
Reliable cars depend on a powerful, high-quality engine. If this essential part is causing problems, you’ll never be happy with your car. The same principle applies to our backend and 3rd-party services. These need to be scalable and very fast. Here, we are talking about response times in terms of milliseconds, not seconds. Many problem spots can result in buggy behavior. Errors and exceptions cause overhead, and an overly verbose logging configuration can result in high CPU or I/O activity on your critical infrastructure. Also, remember to compare your environments and identify any gaps between your test and prod environments.
Problems: 3rd-party services are not available or not comparable to prod, the error-rate is too high, logging is too extensive, the virtual-machine configuration is invalid, pool sizing is not appropriate and there may be some coding issues.
Tools: Transaction tracing and APM tools, memory-analysis tools, network-monitoring tools.
Data access is often a critical element in our modern applications. Slowdowns in this area can quickly create a ripple effect on many business-critical systems. Recently, I saw how one leg of a customer’s Oracle rack had stopped working during the peak time. The slowdown in the database layer was causing major problems in the customer-facing business cases.
Problems: The N+1 query problem is very prominent: a transaction runs one query to fetch a list of records and then one more query for each record, instead of reading or writing the data only once. Other frequent problems are incorrect database-access pool sizes, outdated or missing indexes, and badly designed SQL.
Tools: Transaction-tracing tools such as APM solutions are a good starting point. Databases are often fully equipped with a statistics checker or with tuning-advisor utilities.
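The N+1 query problem described above is easiest to recognize once you have seen it side by side with the fix. The following sketch uses an in-memory SQLite database from the Python standard library; the `customers` and `orders` tables and all function names are made up for illustration and do not come from any real project.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with hypothetical demo data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        CREATE INDEX idx_orders_customer ON orders (customer_id);
    """)
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "Ada"), (2, "Bob"), (3, "Cy")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5)])
    return conn


def totals_n_plus_one(conn):
    """Anti-pattern: one query for the list, then one query per customer."""
    customers = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    queries = 1
    totals = {}
    for customer_id, name in customers:
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
        queries += 1
        totals[name] = row[0]
    return totals, queries


def totals_single_query(conn):
    """Fix: let the database join and aggregate in a single round trip."""
    rows = conn.execute("""
        SELECT c.name, COALESCE(SUM(o.total), 0)
        FROM customers c
        LEFT JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.id
    """).fetchall()
    return dict(rows), 1


if __name__ == "__main__":
    conn = build_demo_db()
    print(totals_n_plus_one(conn))
    print(totals_single_query(conn))
```

Both functions return the same totals, but the first one needs a query count that grows with the number of customers. In a transaction trace this shows up as the same SQL statement repeated many times within a single request.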
Microservices are a big trend. Although these offer many advantages so far as reliability is concerned, performance tuning can become even more complex. Monitoring tools often fail to provide enough insights, and it’s easy to just neglect this all-important layer.
Problems: Sizing your container is a must. Consider the total capacity of the underlying hardware and the needs of your microservices.
Tools: Consider using APM tools or embedded container configuration and tuning solutions.
A few weeks ago I read a whitepaper from IBM stating that the root cause of over 80 percent of all performance issues is in the application layer. I can verify this statement from my own experience with many performance-engineering projects. Only in rare cases is the hardware stack responsible for slowdowns or major performance issues. Once the sizing activities are completed, we don’t usually take the extra time to troubleshoot those lower layers. However, it’s good to know the issues that may arise, and which utilities should be in your toolbox to get rid of such problem spots.
Problems: First of all, the sizing of physical or virtual machines must be appropriate. Don’t oversize machines because that’s just a waste of money. Ensure instead that in the worst case, the availability requirements can be fulfilled by the remaining 50 percent of your entire application server capacity. Secondly, background processes can take up an uncontrolled amount of memory and destroy the performance of your business apps. Ensure the separation of concerns to avoid processes fighting for system resources. Thirdly, cleaning up all the mess such as log files is important for avoiding disk-full errors. If your disk is running full, your application might crash. Lastly, network interfaces should allow high throughput, and storage channels should be carefully monitored. Minor slowdowns in that area can destroy all the performance improvements of your application.
Tools: Full-stack monitoring solutions such as Dynatrace, or OS-level utilities such as perfmon, top and iostat, do a good job in this area. It all depends on which operating system hosts your app. You can write a shell script to collect the metrics at a 10- or 30-second sampling interval. Check the sampling interval, because a frequency that is too low or too high is equally bad: sampling too rarely hides short spikes, while sampling too often adds overhead of its own.
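The metric collection described above does not have to be a shell script. Here is the same idea as a minimal Python sketch using only the standard library. The function names, the chosen metrics (1-minute load average and disk usage) and the defaults are assumptions for illustration, and `os.getloadavg` is only available on Unix-like systems.

```python
import os
import shutil
import time


def take_sample(path="/"):
    """Read the 1-minute load average and disk usage for one point in time."""
    load_1m, _, _ = os.getloadavg()  # Unix-like systems only
    disk = shutil.disk_usage(path)
    used_pct = round(100 * disk.used / disk.total, 1) if disk.total else 0.0
    return {"timestamp": time.time(), "load_1m": load_1m, "disk_used_pct": used_pct}


def collect(samples, interval_s=10, path="/", sleep=time.sleep):
    """Take `samples` readings, waiting `interval_s` seconds between them."""
    rows = []
    for index in range(samples):
        rows.append(take_sample(path))
        if index < samples - 1:  # no need to wait after the last reading
            sleep(interval_s)
    return rows


if __name__ == "__main__":
    for row in collect(samples=3, interval_s=10):
        print(row)
```

The disk-usage reading also covers the disk-full risk mentioned earlier: a steadily rising percentage between samples is a hint that log cleanup is not keeping up.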
Keeping all these things in mind is a good starting point for a full-stack performance engineer. Never limit yourself to the load-injection and end-to-end response-time layer. Using all our experience as performance engineers can make a big difference to the outcome of our projects. Remember that performance and IT reliability are among the most important features. A new service needs to be usable after all!
Happy performance engineering!