
Why should Performance Engineers expect the unexpected?

SOLR is the search backbone of many of the world's largest websites. It comes with near real-time indexing capabilities and is built to keep search queries fast while returning the correct data.


SOLR is a proven choice, no matter how big or critical your search workloads are. In addition, it provides replication and fault tolerance out of the box.

It is easy to fall into the trap of taking the performance of SOLR for granted. However, in one of our performance engineering projects, we learned that even such a highly optimized search engine can be responsible for significant performance problems.

The Verdict

  • Insurance application

  • Multi-Cloud

  • Real-time cloud data protection

  • SOLR search engine

  • Insurance backend layer

  • 3rd Party integrations

  • Kafka

  • Kubernetes

  • Tomcat

  • Nginx


Performance Requirements

  • 95 % of API response times < 500 ms

  • 100 requests per second
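Both requirements can be checked automatically against the raw results of a load test. The following is a minimal sketch, assuming the response times have already been exported as a list of millisecond values and that the 95 % target is read as a nearest-rank 95th percentile. The function names `percentile` and `meets_requirements` are illustrative and not part of JMeter or Dynatrace.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_requirements(response_times_ms, duration_s,
                       p95_limit_ms=500, min_rps=100):
    """Evaluate one test run against the two requirements listed above.

    response_times_ms: one entry per completed request (assumed input format)
    duration_s: wall-clock length of the measurement window in seconds
    """
    p95 = percentile(response_times_ms, 95)
    throughput = len(response_times_ms) / duration_s
    return {
        "p95_ms": p95,
        "throughput_rps": throughput,
        "passed": p95 < p95_limit_ms and throughput >= min_rps,
    }


if __name__ == "__main__":
    # Synthetic numbers for illustration only, not measurements from the project.
    run = [120] * 95 + [900] * 5
    print(meets_requirements(run, duration_s=1))
```

A check like this can run at the end of every load test, so a violated requirement shows up as a failed build rather than as a chart someone has to interpret.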


Performance Engineering approach

  • Observability first

  • Shift left

  • Load testing using JMeter

  • Performance Monitoring using Dynatrace

  • Performance validation on UAT and Production


The Problem

Response times were excellent during our early load testing activities and well within the agreed SLAs. However, business was slow during the COVID pandemic, and the customer had not yet launched its insurance portal. In the meantime, new versions of almost every component came out, and the customer deployed many of them in the brand-new production environment. Nevertheless, our customer hoped that the final production validation performance test would be a small exercise completed within a few days. That hope was mistaken: in a scenario of only two requests per second, response times in the new production environment reached up to 1 minute.


The Solution

As a first step, we simulated the expected load on the UAT stage. This load test resulted in perfect response times, and everyone was happy to have a green light for deployment on production.

You can imagine the immense frustration on our customer's side when we published our production validation load testing and monitoring results. With only a few weeks left until the scheduled go-live date, response times of up to 15 seconds were not what anyone had expected.


Response Time chart showing spikes during production testing


It's useless to blame anyone in such situations. Instead, we carefully investigated layer by layer and found that the production environment differed considerably from UAT. Online replication was in place and several security features were turned on, none of which we had ever used on the UAT stages. In addition, real-time indexing for SOLR was turned on.
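Differences like these are easier to catch when the settings of both environments are compared mechanically before a test run. The sketch below assumes each environment can be described as a flat dictionary of settings. The setting names are hypothetical, and the values only mirror the differences described above.

```python
def config_drift(uat, production):
    """Return {setting: (uat_value, production_value)} for every
    setting whose value differs between the two environments."""
    keys = sorted(set(uat) | set(production))
    return {
        key: (uat.get(key), production.get(key))
        for key in keys
        if uat.get(key) != production.get(key)
    }


if __name__ == "__main__":
    # Hypothetical setting names chosen for illustration.
    uat = {
        "solr_realtime_indexing": False,
        "online_replication": False,
        "security_features": False,
        "web_server": "nginx",
    }
    production = {
        "solr_realtime_indexing": True,
        "online_replication": True,
        "security_features": True,
        "web_server": "nginx",
    }
    for setting, (uat_value, prod_value) in config_drift(uat, production).items():
        print(f"{setting}: UAT={uat_value} production={prod_value}")
```

Settings that exist in only one environment show up with `None` on the other side, which makes missing components as visible as changed ones.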

Initially, we thought all these security features, such as Defender, were responsible for the performance degradation, so we disabled them and executed a benchmark test. However, turning these security tools off had only a minimal impact on performance.


After reviewing our full-stack monitoring results and discussing the load testing results with the developers and architects of this insurance portal, we suspected that too much time was being lost in multi-cloud communication. However, we still had some monitoring blind spots because our observability solution was not installed on all the components.


The Tuning

A few days prior to our load tests on production, a failover testing exercise had taken place. During that exercise, real-time indexing on SOLR and online data replication were enabled. Once we learned about this significant change, we decided to

  • adjust real-time indexing for SOLR

  • use SOLR in a Single Node Cluster

  • move SOLR to the same Network Segment as our remaining components

  • and run another load test

Surprisingly, the massive spikes in response time disappeared, and service latency returned to within the required 500 ms.
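The exact indexing setting that was adjusted is not recorded here. One common knob for near real-time indexing in SOLR is the soft commit interval, which can be changed through the Config API. The sketch below only builds such a request; it assumes the soft commit interval is the setting of interest, and the host name, collection name, and interval value are hypothetical.

```python
import json
import urllib.request


def soft_commit_request(base_url, collection, max_time_ms):
    """Build (but do not send) a Config API request that sets the
    auto soft commit interval for one collection.

    Assumption: the soft commit interval is the setting to adjust.
    """
    url = f"{base_url.rstrip('/')}/solr/{collection}/config"
    payload = {
        "set-property": {
            "updateHandler.autoSoftCommit.maxTime": max_time_ms,
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host, collection, and interval.
    request = soft_commit_request(
        "http://solr.example.internal:8983", "policies", 15000
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Sending the request is a separate, deliberate step (for example with `urllib.request.urlopen(request)`), best tried on a non-production stage first and followed by another load test.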


Lessons learned
  1. Never take performance for granted.

  2. Performance risk assessment of all changes is a must.

  3. Load test early, often, and with every deployment.

  4. Agree on load test reporting standards.

  5. Ensure you have full-stack monitoring for all critical components so that you can see all services.


Even a minor change can result in massive performance problems. If you develop business-critical applications, continuous performance validation is necessary to manage your performance risks.

I wish you happy performance engineering!

