Technological trends in recent years have increasingly moved towards the development of services that are as small, fast, and resource-efficient as possible. This has led to ‘breaking up’ old, cumbersome, monolithic applications into chunks called microservices that serve a single functionality of a platform. Consequently, we are back to the divide et impera philosophy.
What can be done, however, when, try as one might, the performance of the microservice does not come close to what is expected or, what’s more, when it has a negative impact on the environment in which it runs?
This article presents practical attempts to improve the performance of a real microservice released into production, alongside their advantages and disadvantages. We hope our strategies also apply in other contexts or will at least light the spark of inspiration for those facing similar situations.
What are we facing?
As we all learned in middle school, every problem must have a hypothesis and a conclusion. Although we promised a story in the title, you’ll forgive us if we mix literature with maths. Our hypothesis goes something like this:
Once upon a time, there was a microservice (let’s call it Micron) past its prime: written in Java 11, exposing a REST API, using a caching mechanism, and running in production on 6 VMs (virtual machines). On the data storage side, Micron used a single node of a MySQL database, also deployed on a virtual machine. At the platform level, this microservice was among the most used (receiving an average of one million requests per hour) and was essential to users’ entire website surfing experience. Requests from clients worldwide were redirected from different spoke environments to the main hub environment, i.e. to the 6 available VMs. The distribution of requests to the VMs was done through a “gateway”, a proxy that employed a least connections algorithm and whose last software update dated back to 2017. During the time intervals when the platform was the busiest, Micron showed very poor performance: response times increased, the number of returned errors was significant, and the CPU usage of the database came close to, and sometimes even reached, the maximum allocated. Another microservice, a reactive one this time – Micron 2.0 – which served the same functionalities and would come as a replacement for Micron, was under development, but a solution to the overload problem needed to be found urgently.
Conclusion?
For a better platform usage experience, both at peak and off-peak hours, Micron’s response times needed to be improved, the number of errors reduced as much as possible, and database utilisation minimised to prevent potential incidents involving the loss of stored data.
What should be done?
The analysis above brings to mind a Prince Charming in distress. Like any self-respecting main character, Micron needs a makeover to improve their form, but this couldn’t be done without trusted sidekicks’ help.
The first brainstorming session reveals that the task at hand contains one high-priority component: reducing the number of database queries while also increasing the speed of the service and decreasing the error rate. There are various ways in which these results could be achieved:
- Making the service reactive and asynchronous would allow more threads to run in parallel, thus decreasing response time. However, this idea is invalidated because the Micron 2.0 initiative is already underway, and the extra effort would be in vain.
- Limiting the number of open connections – while this seems like a feasible solution to ensure the correctness of returned information and decrease the rate of queries sent to the database, we cannot afford to risk drastically worsening the response time, as the entire behaviour of the platform depends on requests to Micron.
- Rewriting all SQL queries to avoid the N+1 issue would address one of the most common causes of performance degradation in services that use relational databases. Let’s imagine that in one table we store a list of cars and in another a list of wheels, with a One-to-Many relationship between the two. To find out some information about the wheels, the service first queries the database for the complete list of cars and then, for each of the N cars, issues another query to get the corresponding list of wheels, which it then processes. In total, N+1 queries are performed, which could be reduced to a single one: the complete list of wheels, obtained directly from the corresponding table, stored in memory, and processed there (see the sketch after this list).
- Scaling the service – the higher the number of VMs, the lower the request/VM ratio. However, we cannot increase the number of virtual machines as easily as we would increase the number of pods in Kubernetes, because the process is costly for the company, both financially and in terms of the work required from multiple teams. Moreover, the possibility that performance will not live up to expectations is still on the table, which makes this an expensive gamble. This initiative would also not achieve the primary goal of reducing the load on the database.
- Using all available database nodes instead of just one – perhaps the safest way to minimise CPU usage. DB queries would go to three nodes instead of one without causing overhead. The problem, however, moves to the implementation side, taking into account several aspects, such as the topology of the database or the algorithm for assigning VMs to nodes.
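To make the N+1 pattern concrete, here is a minimal sketch in plain JDBC of the cars-and-wheels example above. The table and column names (`car`, `wheel`, `car_id`, `type`) are invented for illustration and are not Micron’s actual schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: schema and column names are assumptions, not Micron's real model. */
public final class WheelQueries {

    /** N+1 pattern: one query for the cars, then one additional query per car for its wheels. */
    static Map<Long, List<String>> wheelsPerCarNPlusOne(Connection db) throws SQLException {
        Map<Long, List<String>> wheelsByCar = new HashMap<>();
        try (PreparedStatement cars = db.prepareStatement("SELECT id FROM car");
             ResultSet carRows = cars.executeQuery()) {
            while (carRows.next()) {
                long carId = carRows.getLong("id");
                try (PreparedStatement wheels =
                         db.prepareStatement("SELECT type FROM wheel WHERE car_id = ?")) {
                    wheels.setLong(1, carId); // one extra round-trip per car
                    try (ResultSet wheelRows = wheels.executeQuery()) {
                        List<String> types = new ArrayList<>();
                        while (wheelRows.next()) {
                            types.add(wheelRows.getString("type"));
                        }
                        wheelsByCar.put(carId, types);
                    }
                }
            }
        }
        return wheelsByCar;
    }

    /** Single-query alternative: fetch all wheels once and group them in memory. */
    static Map<Long, List<String>> wheelsPerCarSingleQuery(Connection db) throws SQLException {
        Map<Long, List<String>> wheelsByCar = new HashMap<>();
        try (PreparedStatement wheels = db.prepareStatement("SELECT car_id, type FROM wheel");
             ResultSet rows = wheels.executeQuery()) {
            while (rows.next()) {
                wheelsByCar
                    .computeIfAbsent(rows.getLong("car_id"), id -> new ArrayList<>())
                    .add(rows.getString("type"));
            }
        }
        return wheelsByCar;
    }
}
```

The first method fires one extra query per car on top of the initial query; the second fetches all wheels in a single query and groups them in memory, which is exactly the reduction described above.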
What do we choose?
As mentioned in our hypothesis, Micron has reached an old age where they no longer readily accept major changes. As a result, we resorted to the options that would work best in the current context: manually rewriting queries to reduce the number of database interactions and implementing a mechanism to connect VMs to all three available MySQL nodes.
On the one hand, resolving the N+1 queries issue is time-consuming work relative to the complexity of the service, but fortunately, all the necessary information is available. In other words, we are talking about a process that involves making changes to the classes in the project that deal directly with requesting information. Blood, sweat, and tears might be required…
On the other hand, connecting a microservice to all nodes of a database raises several issues:
- How do we avoid deadlocks when different VMs, connected to different nodes, try to change the same data?
- How will the system decide which of the three nodes to connect to?
- How do we implement sticky sessions to ensure that a client will keep their connection to the same VM, and therefore to the same data source, throughout a session?
More than likely, there are other factors we have not considered or detailed here, but that only reflects their lower importance. You may rest assured that the burning issues have been carefully evaluated.
Therefore, taking into account the ecosystem in which Micron runs and the options we have, we’ll resort to changes both in the microservice codebase (a retry mechanism that allows write operations to the database to be retried whenever conflicts occur) and outside of it. Although there is a risk that the additional logic will slow the system down, we expect conflict situations to be rare and consider data integrity and the absence of errors more important than achieving the minimum possible response time.
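As a rough illustration of such a retry mechanism, here is a minimal sketch; the exception type, number of attempts, and back-off below are assumptions, not Micron’s actual values.

```java
import java.sql.SQLTransientException;
import java.util.concurrent.Callable;

/** Minimal retry sketch; limits and exception type are illustrative assumptions. */
public final class WriteRetrier {

    private static final int MAX_ATTEMPTS = 3;
    private static final long BACKOFF_MS = 50;

    public static <T> T retryOnConflict(Callable<T> writeOperation) throws Exception {
        SQLTransientException lastConflict = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return writeOperation.call();          // attempt the database write
            } catch (SQLTransientException conflict) { // e.g. deadlock or lock wait timeout
                lastConflict = conflict;
                Thread.sleep(BACKOFF_MS * attempt);    // simple linear back-off before retrying
            }
        }
        throw lastConflict; // give up after MAX_ATTEMPTS and surface the conflict
    }
}
```

A database write would then be wrapped along the lines of `WriteRetrier.retryOnConflict(() -> saveEntity(order))`, where `saveEntity` stands in for whatever write operation the service performs.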
Regarding the aspect of using all available MySQL nodes, it is possible to override, for each VM, the data source URL properties. Thus, knowing that we have 6 VMs and 3 nodes available, we intend to connect a pair of VMs to each node. While this solution does not guarantee permanent stability or perfect balancing of the number of requests sent to each node, it is an important step in the right direction and does not present the same complexity that, for example, creating a codebase-level decision algorithm would have.
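As a sketch of what such an override could look like, the snippet below builds the JDBC URL from an environment variable that each pair of VMs sets to a different node; the variable name, host names, database name, and credentials are placeholders, not the real configuration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Sketch of a per-VM data source override; all names and credentials are placeholders. */
public final class NodeAwareConnectionFactory {

    static Connection open() throws SQLException {
        // Each pair of VMs exports a different MYSQL_NODE_HOST value,
        // e.g. mysql-node-1, mysql-node-2 or mysql-node-3.
        String nodeHost = System.getenv().getOrDefault("MYSQL_NODE_HOST", "mysql-node-1");
        String url = "jdbc:mysql://" + nodeHost + ":3306/microndb";
        return DriverManager.getConnection(url, "micron_user", System.getenv("MYSQL_PASSWORD"));
    }
}
```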
To address the third problem, the plan is to decommission the current proxy and create a new NGINX proxy that will run in Kubernetes and act as a demultiplexer. It will ensure “sticky” sessions both between the client and one of the proxy’s pods (sticky session cookies at Ingress level) and, further on, between itself and one of the VMs. The distribution algorithm is a hash, computed from the IP attached to the request along with other relevant data, which redirects client requests to one of the VMs according to the resulting code. As a consequence, requests coming from the same place will go to the same destination for the duration of a session. Choosing this software avoids the risk of having to change the distribution algorithm inside the existing proxy, i.e. having to update it after 6 years of neglect. There’s no use in waking a sleeping bear…
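The snippet below is not the NGINX configuration itself, only a small sketch of the hashing idea it relies on: the same client key always maps to the same VM, so no shared state is needed to keep a session sticky. The host names and the choice of key are placeholders.

```java
import java.util.List;

/** Conceptual sketch of IP-based hash distribution; the VM host names are placeholders. */
public final class StickyUpstreamPicker {

    private static final List<String> VMS = List.of(
            "micron-vm-1", "micron-vm-2", "micron-vm-3",
            "micron-vm-4", "micron-vm-5", "micron-vm-6");

    /** The same client key always hashes to the same VM, which keeps the session sticky. */
    static String pickVm(String clientIp, String sessionId) {
        int hash = (clientIp + "|" + sessionId).hashCode();
        return VMS.get(Math.floorMod(hash, VMS.size()));
    }
}
```

In the real setup, the Ingress cookie keeps a client on the same proxy pod, and the proxy’s hash keeps that client on the same VM for the rest of the session.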
What have we learned?
The order in which changes were made to the service was not random. First, improvements to Micron’s code were implemented, related to the N+1 problem and the retry mechanism. Second, the new proxy was created, and the third stage was to redirect the microservice VMs to all the database nodes. Performance tests were run at the end of each phase, monitoring the system’s behaviour in each situation. Continuing the parallel with a Prince Charming fairy-tale, Micron had to go through various trials by fire until reaching the happy ending…
The data from each endurance test (run over 8 hours, with a total of 3810 users) can be found in the table below:
| Software under test | Average response time (ms) | HTTP 5xx errors (% / count) | Total requests (millions) |
| --- | --- | --- | --- |
| Micron – initial version + initial proxy + 1 MySQL node | 67.66 | 0.0017% (924) | 54.29 |
| Micron – final version + initial proxy + 1 MySQL node | 39.84 | 0% (0) | 54.32 |
| Micron – final version + NGINX + 1 MySQL node | 38.26 | 0% (0) | 54.32 |
| Micron – final version + NGINX + 3 MySQL nodes | 30.51 | 0% (0) | 54.34 |
It is easy to see that the biggest improvement came through adjustments in the codebase made in the first testing stage. In addition to decreasing the average response time by a resounding 41%, it was possible to eliminate server errors. However, a more detailed comparison revealed that some APIs, especially those performing write operations on the database, had a worse response time than the one achieved in the initial test. This indicates the “necessary evil” nature of the retry mechanism, the trade-off made at the expense of speed to decrease the error rate.
The second round of tests, this time run in conjunction with the new demultiplexer, had approximately the same results as the previous one. Nevertheless, the desired effect was achieved, with the test proving the proxy’s ability to do its job well without causing any failures. Although the distribution of requests to the VMs was not perfectly balanced, because the test engines have IPs that more often than not fall into the same hash bucket, none of the 6 virtual machines running the microservice went above 50% of its capacity. Thus, it can be said with confidence that in actual use cases of the platform, none of the service’s instances will end up overloaded.
The last round of endurance testing Micron underwent was the “icing on the cake”, combining all the efforts made so far. By distributing the load across all three MySQL nodes, the average response time was 54.9% (!!!) better than in the first test and 20.3% better than in the previous round of testing. The most significant achievement, however, was the reduction in CPU utilisation of the database, from 100% on the single node used initially to about 30% on each of the three nodes.
How does the story end?
The prince married the princess, uniting the two kingdoms, and they lived happily ever after. Or, at least, a relatively similar version, which can be applied to technology:
Based on the stellar results of the performance tests, it was decided to launch the new version of the service, along with all additional configurations. Even though the release process was a bumpy road, Micron weathered the elements and successfully received a version update. Since then, there have been no incidents of database overload or loss of data integrity; no customer complaints were registered, and the platform no longer requires occasional refreshes.
In the end, we never know when or why the next production outage will happen. As Margaret Mitchell’s Scarlett O’Hara said, “After all, tomorrow is another day.”