
Improving the performance of a NodeJS server, a case study

#case-study #nodejs #worker-threads #performance
I was tasked with upgrading a small NodeJS server nobody had touched in years, and with seeing whether I could do anything about its slow response times to (especially concurrent) requests.

The business problem

In our ecosystem, we managed several applications and APIs written in different languages. One of these was a NodeJS server responsible for rendering map tiles after retrieving the necessary data from a PostgreSQL database. The tile server ran on a GCP VM and was deployed manually from a Docker image.

This NodeJS server had not been touched for years and was severely outdated. Furthermore, users reported tile generation to be slow, especially when someone navigated around the map and triggered re-rendering. They also mentioned that tile generation failed intermittently, with no tile returned at all.

As a part of the upgrade effort, I was also tasked with adding an APM agent to the server so we could improve monitoring.

Approaches

Gradual upgrade of dependencies and Node

Because of years of neglect, it was obvious that the server needed its dependencies updated and any unmaintained libraries replaced. The server was running Node version 8, and an audit showed more than 200 security vulnerabilities in its outdated dependencies, including 7 critical and 32 high-severity ones.

The goal was to upgrade it to run on Node 20, but this could not be done in one go - or at least, doing so would have been risky and error-prone. After studying the dependencies for a bit, I concluded that the latest version of every dependency supported at least Node 12, so I decided to upgrade the application to that Node version first.

As a second step, I gradually updated every other dependency, starting with the most crucial ones: the Mapnik bindings and the framework. I had to replace Babel entirely with a newer version, because the old one had been discontinued. After this, upgrading to Node 20 was seamless.

Integrating an APM agent took minimal effort.
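The specific agent is beside the point, but for illustration, with a typical Node APM agent (Elastic's shown here purely as an assumed example - the service name and server URL are placeholders) the integration boils down to starting the agent before anything else is loaded:

```js
// Assumed example using elastic-apm-node; other vendors follow the same
// pattern of starting the agent before the rest of the app is required.
require('elastic-apm-node').start({
  serviceName: 'tile-server',                    // placeholder
  serverUrl: 'http://apm.example.internal:8200', // placeholder
});
```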

Performance improvement using worker threads

The more difficult question was how to approach the speed problems. First I had to identify the nature of the issue, so I reviewed the SQL queries, the JavaScript framework and Mapnik binding implementation, and the way the frontend was calling this API.

In the browser, I witnessed an alarming tile-loading pattern. We fired off dozens of requests at the same time, even for tiles outside of the user's current view. Each individual tile took several seconds to retrieve, and because browsers only allow up to six open connections per host, Chrome ended up loading the tiles in very noticeable batches of six. Rendering 50-60 tiles this way could easily take more than 10-15 seconds. All of this also suggested that the server was struggling to handle concurrent requests.

My idea was to implement worker thread pools. Worker threads are Node's solution for handling concurrent, CPU-intensive operations such as image rendering, which would otherwise block the single-threaded event loop. This was not a trivial task, because I wanted to keep using the existing framework if possible and to introduce only a small number of changes, given that there was no existing test automation for this application.
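The core of the pattern looks roughly like this - a minimal, self-contained sketch, not the production code (the pool size, file names and the simulated workload are all made up):

```js
// worker-pool.js - minimal sketch of a fixed-size pool over node:worker_threads:
// idle workers take tasks from a FIFO queue, one task per worker at a time.
const { Worker, isMainThread, parentPort } = require('node:worker_threads');

class WorkerPool {
  constructor(size, workerFile) {
    this.queue = [];   // tasks waiting for a free worker
    this.idle = [];    // workers with nothing to do
    this.workers = [];
    for (let i = 0; i < size; i++) {
      const worker = new Worker(workerFile);
      worker.on('message', (result) => {
        worker.job.resolve(result); // hand the result back to the caller
        worker.job = null;
        this.idle.push(worker);
        this.dispatch();
      });
      worker.on('error', (err) => {
        if (worker.job) worker.job.reject(err); // thread died mid-task
      });
      this.workers.push(worker);
      this.idle.push(worker);
    }
  }
  run(task) {
    return new Promise((resolve, reject) => {
      this.queue.push({ task, resolve, reject });
      this.dispatch();
    });
  }
  dispatch() {
    if (!this.idle.length || !this.queue.length) return;
    const worker = this.idle.pop();
    worker.job = this.queue.shift();
    worker.postMessage(worker.job.task);
  }
  destroy() {
    for (const worker of this.workers) worker.terminate();
  }
}

module.exports = { WorkerPool };

if (!isMainThread) {
  // Worker side: a stand-in for the CPU-heavy tile rendering.
  parentPort.on('message', (task) => {
    let burn = 0;
    for (let i = 0; i < 5e7; i++) burn += i; // simulate CPU-bound work
    parentPort.postMessage({ ...task, burn });
  });
} else if (require.main === module) {
  // Demo when run directly (node worker-pool.js): 8 fake tiles on 4 threads.
  const pool = new WorkerPool(4, __filename);
  Promise.all([...Array(8)].map((_, i) => pool.run({ tile: i }))).then((r) => {
    console.log('rendered', r.length, 'tiles');
    pool.destroy();
  });
}
```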

I also suspected sub-optimal, or perhaps entirely missing, caching; fixing or implementing it could further improve performance in the long run.

How I measured success

Auditing the dependencies could easily verify whether the application was secure and up to date from a package management point of view. But I also had to be able to validate any performance improvement that was achieved.

I had to verify the existence of the performance issues first, because I needed a baseline against which to measure any future improvement. So I set up one of the endpoints in Postman that I knew could return a vector tile, and ran a standard performance test with 30 VUs for 3 minutes. The average response time was over 14,000 ms (part of the reason it was so high is that I measured on my local machine, connecting to an overseas development database with real data, which increased the latency significantly). The server answered just over 200 requests in this timeframe, for an average throughput of 1 request per second, and the error rate was above 10%. The throughput was erratic, and response times varied widely throughout the load test.
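For repeatability, the same baseline can also be scripted. This is roughly what the Postman run translates to as a k6 script (a sketch; the endpoint URL and tile coordinates are placeholders):

```js
// load-test.js - k6 equivalent of the Postman baseline run.
import http from 'k6/http';
import { check } from 'k6';

export const options = { vus: 30, duration: '3m' };

export default function () {
  // placeholder tile endpoint; point this at the server under test
  const res = http.get('http://localhost:3000/tiles/10/301/384.png');
  check(res, { 'status is 200': (r) => r.status === 200 });
}
```

Running it with k6 run load-test.js reports average response time, throughput and error rate out of the box.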

Challenges

Implementing the worker pooling logic had some caveats: I had to decide what to move into the worker thread and what to keep out of it, and also how and when to create new pools to ensure optimal performance. When using the Mapnik bindings in JavaScript, the general advice is not to share the map resource between threads.
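Following that advice, each worker thread constructs and keeps its own map instance. A sketch of the per-thread setup, based on the node-mapnik README (the stylesheet path, tile size and message shape are placeholders):

```js
// render-worker.js - per-thread Mapnik setup (sketch; each worker owns its
// own Map object, which is never shared across threads).
const { parentPort } = require('node:worker_threads');
const mapnik = require('mapnik');

mapnik.register_default_input_plugins();

const map = new mapnik.Map(256, 256);
const ready = new Promise((resolve, reject) =>
  map.load('tile-style.xml', (err) => (err ? reject(err) : resolve())));

parentPort.on('message', async ({ bbox }) => {
  try {
    await ready;
    map.extent = bbox; // zoom to the requested tile bounds
    map.render(new mapnik.Image(256, 256), (err, image) => {
      if (err) return parentPort.postMessage({ error: err.message });
      image.encode('png', (err2, buffer) => {
        if (err2) return parentPort.postMessage({ error: err2.message });
        parentPort.postMessage({ tile: buffer });
      });
    });
  } catch (err) {
    parentPort.postMessage({ error: err.message });
  }
});
```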

I could not get it right on the first try, or the second, which was highly frustrating. I knew I had to think outside the box to solve this, so I reduced the difficulty of the task by eliminating the framework first and creating a mini-app with only Express, the Mapnik bindings and worker pools. A simple Express endpoint was much easier for me to understand than a foreign framework with its own way of dealing with middleware, dependency injection and much more (disclaimer: I am predominantly a PHP developer, and I have limited production experience with the async hellscape of JS frameworks). After I got this POC working, I used my findings to implement the same thing within the original framework, utilizing its middleware to manage the pooling logic.
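Stripped of everything else, the POC had roughly this shape (hypothetical file and route names, wired to the pool and worker sketches above):

```js
// server.js - shape of the Express POC (hypothetical names).
const express = require('express');
const { WorkerPool } = require('./worker-pool'); // pool sketch above

// Web Mercator (EPSG:3857) bounds for an XYZ tile.
function tileToBbox(z, x, y) {
  const world = 2 * Math.PI * 6378137; // world width in meters
  const origin = world / 2;
  const res = world / 2 ** z;          // meters per tile at zoom z
  return [x * res - origin, origin - (y + 1) * res,
          (x + 1) * res - origin, origin - y * res];
}

const pool = new WorkerPool(4, './render-worker.js');
const app = express();

app.get('/tiles/:z/:x/:y.png', async (req, res) => {
  try {
    const { z, x, y } = req.params;
    const result = await pool.run({ bbox: tileToBbox(+z, +x, +y) });
    if (result.error) return res.status(500).send(result.error);
    // postMessage clones the Buffer into a Uint8Array; rewrap before sending
    res.type('png').send(Buffer.from(result.tile));
  } catch (err) {
    res.status(500).send(err.message);
  }
});

app.listen(3000, () => console.log('tile server listening on :3000'));
```

The key design point is that the Express handler stays fully non-blocking: it only converts tile coordinates to a bounding box and awaits the pool, while all the Mapnik work happens on the worker threads.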

After I got it to work, it was time to measure the performance again.

Results

The audit reported that every known security vulnerability had been eliminated by updating the dependencies.

To present the performance difference, I prepared four scenarios - 1, 5, 10 and 30 VUs - and rendered the results side by side for comparison.

It was evident that using worker threads improved even single-user performance when querying the same endpoint (a common use case: a single user rendering a full map, where every tile uses the same pool). In the upgraded version, the throughput showed a much healthier pattern: the initial response time was higher while the threads' resources were first initialized, but then the response time dropped drastically and stayed steadily around the 500 ms mark - roughly 3x faster than the original design.

          Pre-upgrade   Post-upgrade
1 VU      1,600 ms      510 ms
5 VU      3,600 ms      500 ms
10 VU     7,600 ms      500 ms
30 VU     25,088 ms     1,073 ms

Table 1: Average response time, as measured in my local environment connecting to a remote database in North America - production numbers were <100 ms

But the difference truly became evident when we added more virtual users to the test. In the original design, the average response time climbed as concurrent users increased, while post-upgrade it stayed essentially the same as for a single user (only roughly doubling at 30 VUs) - all at the cost of minimally increased memory consumption. This was evidence that the new design provided faster responses even for a single user, but especially for multiple concurrent users.

          Pre-upgrade   Post-upgrade
1 VU      0%            0%
5 VU      0%            0%
10 VU     1.5%          0%
30 VU     19.57%        0%

Table 2: Error rate

Studying the error rate provided interesting conclusions as well. The original design did not handle concurrency well: as virtual users increased, the error rate began to climb too, mostly due to timeouts. In the upgraded application, the error rate stayed at zero even when the original was already well above a 10% error rate. This showed that the new design delivered responses more reliably, without unexpected failures.
