
Message queue replacement, a case study

#case-study #RabbitMQ #message-queue #PHP #PostgreSQL #php-enqueue
On this project, we had to reduce the time it took for a background job to deliver downloadable content for the client. The existing message queue solution, RabbitMQ, was not ideal for this, so we had to look for an alternative queuing solution.

The business problem that we had to solve

Our application was using RabbitMQ as its message queuing tool. RabbitMQ is a push-based system optimized for throughput, which it achieves through prefetch, a mechanism that allows multiple messages to be pushed to a consumer at a time, with a minimum of one message per queue.

Because this was a monolithic application, consumers were uniform and not distinguished by the types of jobs they processed: every consumer handled both long-running data-parsing jobs and shorter user-facing jobs, such as report or chart generation. This created an issue where a user-facing job could be held up behind a much longer process due to RabbitMQ's prefetch, even if there were idle consumers available.
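To illustrate the mechanic, here is a minimal sketch of such a uniform consumer, assuming php-amqplib; the connection details and queue names are hypothetical and not taken from the project:

```php
<?php
use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

// Hypothetical connection details and queue names, for illustration only.
$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();

// prefetch_count applies per consumer (global = false): at least one message
// per subscribed queue is pushed to this worker ahead of time. While the
// worker churns through a long data-parsing job, a prefetched user-facing
// message waits in its buffer even if other workers are idle.
$channel->basic_qos(null, 1, false);

$handler = function (AMQPMessage $message) {
    // ... process the job ...
    $message->ack();
};

$channel->basic_consume('data_parsing_jobs', '', false, false, false, false, $handler);
$channel->basic_consume('user_facing_jobs', '', false, false, false, false, $handler);

while ($channel->is_consuming()) {
    $channel->wait();
}
```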

Approaches

Considering the project's expected lifetime, only two avenues were ever considered for a solution. One of them was replacing RabbitMQ with enterprise Kafka, a robust, pull-based messaging solution already widely used by other parts of the organization. The other option was to use one of our existing database instances, PostgreSQL (with the DBAL transport) or Redis, as a message queue.

Kafka was rejected because, even though my POC looked good, it was too complex for the application's much smaller needs, and accommodating the organization's existing architecture would have meant a lot of development overhead, resulting in a higher estimate than product management was willing to commit to.

The database-based alternative looked much more promising, because it would allow us to keep using our existing third-party library; we would only have to switch the transport adapter. Redis was rejected because it did not support message prioritization in the way we needed it to, while the DBAL transport did, and the estimate for its implementation was also acceptable to product management.
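As a rough sketch of what the switch looks like on the producing side, assuming php-enqueue's enqueue/dbal transport (the DSN, queue name, message body, and priority value are hypothetical):

```php
<?php
use Enqueue\Dbal\DbalConnectionFactory;

// Hypothetical DSN and queue name; the real values are project-specific.
$factory = new DbalConnectionFactory('pgsql://user:pass@postgres:5432/app');
$context = $factory->createContext();
$context->createDataBaseTable(); // creates the queue table if it does not exist yet

$queue = $context->createQueue('user_facing_jobs');
$message = $context->createMessage('{"reportId": 42}');

// The DBAL transport stores messages in a database table with a priority
// column, which is the kind of prioritization Redis could not give us.
$context->createProducer()
    ->setPriority(5)
    ->send($queue, $message);
```

The consuming side keeps using the same library interfaces, which is what kept the estimate small: only the transport behind them changes.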

Challenges

The message queue communication became very fast because it could use the same open connection to the database that the application was already using to interact with its data. But the application was using read-only replicas for PostgreSQL, and this millisecond-level speed meant that in some cases, when a new record was created and a message for its processing was pushed, a consumer picked the message up faster than replica synchronization could complete. The result was records not being found, because the application was set to perform SELECT queries on the read-only connections.

To solve this, we had to ensure that some processes used the primary connection for read operations as well. This did not have a big impact on the database, as most of these processes were very write-heavy anyway, so they would have had to switch to the primary connection at some point.
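A minimal sketch of the fix, assuming Doctrine DBAL's primary/read-replica connection wrapper (the function, table, and column names are hypothetical):

```php
<?php
use Doctrine\DBAL\Connections\PrimaryReadReplicaConnection;

// Hypothetical consumer-side helper for a job that must see a freshly
// inserted record. Pinning the connection to the primary before reading
// means replica lag can no longer hide the new row.
function loadFreshRecord(PrimaryReadReplicaConnection $connection, int $recordId): ?array
{
    $connection->ensureConnectedToPrimary();

    $record = $connection->fetchAssociative(
        'SELECT * FROM records WHERE id = ?',
        [$recordId]
    );

    return $record === false ? null : $record;
}
```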

The regular QA process also uncovered a performance bottleneck: report generation could time out due to insufficient memory inside the Kubernetes pod when certain conditions were met. After establishing that this was not a regression introduced by the architectural change, I was able to improve the code to reduce memory consumption, more than doubling the maximum file size the application was able to generate. Further optimization was not possible without significant business logic changes, so I recorded the proposal and went on to analyze the risk of existing customers hitting this bottleneck.
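The exact changes were specific to our report code, but as a hypothetical illustration of the general direction, streaming rows through a generator and writing them out one at a time, instead of building the whole report in memory, is a common way to cut memory use in PHP:

```php
<?php
// Hypothetical sketch only, not the project's actual report code.
function reportRows(\PDO $pdo): \Generator
{
    $statement = $pdo->query('SELECT * FROM report_rows ORDER BY id');
    while ($row = $statement->fetch(\PDO::FETCH_ASSOC)) {
        yield $row; // only one row is held in memory at a time
    }
}

function writeReportCsv(\PDO $pdo, string $path): void
{
    $handle = fopen($path, 'w');
    foreach (reportRows($pdo) as $row) {
        fputcsv($handle, $row);
    }
    fclose($handle);
}
```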

The data revealed that, with the memory optimization, this performance bottleneck became an extreme edge case: a very unlikely, and currently non-existent, use case for our customers. We advised product management of this small risk and communicated workarounds and troubleshooting options to client services.

How we measured success

Non-functional requirements were not clearly defined for this project, so evaluating the success of this solution was another problem we needed to solve. We decided to take a data-driven approach, as we were already recording metrics about when a message was pushed to the queue, when it was picked up for processing by a consumer, and when processing finished. With these metrics, we were able to calculate how much time certain messages spent waiting.
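For example, given the three recorded timestamps per message, the waiting time and a direct percentile over it can be derived along these lines (a hypothetical sketch; the field names are illustrative, not our actual metrics schema):

```php
<?php
// Each metric entry holds Unix timestamps for when the message was pushed,
// when a consumer picked it up, and when processing finished.
function waitTimes(array $metrics): array
{
    return array_map(
        fn (array $m): int => $m['picked_up_at'] - $m['pushed_at'],
        $metrics
    );
}

// Nearest-rank percentile, e.g. percentile($waits, 0.95) for the 95th percentile.
function percentile(array $values, float $p): float
{
    sort($values);
    $index = max(0, (int) ceil($p * count($values)) - 1);
    return (float) $values[$index];
}

$metrics = [
    ['pushed_at' => 100, 'picked_up_at' => 102, 'finished_at' => 130],
    ['pushed_at' => 105, 'picked_up_at' => 140, 'finished_at' => 200],
];

$waits = waitTimes($metrics);                 // [2, 35]
$average = array_sum($waits) / count($waits); // 18.5
$p95 = percentile($waits, 0.95);              // 35
```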

It is also important to note that, at the time, Kubernetes auto-scaling was not enabled for this application, so we were running a static number of consumer instances. Whereas before, a single long-running process could hold up multiple queues and leave the majority of consumers idle, after this architectural change we expected new messages to be picked up immediately, as soon as there was an available consumer, since the majority of our processes did not run longer than a few minutes.

So we recreated scenarios in which we had observed processing delays in the past, and measured the processing delay in both architectures. The two scenarios were the following:

  1. There was a long-running process in the message pool, but there were many available consumers when a high-priority message was added.
  2. There were more messages in the pool than available consumers when a high-priority message was added.

Evaluating the results

In the RabbitMQ architecture, the average time a high-priority message spent waiting for processing was 19 seconds, and 64% of the messages of this type were affected by some degree of delay, up to a maximum of more than 60 minutes in extreme cases. Based on the standard deviation, it took an estimated 126 seconds for 95% of all messages to be picked up by a consumer.

                  RabbitMQ   PostgreSQL
Avg delay         19s        0s
95th percentile   126s       15s

Using PostgreSQL as a message queue reduced the average delay to zero and the maximum delay to just a few minutes, with 95% of all messages picked up within 15 seconds, which was a drastic improvement.

In the scenario where a long-running process would previously have held up the high-priority message, the lack of prefetch allowed an idle consumer to pick it up immediately in the new architecture, eliminating this problem entirely.

In the scenario where the message pool was full of all sorts of messages, and their number exceeded the number of available consumers, the high-priority message was picked up as soon as the first consumer was freed up, which never took more than a few minutes in the new architecture.

This convinced product management that, in terms of performance, it was worth moving forward with this solution.
