
Choosing Message Queues for Real Systems

A practical guide to choosing the right message queue for async jobs, event streaming, retries, and scalable services.

We live in an era where it is rare for everything to be done in a single request-response cycle. Take placing an order, uploading a file, or sending a message. These look like simple tasks, but in reality a bunch of other work happens synchronously or asynchronously in the background. Some work must be done immediately, some can wait a while, and some has to happen even if a downstream service is temporarily down.

To solve this problem, message queues have become an important part of almost all real-life applications.

Message queues do not just act as an intermediary between services. Implemented correctly and thoughtfully, they help distribute tasks, define a protocol for failure retries, store events, and let different services evolve independently.

Many different message queues are available, and in some cases they can be used interchangeably, but they are designed to solve different problems: their access patterns and the workloads they are built to handle differ. Popular options include RabbitMQ, Apache Kafka, Amazon SQS, Redis Streams, NATS, and Google Pub/Sub.

RabbitMQ

In RabbitMQ, producers publish messages to exchanges. The exchange then routes each message to one or more queues based on routing and binding rules. Queues are basically ordered collections of messages. Consumers send back an acknowledgement after processing a message. This is ideal for scenarios where you want task distribution with routing, along with retries on failure.

Take an example of an e-commerce platform. When a user places an order, the producers can publish multiple messages, such as:

  • order.created
  • payment.capture.requested
  • invoice.generate
  • email.confirmation.send

If we look closely at these tasks, we can see that different consumers can process them independently. RabbitMQ works really well here because these are independent jobs. Each job should be picked up by exactly one consumer, acknowledged if processed successfully, retried on failure, or moved to a dead-letter queue if the number of retries exceeds a threshold, so that it can be inspected and replayed later.
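The acknowledge/retry/dead-letter flow above can be sketched as a tiny in-memory worker loop. This is an illustration only, not a real RabbitMQ client such as pika, and the MAX_RETRIES threshold is a made-up setting:

```python
from collections import deque

MAX_RETRIES = 3  # illustrative threshold, not a RabbitMQ default

def run_worker(queue, handler):
    """Drain the queue; retry failures, dead-letter after MAX_RETRIES."""
    dead_letter = []
    while queue:
        msg = queue.popleft()
        try:
            handler(msg["body"])                # do the actual job
            # success acts as the acknowledgement: the message is gone
        except Exception:
            msg["retries"] = msg.get("retries", 0) + 1
            if msg["retries"] >= MAX_RETRIES:
                dead_letter.append(msg)         # park for later inspection
            else:
                queue.append(msg)               # requeue for another attempt
    return dead_letter

def handler(body):
    if body == "always.fails":
        raise RuntimeError("simulated failure")

jobs = deque([{"body": "invoice.generate"}, {"body": "always.fails"}])
dlq = run_worker(jobs, handler)   # the failing job ends up dead-lettered
```

Note that the failing message never blocks the healthy one: after the retry budget is spent it is moved aside rather than requeued forever.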

Use cases of RabbitMQ:

  • To run background tasks: Workers pick up tasks asynchronously.
  • Set up email or notification systems: Failures can be retried.
  • Task Routing: Exchanges provide flexible routing. I have personally used this once when I wanted to route GPU-intensive tasks to a consumer running on a GPU machine.
  • Business Workflows: Due to acknowledgements, messages don’t get lost.

Apache Kafka

On the official website, Apache Kafka describes itself as an open-source distributed event streaming platform. The reason is that its internal architecture is not that of a transient buffer; rather, it is an immutable, distributed commit log. Kafka consists of topics, partitions, offsets, retention policies, and consumer groups. Within a partition, Kafka delivers messages in order, while a consumer group allows consumers to divide the workload across different workers.
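The commit-log idea can be made concrete with a small in-memory sketch of a single partition: an append-only list read by offset. No real Kafka is involved here:

```python
class Partition:
    """In-memory sketch of one Kafka-style partition: an append-only
    log whose records are read by offset and never mutated."""
    def __init__(self):
        self.log = []

    def append(self, record):
        self.log.append(record)
        return len(self.log) - 1      # the new record's offset

    def read_from(self, offset):
        return self.log[offset:]      # the log itself is left untouched

p = Partition()
for event in ["ride.requested", "driver.assigned", "ride.started"]:
    p.append(event)

# Each consumer tracks its own offset, so reads are independent:
fraud_view = p.read_from(0)           # sees every record
late_view = p.read_from(2)            # only records from offset 2 onward
```

Because consuming is just "read from an offset", two consumers never interfere with each other, which is exactly what lets many services share one topic.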

To understand its usage, we will look at some possible events in a ride booking service:

  • ride.requested
  • driver.assigned
  • driver.arrived
  • ride.started
  • ride.completed
  • payment.settled

Multiple services may need the data from these events. A fraud service may want to check whether there is unusual activity going on, a pricing service may want to track how many requests are coming in so it can tweak prices, and so on.

Kafka is useful in scenarios where data can be retained and replayed. For example, I can introduce a new analytics service, and it can still receive all past events without asking the producers to publish them again. Some other useful scenarios are:

  • Event-driven architecture: Multiple services want to consume the same event’s data
  • Data Pipelines: Events can continuously flow into warehouses and lakes.
  • Audit Trails: Data is retained so it is easy to read all the event information.
  • High Throughput Streams: Kafka is designed to be highly scalable.
  • Replay Requirements: Consumers can reprocess the events by updating offsets.
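The replay bullet above can be sketched with per-consumer committed offsets: an existing service continues from where it left off, while a new one starts at offset 0 and replays everything. Service names here are made up:

```python
# A partition's records plus committed offsets per consumer group.
log = ["ride.requested", "driver.assigned", "ride.started", "payment.settled"]
committed = {"pricing-service": 4}    # pricing is fully caught up

def consume_from(log, offset):
    """Return the unread records and the offset to commit afterwards."""
    return log[offset:], len(log)

committed["analytics-service"] = 0    # new consumer: replay from the start
backlog, next_offset = consume_from(log, committed["analytics-service"])
```

"Replaying" is nothing more than reading from an earlier offset; the producers are never involved, which is why new services can be added so cheaply.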

Kafka should not be used for small background jobs, as it brings far more operational and conceptual complexity than a traditional task queue.

Amazon SQS

SQS is usually used when you don't want to run your own broker or manage queue infrastructure. It supports high-throughput, at-least-once delivery, and can be configured as a FIFO queue when ordering and deduplication are important.

A common real-world use is image processing. When a user uploads an image, the upload service sends a message to SQS. A worker receives the message through SQS and processes the image asynchronously. If the worker fails, the message becomes visible again once its visibility timeout expires, and it can be retried.
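The visibility-timeout mechanic can be sketched in memory with a simulated clock. This is not the boto3 API, just the idea: a received message is hidden for a while, and reappears if it is not deleted in time:

```python
class VisibilityQueue:
    """In-memory sketch of an SQS-style visibility timeout.
    `now` is a simulated clock in seconds, for determinism."""
    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}            # id -> (body, visible_at)
        self._next_id = 0

    def send(self, body):
        self.messages[self._next_id] = (body, 0.0)
        self._next_id += 1

    def receive(self, now):
        for mid, (body, visible_at) in self.messages.items():
            if now >= visible_at:
                # hide the message until the timeout expires
                self.messages[mid] = (body, now + self.visibility_timeout)
                return mid, body
        return None

    def delete(self, mid):
        self.messages.pop(mid, None)  # the "ack": remove permanently

q = VisibilityQueue(visibility_timeout=30)
q.send("resize image 42")
first = q.receive(now=0)       # worker picks it up, then crashes
hidden = q.receive(now=5)      # still invisible: no duplicate work
retry = q.receive(now=31)      # visible again, so it can be retried
```

The key point: the crashed worker never acknowledged (deleted) the message, so the queue hands it out again automatically after the timeout.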

Common use cases for SQS are:

  • AWS-based systems: SQS integrates cleanly with the rest of the AWS ecosystem.
  • Simple async jobs: No need to manage a broker. All things are taken care of by AWS.
  • Burst handling: It is designed to absorb spikes in traffic.
  • Serverless workflows: It is often used in pairs with AWS Lambda.
  • FIFO: FIFO queues process messages within a message group in order, with deduplication support.

SQS is great when your application is running on AWS, especially when the team does not want to have running infrastructure for the reliability of messages.

Redis Streams

Redis Streams can be used when your system already uses Redis and you need a lightweight event stream with consumer groups. The official documentation describes streams as useful when you want to record and simultaneously syndicate events in real time. Consumer groups allow multiple consumers to process different entries from the same stream.

We can use it in a real-time activity feed. Users can perform multiple actions such as:

  • user.followed
  • post.liked
  • comment.created
  • profile.updated

These events can then be appended to the Redis Stream. Using the streams, consumers can then update feeds, send notifications, update counters, etc.
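The stream-plus-groups model can be sketched in memory. This is not the real redis-py XADD/XREADGROUP API, only the shape of it: every group sees the whole stream, while entries within a group are handed out one at a time:

```python
class Stream:
    """In-memory sketch of a Redis-Streams-style stream: an append-only
    list of entries, with an independent read cursor per consumer group."""
    def __init__(self):
        self.entries = []                  # append-only, like XADD
        self.group_cursors = {}            # group name -> next index

    def add(self, entry):
        self.entries.append(entry)

    def read_group(self, group):
        """Deliver the next unseen entry for this group, or None."""
        cursor = self.group_cursors.get(group, 0)
        if cursor >= len(self.entries):
            return None
        self.group_cursors[group] = cursor + 1
        return self.entries[cursor]

s = Stream()
for e in ["user.followed", "post.liked", "comment.created"]:
    s.add(e)

feed_first = s.read_group("feed-service")            # its own cursor
notify_first = s.read_group("notification-service")  # independent cursor
```

Each group progresses independently, so the feed updater and the notification sender can both consume the full activity stream at their own pace.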

Redis streams can be useful when:

  • Lightweight event processing: The stream model is extremely simple: an append-only log of entries.
  • Redis already exists: Infrastructure overhead would be less.
  • Real-time features: It is a good fit for fast-moving events.
  • Consumer groups: Streams support consumer groups, which let multiple workers share the processing of data.
  • Temporary event history: Streams are usually trimmed so that only recent entries are retained.

Redis Streams has a lot of pros, but since streams are typically trimmed to recent entries and deliver messages at least once, we should be careful before using it for critical long-term event storage.

NATS and JetStream

NATS is preferred when the requirement is for lightweight, high-performance messaging between different services. JetStream adds persistence, replay, and durable consumers. JetStream consumers also track delivery and acknowledgements.

It is useful for scenarios where fast communication is needed. For example:

  • inventory.update
  • pricing.calculate
  • device.connected
  • session.updated

NATS is a good fit when low-latency communication is a must and the team does not want a complex operational model.
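Subjects like the ones above are hierarchical, and NATS subscribers can use wildcards: `*` matches exactly one token and `>` matches one or more trailing tokens. A small sketch of that matching rule (this is the semantics, not the nats-py client):

```python
def subject_matches(pattern, subject):
    """Sketch of NATS-style subject matching:
    '*' matches exactly one token, '>' matches one or more trailing tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            return len(s_tokens) > i      # '>' needs at least one more token
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)

# A subscriber on "inventory.*" receives any single-token suffix:
hit = subject_matches("inventory.*", "inventory.update")
miss = subject_matches("inventory.*", "inventory.update.bulk")
deep = subject_matches("device.>", "device.sensor.temperature")
```

This is what makes subject design so important in NATS: a well-chosen hierarchy lets one subscription cover a whole family of events.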

NATS use cases are:

  • Low-latency service messaging: Lightweight protocol and architecture.
  • Request-reply between services: Consumers can send a response back for the messages they consume, a natural fit for service-to-service communication.
  • Event Notifications: Simple publish-subscribe.
  • Durable messaging: Persistence and replay are supported with JetStream.
  • IoT systems: It combines high performance with lightweight operations.

Google Pub/Sub

Google Pub/Sub provides a managed asynchronous messaging service that decouples communication between services. It consists of topics and subscriptions. Subscribers can pull messages, or Pub/Sub can push data to an endpoint. Messages can also be delivered in order when ordering is enabled for a subscription.

We can use it in a cloud analytics pipeline, where each subscription feeds a different consumer with the same stream of events. Pub/Sub is useful if your application runs on Google Cloud Platform and you need a scalable, managed event backbone.
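The topic/subscription fan-out can be sketched in memory. This is not the google-cloud-pubsub API; the subscription names are made up. The point is that each subscription receives its own independent copy of every published message:

```python
class Topic:
    """In-memory sketch of Pub/Sub fan-out: every subscription attached
    to a topic gets its own copy of each published message."""
    def __init__(self):
        self.subscriptions = {}

    def create_subscription(self, name):
        self.subscriptions[name] = []

    def publish(self, message):
        for backlog in self.subscriptions.values():
            backlog.append(message)    # an independent copy per subscription

    def pull(self, name):
        backlog = self.subscriptions[name]
        return backlog.pop(0) if backlog else None

events = Topic()
events.create_subscription("bigquery-loader")
events.create_subscription("alerting")
events.publish("clickstream batch 1")

a = events.pull("bigquery-loader")
b = events.pull("alerting")    # same message, delivered independently
```

Pulling from one subscription never affects another, which is what lets several pipelines consume the same event backbone without coordinating.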

Common Mistakes in Real Systems

  • Assuming queues remove the need for idempotency: Most queue-based systems can deliver a message more than once under failure conditions, so consumers should be designed to handle duplicates idempotently.
  • Putting too much business logic into the queue layer: A queue is designed to transport work or events. We should keep the business logic minimal on the queue layer.
  • Ignoring Dead Letter Queues: The entire system should not remain blocked because a message is failing repeatedly. This particular message should be moved aside, inspected, fixed, or replayed so that the system continues to work smoothly.
  • Choosing Kafka for every asynchronous task: Kafka is an excellent choice for event streaming, but a simple email job queue does not always need Kafka.
  • Choosing a queue only because the team already knows it: Familiarity of the tech stack to the developers matters a lot, but it should not overshadow the real purpose of the message queues.
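The idempotency point deserves a sketch. One common pattern is to track processed message IDs so a redelivered message becomes a no-op; in a real system this set would live in a durable store such as a database, not in memory, and the names here are illustrative:

```python
# Sketch of an idempotent consumer: a processed-ID set makes a
# duplicate delivery harmless instead of charging twice.
processed_ids = set()
charges = []

def handle_payment(message):
    if message["id"] in processed_ids:
        return "skipped duplicate"
    charges.append(message["amount"])   # the side effect we must not repeat
    processed_ids.add(message["id"])
    return "processed"

msg = {"id": "order-123", "amount": 50}
first = handle_payment(msg)
second = handle_payment(msg)   # redelivered after a failure: no double charge
```

With this in place, at-least-once delivery from the queue effectively becomes exactly-once processing from the business logic's point of view.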

Conclusion

Message queues are one of the most widely used tools in building reliable distributed systems, but the choice of which one totally depends on the kind of problem statement we are trying to solve.

RabbitMQ should be your choice when reliable work queues and routing are needed. Kafka would be a good option when events have to be stored, replayed, and consumed by multiple consumers. SQS and Pub/Sub are excellent when you want to manage respective cloud queuing. Redis Streams should be used for lightweight stream processing in Redis-based systems. NATS is a good option when low-latency service messaging matters.

The best queue is not the one with the most power; it is the one whose delivery model, ordering guarantees, retry behaviour, and operational cost match your system.
