RabbitMQ Checklist For Production Environments - A Complete Guide

Learn how to avoid the most common setup errors and enjoy a more stable RabbitMQ environment.

This is the checklist to review before you set up your RabbitMQ production environment, and to refer back to often afterwards. It covers the most common errors, along with tips and tricks for creating the most stable environment possible.

1. Decide the correct number of servers in your RabbitMQ cluster

First of all, decide how many servers (nodes) you want before you set up your environment. As a reference, CloudAMQP gives customers the option of one, three, or five nodes. For a cluster, choose an odd number of nodes so that a majority of the nodes (a quorum) can still be formed if some of them fail.

Single-node plans are the fastest, and all data written to disk is safe as long as you publish persistent messages to durable queues. Data on a single node is always consistent, since nothing needs to be written to another RabbitMQ server. Adding more nodes to the cluster gives higher availability.
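The case for an odd number of nodes comes down to majority arithmetic: a cluster of n nodes needs ⌊n/2⌋ + 1 of them to form a majority, and it can lose the rest. The short Python sketch below tabulates this. The function names are ours, chosen for illustration.

```python
def majority(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority of the cluster."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while the remaining nodes still form a majority."""
    return nodes - majority(nodes)


for n in range(1, 6):
    print(f"{n} node(s): majority={majority(n)}, survives {tolerated_failures(n)} failure(s)")
```

Going from three to four nodes adds no fault tolerance, because both sizes survive only one failure. Five nodes survive two.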

2. Pick a stable RabbitMQ and Erlang version

Start with a stable RabbitMQ and Erlang version, and upgrade to newer versions when possible.

Keeping your cluster updated with the latest RabbitMQ and Erlang versions should be a high priority. Newer versions give you access to the latest features, include bug fixes, perform better, and improve the security of your data. Troubleshooting is also easier when you run the most current stable version.

CloudAMQP's default version is the latest version recommended by CloudAMQP.

3. Use environment variables

Environment variables let you set values outside your application and then use them within it. When connecting to RabbitMQ, store the connection string in an environment variable. Use environment variables for sensitive data such as passwords. They must be set before the application starts; otherwise, default settings are assumed.

With environment variables, you only need to change credentials in one place if you move to another server. They also keep sensitive data out of your source code, so you don't risk committing the connection string to a repository that may be public.
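Here is a minimal Python sketch of reading the connection string from the environment. It assumes the variable is named `CLOUDAMQP_URL`; use whatever name your platform or deployment actually sets. The sketch fails loudly when the variable is missing instead of silently falling back to a default, and it masks the password before the URL is logged. Both function names are hypothetical.

```python
import os
from urllib.parse import urlsplit, urlunsplit


def amqp_url_from_env(var: str = "CLOUDAMQP_URL") -> str:
    """Return the broker URL from the environment, or raise if it is unset."""
    url = os.environ.get(var)
    if not url:
        raise RuntimeError(f"environment variable {var} is not set")
    return url


def redact(url: str) -> str:
    """Replace the password in an AMQP URL with *** so the URL is safe to log."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```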

4. Do not hard code IP addresses

Do not assume that a service will always keep the same IP address. If the address changes, every hardcoded IP has to be updated as well. After all, one of the advantages of RabbitMQ and message queues is that they minimize hardcoded dependencies between services.

On CloudAMQP: Use the cluster hostname when connecting your clients. Do not use the IP address of the server when connecting, as that might change.
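The following Python sketch is a cheap startup guard that rejects connection URLs whose host is a literal IP address. A hardcoded IP is then caught when the application starts, not when the server's address changes. The function name is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def ensure_hostname(url: str) -> str:
    """Return `url` unchanged if its host is a name; raise if it is an IP literal."""
    host = urlsplit(url).hostname
    if not host:
        raise ValueError("connection URL has no host")
    try:
        ipaddress.ip_address(host)  # succeeds only for IPv4/IPv6 literals
    except ValueError:
        return url  # not an IP address, so it is a hostname
    raise ValueError(f"use the cluster hostname instead of the IP address {host}")
```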

5. Publish persistent messages to durable queues

If you use classic queues and cannot afford to lose any messages, declare the queue as "durable" when you create it, and publish your messages with delivery mode "persistent" (delivery mode 2). Note that a durable queue only guarantees that the queue definition survives a server restart, not the messages in it; messages survive a restart only if they were also published as persistent. To check whether a queue is durable, open the Queues tab in the RabbitMQ Management UI: a "D" in the queue's features means it is durable.
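The sketch below shows both settings with the Python client pika, which is an assumption on our part; other clients expose the same AMQP fields under different names. `channel` is assumed to be an open pika `BlockingChannel`. The properties class is a parameter that defaults to `pika.BasicProperties`, so the helper itself does not need pika installed until it publishes. The helper names and the queue name are hypothetical.

```python
PERSISTENT_DELIVERY_MODE = 2  # AMQP delivery mode: 1 = transient, 2 = persistent


def declare_durable_queue(channel, name: str):
    """Declare a classic queue whose definition survives a broker restart."""
    return channel.queue_declare(queue=name, durable=True)


def publish_persistent(channel, queue: str, body: bytes, properties_cls=None) -> None:
    """Publish through the default exchange with delivery mode 2 (persistent)."""
    if properties_cls is None:
        import pika  # pika client library, imported on first publish

        properties_cls = pika.BasicProperties
    channel.basic_publish(
        exchange="",  # the default exchange routes by queue name
        routing_key=queue,
        body=body,
        properties=properties_cls(delivery_mode=PERSISTENT_DELIVERY_MODE),
    )
```

Both halves matter. A persistent message in a non-durable queue is lost together with the queue on restart, and a durable queue keeps only the messages that were published as persistent.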

6. Use Quorum Queues instead of classic mirrored queues

Quorum queues replicate data with a variant of the Raft protocol, which has become the de facto industry standard for distributed consensus. This replication is more efficient and reliable than that of classic mirrored queues. Quorum queues were designed to resolve both the performance problems and the synchronization failures of mirrored queues, and they are both safer and achieve higher throughput. Quorum queues were introduced in RabbitMQ 3.8, and we recommend switching to them; see The reasons you should switch to Quorum Queues.
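A queue becomes a quorum queue through the `x-queue-type` argument when it is declared. Quorum queues must also be durable. Below is a pika-style sketch; the client, the helper name, and any queue name you pass in are assumptions.

```python
QUORUM_ARGUMENTS = {"x-queue-type": "quorum"}


def declare_quorum_queue(channel, name: str):
    """Declare a replicated quorum queue (RabbitMQ 3.8 or later).

    Quorum queues are always durable and cannot be exclusive or auto-delete,
    so durable=True is required here.
    """
    return channel.queue_declare(
        queue=name,
        durable=True,
        arguments=dict(QUORUM_ARGUMENTS),  # copy so callers cannot mutate the constant
    )
```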

7. Use short queues

If your use case allows it, keep queues short. We recommend having no more than 10,000 messages in all queues at any given time. When a queue is empty and consumers are idle and waiting, a message that arrives in the queue goes straight out to a consumer. A queue that holds many messages, on the other hand, can place a heavy load on RAM. When that happens, RabbitMQ starts flushing messages to disk to free up RAM, which soon slows down queueing.
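A client can check queue length with a passive declare, which returns the current message count without creating the queue. The sketch assumes pika-style channels and a list of queue names that you supply. The helper names are hypothetical, and the 10,000 ceiling comes from the recommendation above.

```python
RECOMMENDED_MAX_MESSAGES = 10_000  # the checklist's ceiling across all queues


def queue_depth(channel, name: str) -> int:
    """Ready-message count of an existing queue, via a passive declare.

    A passive declare never creates the queue. If the queue does not exist,
    the broker closes the channel with a 404 error and pika raises an exception.
    """
    return channel.queue_declare(queue=name, passive=True).method.message_count


def total_depth(channel, names) -> int:
    """Sum of ready messages across the given queues."""
    return sum(queue_depth(channel, name) for name in names)


def exceeds_recommendation(channel, names, limit: int = RECOMMENDED_MAX_MESSAGES) -> bool:
    """True if the given queues together hold more than `limit` ready messages."""
    return total_depth(channel, names) > limit
```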

8. Use long-lived connections

RabbitMQ is optimized for long-lived connections. Establishing a connection is relatively resource-intensive and requires many TCP packets to be exchanged. Channels can be opened and closed more frequently if needed, but keep connections open whenever possible. If you have a client that cannot keep its connection open, you can use the AMQP Proxy.
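A minimal "connect once, reuse" pattern in Python looks like this. The class name is hypothetical. The connection factory is passed in, and with pika it would typically wrap `pika.BlockingConnection(pika.URLParameters(url))`. The class relies only on the `is_open`/`is_closed` properties and the `channel()` and `close()` methods, which pika's `BlockingConnection` provides.

```python
class BrokerConnection:
    """Open one AMQP connection on first use and keep reusing it.

    `connect` is any zero-argument callable that returns an open connection.
    """

    def __init__(self, connect):
        self._connect = connect
        self._connection = None

    @property
    def connection(self):
        # Connect only if there is no connection yet or the previous one was closed.
        if self._connection is None or self._connection.is_closed:
            self._connection = self._connect()
        return self._connection

    def channel(self):
        """Open a channel on the shared connection; channels are cheaper than connections."""
        return self.connection.channel()

    def close(self) -> None:
        """Close the shared connection, typically once at application shutdown."""
        if self._connection is not None and self._connection.is_open:
            self._connection.close()
        self._connection = None
```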

9. Secure connections

TLS encrypts connection traffic and also provides a way to verify (authenticate) peers, using certificates as digital identities. A RabbitMQ node with TLS enabled needs a file containing the certificates it trusts, plus its own certificate and private key files. For security, it is a best practice to use TLS for all connections made over the public internet. For most AMQP clients, enabling it is as simple as replacing amqp:// with amqps:// in the connection URL.
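If your client takes a URL, you can refuse anything that is not `amqps://` before connecting. The Python sketch below does this; the function name is hypothetical. It also flags a common slip: keeping the plain-AMQP port 5672 after switching the scheme, since AMQPS listens on 5671 by default.

```python
from urllib.parse import urlsplit

PLAIN_AMQP_PORT = 5672  # default port for unencrypted AMQP; AMQPS defaults to 5671


def require_tls(url: str) -> str:
    """Return `url` unchanged if it uses AMQP over TLS, otherwise raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme.lower() != "amqps":
        raise ValueError(f"refusing unencrypted {parts.scheme}:// URL; use amqps://")
    if parts.port == PLAIN_AMQP_PORT:
        raise ValueError("amqps:// with port 5672 is likely wrong; TLS listens on 5671 by default")
    return url
```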

10. Separate connections for publishers and consumers

If publishers and consumers share a connection, consumers cannot receive messages while that connection is under flow control, which makes the flow problem worse. RabbitMQ applies back pressure on the TCP connection when a publisher sends more messages than the server can handle. If you consume on the same TCP connection, the server might not receive the message acknowledgments from the client, which hurts consumer performance. With slower consumers, the server becomes even more overwhelmed.
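In code, separating publishers and consumers simply means opening two connections and never publishing on the consumer's one. The following Python sketch (the class name is hypothetical) takes a connection factory. The pika factory shown in the docstring is one possible way to label the connections so they appear separately in the management UI, not the only one.

```python
class MessagingClient:
    """Publish and consume over two separate long-lived connections.

    `connect(name)` is any callable that opens and returns one connection; the
    name is only a label. With pika it could look like:

        def connect(name):
            params = pika.URLParameters(url)
            params.client_properties = {"connection_name": name}
            return pika.BlockingConnection(params)
    """

    def __init__(self, connect):
        self._publisher_conn = connect("publisher")
        self._consumer_conn = connect("consumer")
        # If RabbitMQ blocks the publisher connection for flow control, the
        # consumer connection can still receive deliveries and send acks.
        self.publish_channel = self._publisher_conn.channel()
        self.consume_channel = self._consumer_conn.channel()

    def close(self) -> None:
        for conn in (self._publisher_conn, self._consumer_conn):
            if conn.is_open:
                conn.close()
```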

11. Remove unused queues

Even unused queues take up some resources, such as the queue index and management statistics. Don't leave unused queues behind; remove them. There are three ways to delete a queue automatically:

  • Set a TTL policy on the queue. For example, a TTL policy of 28 days deletes queues that haven't been consumed from for 28 days.
  • An auto-delete queue is deleted when its last consumer has canceled or when the channel/connection is closed (or when it has lost the TCP connection with the server).
  • An exclusive queue can only be used (consumed from, purged, deleted, etc.) by its declaring connection. Exclusive queues are deleted when their declaring connection is closed or gone (e.g., due to underlying TCP connection loss).
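For the TTL option, RabbitMQ calls the policy key `expires` and measures it in milliseconds, so 28 days is 2,419,200,000 ms. The Python sketch below builds the policy definition and a matching `rabbitmqctl set_policy` command for a self-managed node. The policy name `expire-unused` and the `^tmp\.` queue-name pattern are hypothetical, and a policy applies only to queues whose names match its pattern.

```python
import json

DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds


def expiry_policy_definition(days: int) -> str:
    """Policy definition that deletes matching queues after `days` days unused."""
    if days < 1:
        raise ValueError("days must be positive")
    return json.dumps({"expires": days * DAY_MS})


def set_expiry_policy_argv(name: str, pattern: str, days: int) -> list:
    """rabbitmqctl command applying the expiry policy to queues matching `pattern`."""
    return [
        "rabbitmqctl", "set_policy",
        "--apply-to", "queues",
        name, pattern, expiry_policy_definition(days),
    ]


print(set_expiry_policy_argv("expire-unused", r"^tmp\.", 28))

# The other two options are set when the queue is declared, e.g. with pika:
#   channel.queue_declare(queue="", exclusive=True)           # server-named, deleted with its connection
#   channel.queue_declare(queue="jobs.tmp", auto_delete=True)  # deleted when its last consumer goes away
```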

12. Do not set up too many levels of Priority Queues

When a queue is declared as a priority queue, each message published to it can carry a priority, and messages with a higher priority are delivered first. Our own database backups are one example use case. We run backups every day, so thousands of backup events are added to RabbitMQ in no particular order. A customer can also trigger a backup on demand, and when that happens the new backup event is added to the queue with a higher priority. However, each priority level uses an internal queue on the Erlang VM, which takes up resources. In most use cases, no more than 5 priority levels are needed.
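A classic queue becomes a priority queue through the `x-max-priority` argument at declaration time, and RabbitMQ accepts values from 1 to 255. The Python sketch below builds these arguments and warns when more than the 5 levels suggested above are requested. The helper name and the two-level backup setup are hypothetical.

```python
import warnings

RECOMMENDED_MAX_LEVELS = 5  # the checklist's rule of thumb


def priority_queue_arguments(levels: int) -> dict:
    """Queue arguments for a classic priority queue with `levels` priority levels."""
    if not 1 <= levels <= 255:
        raise ValueError("x-max-priority must be between 1 and 255")
    if levels > RECOMMENDED_MAX_LEVELS:
        warnings.warn(
            f"{levels} priority levels requested; each level uses an internal queue",
            stacklevel=2,
        )
    return {"x-max-priority": levels}


# Hypothetical backup queue with two levels: scheduled backups are published
# with priority 0, on-demand backups with priority 1. With pika, roughly:
#   channel.queue_declare(queue="backups", durable=True, arguments=BACKUP_QUEUE_ARGS)
#   channel.basic_publish(exchange="", routing_key="backups", body=event,
#                         properties=pika.BasicProperties(priority=1))
BACKUP_QUEUE_ARGS = priority_queue_arguments(2)
```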

13. Always configure channel prefetch

A typical mistake is to leave the prefetch unlimited, so that one client receives all the messages. That client can run out of memory and crash, and then all of its messages are redelivered. Tuning the prefetch count correctly speeds up the system without requiring a larger cluster or many changes to your client code.
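Prefetch is set per channel before consuming. With pika this is `basic_qos(prefetch_count=...)`, and a prefetch count of 0 means unlimited. The following minimal sketch assumes a pika-style channel. The helper name is hypothetical, and the default of 50 is an arbitrary starting point rather than a recommendation, so tune it for your workload.

```python
def start_consumer(channel, queue: str, on_message, prefetch: int = 50) -> str:
    """Start consuming with a bounded prefetch and manual acknowledgements.

    The broker delivers at most `prefetch` unacknowledged messages to this
    consumer at a time, instead of pushing the whole queue to one client.
    Returns the consumer tag.
    """
    if prefetch < 1:
        raise ValueError("prefetch must be at least 1; 0 would mean unlimited")
    channel.basic_qos(prefetch_count=prefetch)
    return channel.basic_consume(queue=queue, on_message_callback=on_message, auto_ack=False)
```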

14. Use limits in RabbitMQ

Introducing limits on some common resource leaks makes RabbitMQ instances more stable. For example, the following can be limited:

  • Max number of queues per vhost
  • Max number of connections per vhost/user
  • Max number of channels per user
  • Max number of channels per connection. (“channel_max”)
  • Total max connections (“max_connections”)
  • Maximum message size (“max_message_size”)
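On a self-managed node, per-vhost and per-user limits can be set with `rabbitmqctl set_vhost_limits` and `rabbitmqctl set_user_limits` (user limits need a recent RabbitMQ version). The Python sketch below only builds those commands; the helper names and the example numbers are hypothetical, not recommended values. The remaining items in the list above are node configuration settings and are not covered here.

```python
import json


def vhost_limits_argv(vhost: str, max_connections: int, max_queues: int) -> list:
    """rabbitmqctl command that caps connections and queues in one vhost."""
    limits = {"max-connections": max_connections, "max-queues": max_queues}
    return ["rabbitmqctl", "set_vhost_limits", "-p", vhost, json.dumps(limits)]


def user_limits_argv(user: str, max_connections: int, max_channels: int) -> list:
    """rabbitmqctl command that caps one user's connections and channels."""
    limits = {"max-connections": max_connections, "max-channels": max_channels}
    return ["rabbitmqctl", "set_user_limits", user, json.dumps(limits)]


# Example values only; run with subprocess.run(argv, check=True) where rabbitmqctl is available.
print(vhost_limits_argv("/", 500, 2000))
print(user_limits_argv("app-user", 20, 200))
```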

Limits are a broad topic, so we are currently writing a dedicated blog post about them. Coming soon!

15. Don’t open and close connections or channels repeatedly

Just like opening and closing the refrigerator door repeatedly, opening and closing connections and channels in RabbitMQ wastes energy! Avoid it: every open and close requires more TCP packets to be sent and received, which increases latency.

16. The use of acknowledgements and confirms

Messages in transit might get lost in the event of a connection failure and then need to be retransmitted. Acknowledgments let the server and clients know when to retransmit messages. The client can ack a message either when it receives it or after it has completely processed it. Acknowledgments have a performance impact, so for the fastest possible throughput, manual acks should be disabled.

A consuming application that receives important messages should not acknowledge them until it has finished processing them, so that unprocessed messages (due to worker crashes, exceptions, etc.) don't go missing. Publisher confirms apply the same concept to publishing: the server acks when it has received a message from a publisher. Publisher confirms also have a performance impact; keep in mind, however, that they are required if the publisher needs at-least-once processing of messages.
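The Python sketch below shows both sides with pika-style channels; the helper names are hypothetical. The consumer callback acks only after processing succeeds and nacks with requeue if processing raises. Unacked messages from a worker that crashes outright are redelivered by the broker when the channel closes. On the publishing side, pika's `BlockingChannel.confirm_delivery()` turns on publisher confirms, after which a publish the broker rejects raises an exception instead of failing silently.

```python
import logging

logger = logging.getLogger(__name__)


def ack_after_processing(process, requeue_on_error: bool = True):
    """Wrap `process(body)` in an on_message callback that uses manual acks."""

    def on_message(channel, method, properties, body):
        try:
            process(body)
        except Exception:
            logger.exception("processing failed; nacking delivery %s", method.delivery_tag)
            # Requeue so the message is not lost; a message that always fails
            # would be redelivered repeatedly, so consider capping retries.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=requeue_on_error)
        else:
            channel.basic_ack(delivery_tag=method.delivery_tag)  # ack only when done

    return on_message


def enable_publisher_confirms(channel) -> None:
    """Ask the broker to confirm every message published on this channel."""
    channel.confirm_delivery()
```

The callback plugs into `basic_consume(..., on_message_callback=..., auto_ack=False)`. With `auto_ack=True` the broker treats messages as acknowledged on delivery, and the acks in the callback would be wrong.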

17. CloudAMQP specific: Set up a firewall

A firewall lets you restrict access to your cluster so that only your servers will have access. It is possible to set up firewalls directly in the CloudAMQP console.

18. CloudAMQP Specific: Configuration tweaks

CloudAMQP configures your RabbitMQ cluster automatically, but some settings can be tweaked in the CloudAMQP Configuration view according to your needs. If you want to change settings that are not listed below, email support@cloudamqp.com and we can help you out. The settings available in the Configuration view are:

  • Return metrics per object instead of an aggregated value for Prometheus.
  • Server heartbeat value.
  • Maximum number of channels per connection.
  • Consumer timeout value if consumer is not acknowledging messages.
  • Memory high watermark.
  • Size in bytes below which to embed messages in the queue index.

19. CloudAMQP Specific: Set up alarms

CloudAMQP configures some recommended alarms by default, but it’s recommended to set up queue, connections, consumer, and channel alarms according to your use case.

Queue alarms: Queue alarms can be triggered to send notifications when the number of messages in a queue reaches a certain threshold for a given amount of time.

Metric alarms: It is possible to receive accurate alerts based on performance anomalies in your application by activating different metric alarms. Make sure that you activate alarms for CPU, disk space, and memory.

Connection and channel alarms: Connection leaks can cause RabbitMQ to run out of memory, so make sure your clients are not leaking connections. More than 10 connections from the same host may indicate a connection leak, although this is normal if you deploy many clients on the same IP.
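To spot a possible leak yourself, you can count connections per client host from the RabbitMQ management HTTP API, where `GET /api/connections` returns one object per connection, including its `peer_host`. The Python sketch below assumes you supply the management base URL and credentials. The helper names are hypothetical, and the threshold of 10 comes from the rule of thumb above. Treat a flagged host as a hint to investigate, since many clients behind one IP look the same.

```python
import base64
import json
from collections import Counter
from urllib.request import Request, urlopen

LEAK_THRESHOLD = 10  # "more than 10 connections from the same host"


def fetch_connections(api_base: str, user: str, password: str) -> list:
    """GET /api/connections from the management HTTP API using basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = Request(
        f"{api_base.rstrip('/')}/api/connections",
        headers={"Authorization": f"Basic {token}"},
    )
    with urlopen(request, timeout=10) as response:
        return json.load(response)


def suspicious_hosts(connections: list, threshold: int = LEAK_THRESHOLD) -> dict:
    """Map client host to connection count, for hosts above the threshold."""
    counts = Counter(conn.get("peer_host", "unknown") for conn in connections)
    return {host: n for host, n in counts.items() if n > threshold}
```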

Consumer alarms: Consumer alarms can be triggered to send notifications when the number of consumers for a queue is less than or equal to a given number of consumers, for a given amount of time.

20. CloudAMQP Specific: Export metrics and logs

You can export your CloudAMQP metrics and logs to other monitoring tools such as CloudWatch, Datadog, New Relic, Stackdriver, Librato, Papertrail, Loggly, and Splunk.

Flawless RabbitMQ instance

We hope these hints, tips, and tricks help you set up a flawless RabbitMQ instance! If you need more support, please get in touch with our friendly team. Did we leave something out? Send your thoughts and comments by email to contact@cloudamqp.com.