We have been working with RabbitMQ for a long time, and we have probably seen more configuration mistakes than anybody else. We know how to configure for optimal performance and how to get the most stable cluster. In this series we share that knowledge. Please read part 1 for general best practices and dos and don’ts for RabbitMQ.
Make sure your queues stay short
To get optimal performance, make sure your queues stay as short as possible at all times. Longer queues impose more processing overhead. We recommend keeping the number of messages in each queue close to 0.
Use the queue type Quorum Queues
Quorum Queues are replicated queues that provide high availability and data safety. CloudAMQP recommends the use of Quorum Queues.
The reasons to switch to Quorum Queues, and the design flaws of classic mirrored queues, are described in the article.
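As an illustration, a queue is made a quorum queue at declaration time by passing the `x-queue-type` argument. The sketch below assumes a pika-style channel object (for example one obtained from `pika.BlockingConnection(...).channel()`); the helper name `declare_quorum_queue` is hypothetical and only wraps the client's `queue_declare` call.

```python
def declare_quorum_queue(channel, name):
    """Declare a quorum queue on a pika-style channel.

    `channel` is assumed to expose queue_declare(queue=..., durable=...,
    arguments=...), as pika's channel does.
    """
    return channel.queue_declare(
        queue=name,
        durable=True,  # quorum queues must be declared durable
        arguments={"x-queue-type": "quorum"},
    )
```

The queue type cannot be changed after a queue has been declared, so an existing classic queue has to be replaced by a new queue rather than redeclared.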
Enable lazy queues
A feature called lazy queues was added in RabbitMQ 3.6. Lazy queues store messages on disk automatically and only load them into memory when they are needed. Because the messages go straight to disk, RAM usage is minimized, but throughput time will be longer.
We have seen that lazy queues create a more stable cluster with more predictable performance, which increases the availability of the server. Your messages will not be flushed to disk without warning, so you will not suddenly take a performance hit. If you are sending a lot of messages at once (e.g. processing batch jobs), or if you think that your consumers will not always keep up with the speed of the publishers, we recommend that you enable lazy queues.
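One way to enable lazy mode is per queue, at declaration time, with the `x-queue-mode` argument. The following is a minimal sketch assuming a pika-style channel; the helper name `declare_lazy_queue` is hypothetical.

```python
def declare_lazy_queue(channel, name):
    """Declare a classic queue in lazy mode on a pika-style channel."""
    return channel.queue_declare(
        queue=name,
        durable=True,  # durability is independent of lazy mode; kept on here
        arguments={"x-queue-mode": "lazy"},  # messages go straight to disk
    )
```

Lazy mode can also be applied to existing queues through a policy, which avoids changing the application code that declares them.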
Cluster setup (RabbitMQ HA with 2 or more nodes)
We refer to the collection of nodes as a cluster.
Availability can be enhanced if clients can find a replica of data even in the presence of failures, that is, if they can still access the cluster when one of its nodes goes down.
CloudAMQP gives you the option to set up 2 or 3 node clusters. We place each node in a different availability zone (AWS), and queues are automatically mirrored (replicated, HA) between availability zones.
When a node fails, we have a mechanism to automatically fail over to the other nodes in the cluster. We have added a load balancer in front of the RabbitMQ instances, which makes the distribution of brokers transparent to the message publishers.
Maximum failover time in CloudAMQP is 60s (the endpoint health is measured every 30s, and the DNS TTL is set to 30s).
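The 60-second figure is the sum of the 30-second health-check interval and the 30-second DNS TTL. A client that wants to ride out a failover can keep retrying its connection for somewhat longer than that. The sketch below is one possible approach, not a CloudAMQP or client-library feature: `connect_with_retry`, its parameters, and the 90-second default are hypothetical choices. `connect` is any zero-argument callable that opens a connection, for example a function wrapping your AMQP client's connect call.

```python
import time


def connect_with_retry(connect, max_wait=90.0, delay=1.0, max_delay=10.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call connect() until it succeeds or max_wait seconds have elapsed.

    The wait between attempts doubles after each failure, capped at max_delay.
    The last exception is re-raised once the next wait would pass the deadline.
    """
    deadline = clock() + max_wait
    while True:
        try:
            return connect()
        except Exception:  # in real code, narrow this to the client's connection errors
            if clock() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

A default of 90 seconds leaves some margin over the stated 60-second maximum failover time; the right value depends on how long your application can tolerate being disconnected.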
Message queues are by default located on a single node, but they are visible and reachable from all nodes. To replicate queues across nodes in a cluster, see the documentation on high availability (HA).
Use persistent messages and durable queues
If you cannot afford to lose any messages, make sure that your queue is declared as “durable” and your messages are sent with delivery mode “persistent”.
In order to avoid losing messages in the broker, you need to be prepared for broker restarts, broker hardware failure, or broker crashes. To ensure that messages and broker definitions survive restarts, we need to ensure that they are on disk. Messages, exchanges, and queues that are not durable and persistent will be lost during a broker restart.
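Both settings are made by the client: durability when the queue is declared, persistence on each published message. The sketch below assumes a pika-style channel. The helper names are hypothetical, and `make_properties` stands in for the client's message-properties class (with pika you would pass `pika.BasicProperties`), so that the example itself does not depend on a specific library being installed.

```python
PERSISTENT_DELIVERY_MODE = 2  # AMQP 0-9-1: 1 = transient, 2 = persistent


def declare_durable_queue(channel, name):
    """Declare a queue whose definition survives a broker restart."""
    return channel.queue_declare(queue=name, durable=True)


def publish_persistent(channel, queue_name, body, make_properties):
    """Publish `body` to `queue_name` via the default exchange as a persistent message.

    `make_properties` is a callable accepting delivery_mode=..., for example
    pika.BasicProperties.
    """
    properties = make_properties(delivery_mode=PERSISTENT_DELIVERY_MODE)
    channel.basic_publish(
        exchange="",             # default exchange routes by queue name
        routing_key=queue_name,
        body=body,
        properties=properties,
    )
```

Both halves are needed: a persistent message in a non-durable queue is still lost on restart, because the queue itself disappears.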
Federation between clouds
We do not recommend clustering between clouds or regions, and therefore no plan spreads nodes across regions or datacenters. If a whole cloud region goes down, your CloudAMQP cluster will also go down, but this is not something that we have ever experienced. Cluster nodes are spread across availability zones within the same region.
You can protect the setup against a region-wide outage by setting up two clusters in different regions and using federation between them. Federation is one of the ways by which a software system can benefit from having multiple RabbitMQ brokers distributed on different machines. More information about federation can be found here: https://www.cloudamqp.com/docs/federation.html
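As a rough illustration of what a federation setup involves, the downstream cluster needs an upstream definition (pointing at the other region) and a policy that selects which exchanges to federate. Both can be created through the RabbitMQ management HTTP API with `PUT` requests. The sketch below only builds the request paths and JSON bodies; the helper names and the example upstream name, URI, and pattern are hypothetical, and it assumes the federation plugin is enabled on the downstream cluster.

```python
import json
from urllib.parse import quote


def federation_upstream_request(vhost, upstream_name, upstream_uri):
    """Return (path, json_body) for a PUT that defines a federation upstream."""
    path = "/api/parameters/federation-upstream/{}/{}".format(
        quote(vhost, safe=""), quote(upstream_name, safe="")
    )
    body = {"value": {"uri": upstream_uri}}
    return path, json.dumps(body)


def federation_policy_request(vhost, policy_name, pattern):
    """Return (path, json_body) for a PUT that federates matching exchanges."""
    path = "/api/policies/{}/{}".format(
        quote(vhost, safe=""), quote(policy_name, safe="")
    )
    body = {
        "pattern": pattern,  # regular expression matched against exchange names
        "definition": {"federation-upstream-set": "all"},
        "apply-to": "exchanges",
    }
    return path, json.dumps(body)
```

For example, `federation_upstream_request("/", "region-a", "amqps://user:password@host-in-region-a/vhost")` yields the path `/api/parameters/federation-upstream/%2F/region-a`; the default vhost `/` has to be percent-encoded in the URL.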
Do not enable HiPE
HiPE increases server throughput at the cost of increased startup time. When you enable HiPE, RabbitMQ is compiled at startup, which adds about 1-3 minutes to the startup time. This might affect uptime during a server restart.
Do not set RabbitMQ Management statistics rate mode to detailed in production
Setting RabbitMQ Management statistics rate mode to detailed has a serious performance impact and should not be used in production.
Limit the use of priority queues
Each priority level uses an internal queue on the Erlang VM, which takes up some resources. In most use cases it is sufficient to have no more than 5 priority levels.
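The number of priority levels is fixed when the queue is declared, through the `x-max-priority` argument. The sketch below assumes a pika-style channel; the helper name and the constant are hypothetical, and reading the recommendation as a maximum priority of 5 is one interpretation of the guideline above.

```python
RECOMMENDED_MAX_PRIORITY = 5  # upper bound suggested by the guideline above


def declare_priority_queue(channel, name, max_priority=RECOMMENDED_MAX_PRIORITY):
    """Declare a priority queue with a small, explicit number of levels."""
    if not 1 <= max_priority <= 255:
        # RabbitMQ accepts values from 1 to 255 for x-max-priority
        raise ValueError("x-max-priority must be between 1 and 255")
    return channel.queue_declare(
        queue=name,
        durable=True,
        arguments={"x-max-priority": max_priority},
    )
```

Publishers then set a `priority` property on each message; values above the queue's maximum are treated as the maximum.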