Part 3: RabbitMQ Best Practice for High Availability

Many variables feed into the overall level of availability for your RabbitMQ setup. In Part 3 of RabbitMQ Best Practice, we talk about recommended setup and configuration options for maximum uptime. We will mention standard settings, changes, and plugins that can be used to achieve better availability.

We have been working with RabbitMQ for a long time, and we have probably seen way more configuration mistakes than anybody else. We know how to configure for optimal performance and how to get the most stable cluster. In this series, we share that knowledge! Please read Part 1 for general best practices and dos and don’ts for RabbitMQ.

Make sure your queues stay short

To get optimal performance, make sure your queues stay as short as possible at all times. Longer queues impose more processing overhead. We recommend that the queue length always stays close to 0 for optimal performance.
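
One way to keep an eye on queue length from application code is a passive declare, which reports the number of ready messages without creating or changing the queue. The following is a minimal sketch, assuming a pika-style channel object; the helper names and the threshold of 1000 are hypothetical and only serve as an illustration.

```python
def queue_depth(channel, queue_name):
    """Return the number of ready messages in an existing queue."""
    # passive=True only inspects the queue; it never creates or modifies it.
    result = channel.queue_declare(queue=queue_name, passive=True)
    return result.method.message_count


def is_backlogged(depth, threshold=1000):
    """Return True when the queue holds more ready messages than the threshold.

    The default threshold is an arbitrary, hypothetical value.
    """
    return depth > threshold
```

With pika, the channel could come from pika.BlockingConnection(...).channel(). A check such as is_backlogged(queue_depth(channel, "jobs")) could then feed an alert or trigger extra consumers.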

Use the queue type Quorum Queues

Quorum Queues are replicated queues that provide high availability and data safety. CloudAMQP recommends the use of Quorum Queues.

The reasons you should switch to Quorum Queues, and the design flaws of classic mirrored queues, are described in the article.
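
The queue type is chosen when the queue is declared, through the x-queue-type argument. Below is a minimal sketch, assuming a pika-style channel; the helper name is hypothetical.

```python
def declare_quorum_queue(channel, queue_name):
    """Declare a durable queue of type 'quorum' on the given channel."""
    return channel.queue_declare(
        queue=queue_name,
        durable=True,  # quorum queues must be declared durable
        arguments={"x-queue-type": "quorum"},
    )
```

Note that the type of an existing queue cannot be changed by redeclaring it, so a switch from a classic queue usually means declaring a new queue and moving publishers and consumers over.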

Enable lazy queues

Update: Starting with RabbitMQ version 3.12, the queue mode is disregarded, as classic queues now behave similarly to lazy queues.

In RabbitMQ 3.6 and later, a feature called lazy queues was added. Lazy queues are queues where the messages are automatically stored to disk. Messages are only loaded into memory when they are needed. With lazy queues, the messages go straight to disk and RAM usage is thereby minimized, but the throughput time will be longer.

We have seen that lazy queues create a more stable cluster with predictable performance, which increases the availability of the server. Your messages will not get flushed to disk without warning, and you will not suddenly be hit by a drop in performance. If you are sending a lot of messages at once (e.g. processing batch jobs), or if you think that your consumers will not always keep up with the speed of the publishers, we recommend that you enable lazy queues.
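
On versions before 3.12, one way to enable lazy mode is the x-queue-mode argument at declaration time (it can also be set through a policy). Below is a minimal sketch, assuming a pika-style channel; the helper name is hypothetical, and declaring the queue as durable is an assumption of this example rather than a requirement of lazy mode.

```python
def declare_lazy_queue(channel, queue_name):
    """Declare a classic queue that writes messages to disk as early as possible."""
    return channel.queue_declare(
        queue=queue_name,
        durable=True,  # assumption of this sketch, not required for lazy mode
        arguments={"x-queue-mode": "lazy"},
    )
```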

Cluster setup (RabbitMQ HA with 3 or more nodes)

We refer to the collection of nodes as a cluster.

Availability can be enhanced if clients can find a replica of data, even in the presence of failures. This means that the cluster remains accessible even if one of its nodes goes down.

CloudAMQP gives you the option to set up 3 or 5 node clusters. We have located each node in a different availability zone (AWS), and queues are automatically mirrored (replicated for HA) between availability zones.

If a node fails, the client will automatically reconnect to another node in the cluster.
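
Where failover has to be handled in the client itself, the idea can be sketched as a loop over the known node addresses. This is only an illustration under stated assumptions: the helper name, the host names, and the connect callable are hypothetical, and catching every Exception by default is a deliberately broad choice that real code would narrow to the client library's connection errors.

```python
def connect_with_failover(hosts, connect, retry_on=(Exception,)):
    """Try each host in order and return the first connection that opens."""
    hosts = list(hosts)
    last_error = None
    for host in hosts:
        try:
            return connect(host)
        except retry_on as error:
            last_error = error  # remember the failure and try the next node
    raise ConnectionError(f"no reachable node among {hosts!r}") from last_error
```

With pika, connect could be lambda host: pika.BlockingConnection(pika.ConnectionParameters(host=host)). The same loop can be run again whenever an open connection is lost.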

You can read more about setup options on the different number of nodes in your cluster here and about RabbitMQ clustering here.

Message queues are by default located on a single node, but they are visible and reachable from all nodes. To replicate queues across nodes in a cluster, see the documentation on high availability (HA).

Use persistent messages and durable queues

If you cannot afford to lose any messages, make sure that your queue is declared as “durable” and your messages are sent with delivery mode "persistent".

In order to avoid losing messages in the broker, you need to be prepared for broker restarts, broker hardware failure, or broker crashes. To ensure that messages and broker definitions survive restarts, we need to ensure that they are on disk. Messages, exchanges, and queues that are not durable and persistent will be lost during a broker restart.
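
Durability is set on the queue when it is declared, while persistence is set on each message when it is published. Below is a minimal sketch of both halves, assuming the pika client library and a pika-style channel; the helper names are hypothetical.

```python
PERSISTENT = 2  # AMQP 0-9-1 delivery mode: 1 = transient, 2 = persistent


def declare_durable_queue(channel, queue_name):
    """Declare a queue whose definition survives a broker restart."""
    return channel.queue_declare(queue=queue_name, durable=True)


def publish_persistent(channel, queue_name, body):
    """Publish one message through the default exchange, marked persistent."""
    # Imported here so the module still loads where pika is not installed.
    import pika

    channel.basic_publish(
        exchange="",  # default exchange routes by queue name
        routing_key=queue_name,
        body=body,
        properties=pika.BasicProperties(delivery_mode=PERSISTENT),
    )
```

Both settings are needed: a persistent message in a non-durable queue is still lost on restart, because the queue itself disappears.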

Federation between clouds

We do not recommend clustering between clouds or regions, and therefore no plan spreads nodes across regions or data centers. If a whole cloud region goes down, your CloudAMQP cluster will also go down, but that is not something we have ever experienced. Cluster nodes are spread across availability zones within the same region.

You can protect the setup against a region-wide outage by setting up two clusters in different regions and using federation between them. Federation is one of the ways by which a software system can benefit from having multiple RabbitMQ brokers distributed on different machines. More information about federation can be found here: Federation - Migration, Exchange and Queue Federation
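
As a rough illustration, a federation link is defined on the downstream cluster by an upstream parameter plus a policy that selects which exchanges to federate. The sketch below only builds the two management HTTP API requests and does not send them; the helper name, the upstream name, the URI, and the pattern are hypothetical, and it assumes the federation and management plugins are enabled.

```python
import json
from urllib.parse import quote


def federation_requests(upstream_name, upstream_uri, pattern, vhost="/"):
    """Return (method, path, json_body) tuples for the management HTTP API."""
    encoded_vhost = quote(vhost, safe="")  # the default vhost "/" becomes %2F
    upstream = (
        "PUT",
        f"/api/parameters/federation-upstream/{encoded_vhost}/{upstream_name}",
        json.dumps({"value": {"uri": upstream_uri}}),
    )
    policy = (
        "PUT",
        f"/api/policies/{encoded_vhost}/federate-{upstream_name}",
        json.dumps({
            "pattern": pattern,  # regex matched against exchange names
            "definition": {"federation-upstream-set": "all"},
            "apply-to": "exchanges",
        }),
    )
    return [upstream, policy]
```

For example, federation_requests("eu-west", "amqps://user:pass@other-region.example.com", "^federated\\.") yields one request that registers the other region as an upstream and one that federates every exchange whose name starts with "federated.".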

Do not enable HiPE

HiPE will increase server throughput at the cost of increased startup time. When you enable HiPE, RabbitMQ is compiled at startup, which increases the startup time quite a lot, by 1-3 minutes. This might affect uptime during a server restart.

Do not set RabbitMQ Management statistics rate mode to detailed in production

Setting RabbitMQ Management statistics rate mode to detailed has a serious performance impact and should not be used in production.

Limit the use of priority queues

Each priority level uses an internal queue on the Erlang VM, which takes up some resources. In most use cases it is sufficient to have no more than 5 priority levels.
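
The number of priority levels is fixed when the queue is declared, through the x-max-priority argument. Below is a minimal sketch, assuming a pika-style channel; the helper name is hypothetical, the default of 5 follows the recommendation above, and declaring the queue as durable is an assumption of this example.

```python
def declare_priority_queue(channel, queue_name, max_priority=5):
    """Declare a queue that supports priorities from 0 up to max_priority."""
    if not 1 <= max_priority <= 255:
        raise ValueError("max_priority must be between 1 and 255")
    return channel.queue_declare(
        queue=queue_name,
        durable=True,  # assumption of this sketch
        arguments={"x-max-priority": max_priority},
    )
```

With pika, a publisher would then set the priority per message, for example with pika.BasicProperties(priority=3).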