This is a living document that is continually updated. Last updated September 2017.
Over the years we have seen various reasons for crashed or unresponsive CloudAMQP/RabbitMQ servers. Too many queued messages, a high message rate sustained over a long time, and frequently opened and closed connections have been the most common reasons for high CPU usage. To protect against this, we have made tools available that help address performance issues promptly and automatically before they impact your business.
First of all, we recommend that all users on dedicated plans activate CPU and memory alarms through the control panel. When the CPU alarm is enabled, you will receive a notification when you are using more than 80% of the available CPU. When the memory alarm is enabled, you will receive a notification when you are using more than 90% of the available memory. Read more about our alarms in the monitoring pages.
General tips and tools to keep in mind once you start troubleshooting your RabbitMQ server.
Server Metrics will show performance metrics from your server. CloudAMQP shows CPU Usage Monitoring and Memory Usage Monitoring. Do you see any CPU or memory spikes? Do you notice changes in the graph at a specific time? Did you make any changes to your code around that time?
RabbitMQ Management interface
If you are able to open the RabbitMQ management interface, do that and check that everything looks normal, from the number of queues to the number of channels to the messages in your queues.
The event stream allows you to see the latest 1000 events from your RabbitMQ cluster. New events are added to the collection in real time. The event stream can be useful when you need to gain insight into what is happening in your cluster. It is particularly useful for debugging high CPU usage, for instance when connections or channels are rapidly opened and closed, or when a shovel is set up incorrectly.
RabbitMQ Log stream
RabbitMQ Log Stream shows a live log from RabbitMQ. Do you see new, unknown errors in your log?
Most common errors/mistakes
Here follows a list of the most common errors/mistakes we have seen on misbehaving servers:
1. Too many queued messages
RabbitMQ's queues are fastest when they're empty. If the (default) queue receives messages at a faster rate than the consumers can handle, the queue will start growing and eventually get slower. By default, queues keep an in-memory cache of messages that is filled as messages are published into RabbitMQ. The idea of this cache is to be able to deliver messages to consumers as fast as possible. As the queue grows, it will require more memory. High RAM usage could indicate that the number of queued messages went up rapidly. When this happens, RabbitMQ will start flushing (paging out) messages to disk in order to free up RAM, and queueing speeds will deteriorate. This paging process can take some time and will block the queue from processing messages. When there is a lot to page out, it can take considerable time and will affect the performance of the broker. The CPU cost per message will be much higher than when the queue was empty (the messages now have to be queued up). CPU I/O Wait will probably get high, which indicates the percentage of time the CPU has to wait on the disk.
For optimal performance, queue lengths should always stay close to 0.
Are you using lazy queues? CloudAMQP uses lazy queues by default (available from RabbitMQ v3.6) instead of RabbitMQ's default queues. If you are sending a lot of messages at once (e.g. processing batch jobs) or know that you need to queue up a lot of messages, we recommend that you enable lazy queues. Lazy queues are queues where the messages are automatically stored to disk and are only loaded into memory when they are needed. With lazy queues, the messages go straight to disk and RAM usage is thereby minimized. A potential drawback with lazy queues is that the I/O to the disk will increase. Read more about lazy queues here.
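As a minimal sketch of how a client can ask for a lazy queue explicitly, the queue can be declared with the x-queue-mode argument set to lazy. The sketch assumes the pika Python client (any client exposing an equivalent queue_declare call would do) and a hypothetical queue name; it is an illustration, not CloudAMQP's own tooling.

```python
# Queue argument that switches a queue to lazy mode (RabbitMQ 3.6+).
LAZY_QUEUE_ARGS = {"x-queue-mode": "lazy"}


def declare_lazy_queue(channel, name):
    """Declare a durable lazy queue on an already-open channel.

    `channel` is assumed to be a pika channel (or any object with the same
    queue_declare signature); `name` is a hypothetical queue name.
    """
    return channel.queue_declare(
        queue=name,
        durable=True,
        arguments=dict(LAZY_QUEUE_ARGS),
    )


# Usage (assumes pika is installed and a broker is reachable):
#   import pika
#   connection = pika.BlockingConnection(pika.URLParameters(amqp_url))
#   declare_lazy_queue(connection.channel(), "batch_jobs")
```

Note that redeclaring an existing queue with different arguments is rejected by the broker, so for queues that already exist a policy is usually the more practical way to change the mode.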
Are your consumers connected? First of all, check whether your consumers are connected. A consumer alarm can be activated from the console; it sends notifications when the number of consumers for a queue is less than or equal to a given number.
Are you running an old version of RabbitMQ? If you are running RabbitMQ version 3.4 or lower, upgrade. With RabbitMQ 3.5 and newer you can basically queue until the disk runs out, unlike in 3.4 where the message index had to be kept in memory at all times. You can upgrade the version from the console (next to where you reboot/restart the instance).
We recommend that you enable the queue length alarm from the console or, if possible, set a max-length or a message TTL on your queues.
If you already have too many messages in the queue, you need to start consuming more messages or get your queues drained. If you are able to turn off your publishers, do that until you have managed to consume the messages from the queue.
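If you prefer to check consumers and backlog by hand, the same idea as the consumer alarm can be sketched over the queue list returned by the RabbitMQ management HTTP API (/api/queues). The function name and the thresholds are illustrative assumptions; only the name, messages and consumers fields of each queue object are read.

```python
def queues_needing_attention(queues, max_messages=1000, min_consumers=1):
    """Return (queue name, reason) pairs for queues that look unhealthy.

    `queues` is a list of dicts shaped like the management API's
    /api/queues response. The default thresholds are illustrative only.
    """
    findings = []
    for queue in queues:
        name = queue.get("name", "<unnamed>")
        # Fewer consumers than expected: messages will pile up.
        if queue.get("consumers", 0) < min_consumers:
            findings.append((name, "too few consumers"))
        # Backlog larger than what we consider normal for this system.
        if queue.get("messages", 0) > max_messages:
            findings.append((name, "backlog above threshold"))
    return findings
```

Fed with the parsed JSON from /api/queues, this returns an empty list when every queue has at least one consumer and a backlog below the chosen limit.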
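A length limit and a message TTL can also be set per queue at declaration time through the x-max-length and x-message-ttl queue arguments. The helper below is a hypothetical sketch that only builds the arguments dictionary; the values in the usage comment are examples, not recommendations.

```python
def bounded_queue_args(max_length=None, message_ttl_ms=None):
    """Build queue arguments that cap queue length and message age.

    x-max-length: maximum number of messages kept in the queue; by default
    the oldest messages are dropped (or dead-lettered) once it is reached.
    x-message-ttl: time in milliseconds a message may stay in the queue.
    """
    args = {}
    if max_length is not None:
        if max_length < 0:
            raise ValueError("max_length must not be negative")
        args["x-max-length"] = int(max_length)
    if message_ttl_ms is not None:
        if message_ttl_ms < 0:
            raise ValueError("message_ttl_ms must not be negative")
        args["x-message-ttl"] = int(message_ttl_ms)
    return args


# Usage with a client such as pika (assumed), on an open channel:
#   channel.queue_declare(queue="events", durable=True,
#                         arguments=bounded_queue_args(10000, 60000))
```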
Too many unacknowledged messages.
All unacknowledged messages have to reside in RAM on the servers. If you have too many unacknowledged messages you will run out of RAM. An efficient way to limit unacknowledged messages is to limit how many messages your clients prefetch. Read more about consumer prefetch here.
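A minimal sketch of limiting prefetch, assuming the pika Python client: basic_qos caps how many unacknowledged messages the broker will push to this consumer, and each message is acknowledged only after it has been handled. The function name, the handle callback and the default of 10 are illustrative assumptions.

```python
def start_consumer(channel, queue, handle, prefetch_count=10):
    """Register a consumer that holds at most `prefetch_count` unacked messages.

    `channel` is assumed to be an open pika channel and `handle` a callable
    that processes one message body.
    """
    # Limit the number of unacknowledged deliveries for this consumer.
    channel.basic_qos(prefetch_count=prefetch_count)

    def on_message(ch, method, properties, body):
        handle(body)
        # Acknowledge only after successful handling, freeing a prefetch slot.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=queue, on_message_callback=on_message, auto_ack=False)
    return on_message


# Usage (assumes pika and a reachable broker):
#   start_consumer(channel, "tasks", process_task, prefetch_count=20)
#   channel.start_consuming()
```

A good prefetch value depends on how long each message takes to process, so it is worth measuring rather than guessing.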
2. Too high message throughput
CPU User time indicates the percentage of time your program spends executing instructions in the CPU - in this case, the time the CPU spent running RabbitMQ. If CPU User time is high, it could be due to high message throughput.
Dedicated plans are not artificially limited in any way; the maximum performance is determined by the underlying instance type. Every plan has a given Max msgs/s, which is the approximate burst speed we have measured during benchmark testing. Burst speed means that you can temporarily enjoy the highest speed, but after a certain amount of time or once the quota has been exceeded, you might automatically be knocked down to a slower speed.
Most of our benchmark tests are done with the RabbitMQ PerfTest tool with one queue, one publisher and one consumer. Durable queues/exchanges and transient messages with the default message size have been used. There are a lot of other factors that play in too, such as the type of routing, whether you ack or auto-ack, the datacenter, and whether you publish with confirmation or not. If you, for example, are publishing large persistent messages, it will result in a lot of I/O wait time for the CPU. Read more about load testing and performance measurements in RabbitMQ here. You need to run your own tests to verify the performance on a given plan.
Decrease the average load or upgrade/migrate
When you have too high message throughput you should either try to decrease the average load, or upgrade/migrate to a larger plan. If you want to rescale your cluster, go to the CloudAMQP Control Panel, and choose edit for the instance you want to reconfigure. Here you have the ability to add or remove nodes, and change plan altogether. More about migration can be found here.
Celery sends a lot of "unnecessary" messages because of its gossip, mingle and events features. You can disable them by adding --without-gossip --without-mingle --without-heartbeat to the Celery worker command line arguments and adding CELERY_SEND_EVENTS = False to your settings.py. Take a look at our Celery documentation for up-to-date information about Celery settings.
3. Too many queues
RabbitMQ can handle a lot of queues, but each queue will, of course, require some resources. CloudAMQP plans are not artificially limited in any way; the maximum performance is determined by the underlying instance type, and so is the number of queues. Too many queues will also use a lot of resources in the stats/mgmt db. If the queues suddenly start to pile up, you might have a queue leak. If you can't find the leak, you can add a queue-ttl policy. The cluster will then clean itself up: "dead" queues that are no longer used will automatically be removed. You can read more about TTL here.
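As a hedged sketch of such a queue TTL policy: a policy with the expires key (in milliseconds) tells RabbitMQ to delete matching queues once they have been unused for that long. The helpers below only build the request body and path for the management HTTP API (PUT /api/policies/{vhost}/{name}); the function names, the pattern and the 30 minute value in the usage comment are assumptions for illustration.

```python
import json
from urllib.parse import quote


def queue_ttl_policy(pattern, expires_ms):
    """Build the body of a policy that expires unused queues matching `pattern`."""
    if expires_ms <= 0:
        raise ValueError("expires_ms must be a positive number of milliseconds")
    return {
        "pattern": pattern,                     # regex matched against queue names
        "definition": {"expires": expires_ms},  # idle time before deletion
        "apply-to": "queues",
    }


def policy_path(vhost, name):
    """Management API path for a policy; the vhost must be URL-encoded."""
    return "/api/policies/{}/{}".format(quote(vhost, safe=""), quote(name, safe=""))


# Usage: PUT this JSON body to the path with your management credentials.
#   body = json.dumps(queue_ttl_policy("^tmp\\.", 30 * 60 * 1000))
#   path = policy_path("/", "queue-ttl")
```

Choose the pattern carefully: a policy that matches long-lived queues would delete those too once they sit unused for the configured time.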
If you use Python Celery, make sure that you either consume the results or disable the result backend (set CELERY_RESULT_BACKEND = None).
Connections and channels
Each connection uses at least one hundred kB, even more if TLS is used.
4. Frequently opening and closing connections
If CPU System time is high, you should check how you are handling your connections.
RabbitMQ connections are a lot more resource heavy than channels. Connections should be long lived, while channels can be opened and closed more frequently. Best practice is to reuse connections and multiplex a connection between threads with channels. In other words, if you have as many connections as channels, you might need to take a look at your architecture. Ideally, you should have only one connection per process and use one channel per thread in your application.
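A minimal sketch of connection reuse: instead of opening a connection for every message, a small publisher object keeps one long-lived connection and channel and only reconnects when they have been closed. The Publisher class and the connect factory are hypothetical; the is_closed, channel, basic_publish and close calls follow the pika client, which is assumed here.

```python
class Publisher:
    """Publishes many messages over one long-lived connection and channel."""

    def __init__(self, connect):
        # `connect` is a zero-argument callable returning an open connection,
        # e.g. lambda: pika.BlockingConnection(params) (assumed client).
        self._connect = connect
        self._connection = None
        self._channel = None

    def _ensure_channel(self):
        # Open a connection only if there is none or it has been closed.
        if self._connection is None or self._connection.is_closed:
            self._connection = self._connect()
            self._channel = None
        if self._channel is None or self._channel.is_closed:
            self._channel = self._connection.channel()
        return self._channel

    def publish(self, routing_key, body, exchange=""):
        channel = self._ensure_channel()
        channel.basic_publish(exchange=exchange, routing_key=routing_key, body=body)

    def close(self):
        # Close once, when the process is done publishing.
        if self._connection is not None and not self._connection.is_closed:
            self._connection.close()
        self._connection = None
        self._channel = None
```

Thread-safety differs between client libraries, so check your client's documentation before sharing an object like this across threads.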
If you really need many connections - the CloudAMQP team can configure the TCP buffer sizes to accommodate more connections for you. Send us an email when you have created your instance and we will configure the changes.
Number of channels per connection
It is hard to give a general best number of channels per connection - it all depends on your application. Generally, the number of channels should be larger, but not much larger, than the number of connections. On our shared clusters (Little Lemur and Tough Tiger) we have a limit of 200 channels per connection.
5. Connection or channel leak
If the number of channels or connections gets out of control, it is probably due to a channel or connection leak in your client code. Try to find the reason for the leakage and make sure that you close channels when you don't use them anymore. An alarm can be activated from the control panel of your instance. This alarm will trigger when the number of connections gets over a specified threshold.
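One common way to avoid channel leaks in client code is to tie the lifetime of a short-lived channel to a block of code, so that it is closed even when an exception is raised. The open_channel helper below is a hypothetical sketch that assumes a pika-style connection whose channels expose is_open and close.

```python
from contextlib import contextmanager


@contextmanager
def open_channel(connection):
    """Yield a channel on `connection` and always close it afterwards."""
    channel = connection.channel()
    try:
        yield channel
    finally:
        # Runs on normal exit and on exceptions, so the channel cannot leak.
        if channel.is_open:
            channel.close()


# Usage (assumes an open, long-lived pika connection):
#   with open_channel(connection) as channel:
#       channel.basic_publish(exchange="", routing_key="tasks", body=b"job")
```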
6. AWS Big Bunny Users with high steal time
If you are on the Big Bunny plan on AWS, you are on T2 instances. If you, for example, have too high message throughput for too long, you will run out of CPU credits. When that happens, steal time is added and your instance starts to throttle (it will run slower than it should). If you notice high steal time, you need to upgrade to a larger instance or identify the reason for your CPU usage.
7. Memory spikes on RabbitMQ 3.6.0
If you are running RabbitMQ 3.6.0, we recommend that you upgrade to a newer version of RabbitMQ. RabbitMQ 3.6.0 is not a good version of RabbitMQ - it has a lot of issues that are fixed in newer versions. We recommend upgrading via the CloudAMQP Control Panel: just click upgrade and select 3.6.6 in the node tab for your instance (or perform a cluster migration down to 3.5.8).
We value your feedback and welcome any comments you may have!
This is a list of the most common errors we have seen over the years - let us know if we missed something obvious that has often happened to your servers. As always, please email us at firstname.lastname@example.org if you have any suggestions or feedback.