In this talk from RabbitMQ Summit 2019 we listen to Jack Vanlightly from Pivotal.

They call RabbitMQ the swiss army knife of messaging systems, but what impact do all those features have on performance? In this talk we'll review common features and their effect on end-to-end latency, throughput and broker load. We'll see concrete numbers and come to actionable conclusions that will help you make more balanced decisions.

Short biography
Jack Vanlightly ( Twitter, GitHub ) is a software engineer since 2005 with a background in event driven architectures, data engineering and automation.

Feature complete: Uncovering the true cost different RabbitMQ features and configs

I'm Jack Vanlightly. I've just been specializing in messaging systems, all types of systems, mostly RabbitMQ and Kafka. I've been involved in testing, a lot. Most of my RabbitMQ stuff, this year, has been testing. Started out with correctness testing of Quorum Queues and single writes to consumer. Now, it's branched out into performance testing. Most recently, I've been stress testing Quorum Queues and trying to make it break. I really like breaking stuff. It's nice to be paid to break stuff.

What is performance?

What is performance? With a message queue, we’re really looking at throughput. That's messages in and out, megabytes in and out, and we're looking at the end-to-end latency - the time between when a message is published to when it's consumed.

We’ve got a bunch of conditions. We’ve just got loads and loads. We're just going to explore all these different types of scenarios and loads of features. We'll just see how many features we can get through. There are too many features to combine to include in the 40-minute talk, so we won't try. I've tried to pick what I think is interesting and, hopefully, you'll find it interesting as well.

Also, I want to kind of put things in perspective because a lot of the time I'm going to be talking about 10,000 messages a second or 100,000 a second. What is your daily volume? Who has, say, 100,000 messages per day? What about a million, 10 million? Is that the region? Anyone get more, like a billion? Anyone with a billion?

Okay, so there's a few who might hit, 10,000 messages a second. Obviously, you have peaks in there as well. You might even have just periods of time where you hit many per second, but probably most people actually are more in the million range or 10 million. It's really not that many per second.

So many features…

We're going to be exploring kind of what the limits are of RabbitMQ with different configurations, and features, and seeing where it goes.

Our Hardware Configurations

I've got two example clusters, I've got a c5.9xlarge, that’s EC2. That’s 36 cores, 72 Gigs of RAM, and a 10 gigabit network. Another one, which is c4.4xlarge, which has 16 CPUs, and less RAM and stuff. I went for the c4.4xlarge because it doesn't have any burst. You have a lot of burst with smaller instance types from Amazon. I don't want to hit a burst and then suddenly throughput goes like that. It just ruins your benchmarks. Whereas, the c c5.9xlarge is the smallest one of the C5 range which doesn't have bursts.

I apologize to my employers for spending all the money on running all these benchmarks because they get a bit expensive, anyway.

Exchanges

Let's have a look at exchanges. I see the publisher to an exchange, that goes nowhere. I mean, you publish it to an exchange, there's no queues bound, not very useful. But it’s interesting to know what is the throughput limit of an exchange.

I just chose the fanout because it's like one of the fastest and we see that a single publisher on our smaller server, that’s a fourth generation CPU. Whereas, the bigger server has got sixth generation, so it's got more cores, and each core has got better single-threaded performance. We see that a single publisher can do up to almost 150,000, on these two machines.

That's just one publisher. It's not to say that's all that this particular exchange can do. It can do more. If I add more publishers, I do million a second. That's pretty good, isn't it? They go nowhere. No one gets any messages. But it's interesting to see that the exchange itself is usually not the bottleneck.

Throughput

Let's have a look at sending all the way to a queue. We’ve got one publisher, sends through an exchange. There’s one queue bound to this one consumer.

Spot the exchange…

We're going to play spot the exchange. We've got three topic exchanges. One where we're doing the exact match, one’s using star, and one’s once using the hash. Only finding key. We've got a headers exchange with one header. We’ve got a fanout, a direct, and the default exchange. Try thinking in your heads. Work it out. Let’s see if you're surprised.

Well, personally, I have never done a huge amount of performance testing of RabbitMQ before. I started at Pivotal. I was really surprised about the topic exchange, been the slowest here. I really expected headers to come in slower.

Now, I'm not saying that the topic doesn't scale out better. If you have a thousand topic exchanges, maybe that does better than a thousand headers. Also, the headers exchange does start to drop as you add headers. The more headers you add, it does start dropping. But it still does reasonably well against Topic. The default exchange is by far the fastest. Even though it's also a direct exchange, for some reason, for which my colleagues in the team could probably tell us. It's a lot faster. Then, we’ve got the fanout which is, as we saw in the cloud AMQP talk, most messages end up getting delivered to more than one queue.

Fanout Exchange

Here, we're going to look at a little test where we start out with a publisher doing a thousand messages a second. We just start adding queues to our fanout. We just see how far we can get.

At the bottom here, we see queue count. We're starting to add queues. The little blue, at the bottom, that's the publish for both of our servers because, remember, we’ve got the big server and we’ve got the medium size. We see that the outgoing is keeping in lock step with the queue count. It’s basically telling us that RabbitMQ is handling. No problem. At the end there, 1000 messages in, 10,000 out is not a problem.

What if we take it further? We see that, now at 100, both of our servers are handling it. We're getting 1000 in and 100,000 out. I have to point out, I was using a moded version of PerfTest for my own purposes and my rate limiter is a little off, it's about 10% out. You’ll see that the numbers are just out a little bit. It should be 100,000 but it goes about 10% more.

Now, as we start adding more and more, we're getting up to 200 queues, we see that the largest server, it's topped out at about 150 queues. 1000 messages in, 150,000, out and it can't do anymore. Whereas, the smaller server, it's kind of topped out at 110 queues. There's a limit to how much you can go in and out.

Latency

What about latency? In this test, we have a single publisher. We start publishing 10 messages a second. And then, 20. And then, 30. We're slowly increasing the rate. And then, we like, start doing a hundred more, a hundred more, and a hundred more. We end up on about 1000 messages a second. We'll just see what the end-to-end latency - how that grows.

On the top, we've got our smaller server. On the bottom, we've got our larger server. This is how the different types of exchange cope with it. We see that the direct and fanout are just a little bit better there. A second and one and a half second latencies. Headers is just looking like the worst, but you can definitely see it grow as we start increasing the message rate. Still, it's only about a millisecond to a millisecond and a half when we reach 1000 messages a second.

That was the 50th, and 75th, and the 95th percentile. What about the 99th and the 99.9th? We see some bigger peaks. Headers has definitely got the higher tail latencies. Direct and fanout seem to have the lowest. Topic’s not doing too bad either.

Consistent Hash Exchange

What about the consistent hash exchange? This has been mentioned. It was mentioned in the WeWork and, I think, mentioned somewhere else. And so, what is it exactly? It's just a way of scaling out a queue. Now, why would you do that? Maybe you’ve reached the throughput limit of a single queue. We saw that, through the default exchange, we were able to publish in the region of 70,000 to 90,000 a second. Those were very small messages, but that's about the limit you'll get of your queue. And if you have more than that, then you'll need to publish to multiple queues. That's one reason to use it.

Another reason is maybe you've got slow consumers. You might have quite a slow number of messages in. Maybe, you only got one a second or 100 a second, but it takes you a second or five minutes to process a message and you want some ordering guarantees when you process it, because if you add competing consumers on the same queue, you get delivery in the right order, but you don't necessarily get any guarantees about processing order. That's another reason to use it. And that would’ve solved WeWork’s situation. There's nothing wrong with doing it yourself.

One interesting aspect of the consistent hash exchange is it actually gets slower for a single publisher as you scale out the number of queues you bind to it. Here, we’ve got one queue and it's managing 75 for the big server, 65 for the little one. Then, we add a second queue. Whoa. Look, throughput just gone down loads. Well, that's not very useful, is it?

Also, take into account that one of the use cases is not to actually increase your total throughput, it’s to allow consumers to scale out while maintaining ordering. This is definitely something to take into account. And the reason is, most likely-- or one of the contributing reasons is the multiple flag. Or, that's the least thing the confirms will see, because we get the same kind of behavior with confirms. With the more queues you add, the less multiple flag usage that broker can make use of, so it becomes less efficient.

And then, we can see, as we add more and more, at the end now, we’ve got 1000 queues bound to our single exchange. The throughput of a single publisher has gone down to like 5000. It’s something to take into account or you need to be aware of. If you’ve just got a very fast publisher - just one, then take that into account, but that's just one publisher. If you've got more publishers, we don't have this problem. In fact, when we have 10 publishers, we actually see that we're reaching almost 350,000 messages a second with four queues bound to our exchange. And then, after that, it starts to degrade again.

With RabbitMQ, you'll find that, depending on your workload and your hardware, scaling out your queues, you get higher, and higher, and higher total throughput. And then, it will start going down again. The degree it goes up and down is workload dependent. Here, we see, we've got 10 publishers and we're just slowly increasing the number of queues. We also see that our big server really takes advantage here.

The latency, we see much higher latencies with the smaller server, around 3 seconds. They’re all quite bunched together, 50th, 75th, up to 99.9th. This is another thing we'll see from the consistent hash exchange and the sharding plugin is that the latencies can be a bit higher but they’re all bunched together. You don't get huge tail latency spikes. Our big server is coming in at one a second, even when doing 350,000 messages per second.

Confirm’s and ack’s

Confirm’s and ack’s. That was all without any confirms and acks. You will lose messages any time your connection fails. Connections fail all the time.

We’ve got replication, mirror queues and quorum queues, in case a node goes down. But how often do nodes go down? Not that often.

How often do connections fail? All the time. So you really, if you care about your messages, want to use confirms. But they do take a bit of a hit when it comes to your overall throughput. But remember, we've got here 100,000 messages a second at the end there with 10 queues. That's still quite a lot. You'll notice that the smaller server, in yellow, on the top there, it tops out. And then, it just hits a wall. And then, it just goes down. Whereas, the larger server just keeps going.

If we keep on adding queues, we'd see it dip again. There’s an optimum. If you have a total throughput of your breakout, there is an optimum number for a particular workload, often number of queues, depending on your hardware.

Content Header Frame

We’ve taken a bit of a hit with throughput but at least, in this case, the latency is quite similar, around about a second, a bit over a second, or about 3 seconds for the smaller server. But that's it, very high throughput.

Scaling Out a Queue / Sharding Plugin

The sharding plugin is an alternative to the consistent hashing exchange. You can use x-modulus hash exchange. It, more or less, does a very similar thing. It uses hashing function to route to different queues and usually have one exchange. The difference with the sharding plugin is it does a lot for you. It will auto-create your queues. You kind of treat all those queues as one logical queue. You only publish to one queue and, in the background, there might be 10, 20, or 30 queues, but your clients don't realize that.

And you see, the single publisher slowdown also happens with the sharding plugin but it's not nearly as severe. And so, as we are adding queues here, up to 10 at the end, we've had a slight throughput drop. Here, we have 10 publishers, we get a massive-- actually, the peak performance is about five queues with 400,000 messages a second. And then, it starts dipping again. Latency is basically identical to the hash exchange.

With confirms, again, we're seeing the same thing. The larger server, it’s able to cope with the extra queues that get added and just keeps on asking for more. If we carried on, no doubt, it would start dipping. But our smaller server does start taking a hit. It is important to do these experiments to work out what size of servers you want because if you don't need to be spending 3000 a month on the big one then there's no point wasting money.

Again, very similar latencies. You see, at the end there, the latency starts going up and that's because we saw the big dip in throughput as the smaller server was struggling to cope. As it struggles to cope, latency goes up. Whereas, the bigger server just carried on coping and actually latencies were going lower and lower the more queues we added.

Competing Consumers vs Scaled Out Queues

What about competing consumers? Do we really want to jump on to scaling our queues which has a lot more complexity? When do we want to just add consumers? It's so easy. It's really the easiest thing is just to add consumers. If you haven't reached the throughput limit of a single queue, then you should probably do that.

The only time you might not want to do it is when ordering is very, very important because you do lose your ordering guarantees of processing when you have competing consumers, all different prefetches. There’s no guarantees you get perfect ordering when you process them. Whereas, if you've sharded your queue, and maybe you have a single consumer, you could use the single active consumer or the exclusive consumer feature and get Kafka-like functionality to replicate how it does like topics and partitions. You can do that.

Latency

What about latency? Let's just say we have 10,000 messages a second. Each consumer takes 5 milliseconds to process every message, that means we need about 50 consumers to keep up with the producer.

We'll go for a hundred just to make sure. In one, we've got 100 queues bound to a consistent hash exchange and each queue has a single consumer. Or, we have a single queue and we have 100 consumers, all consuming off that single queue. Now, which one’s going to have better latency?

We know that they'll both be able to handle the 10,000 messages a second. Well, competing consumers, by far, are better. We get in around 1 millisecond, with the 99 percentile peaking at about 4. Whereas, we get this pattern with both the consistent hash exchange and the sharding plugin, you do get more latency. There is more or less close together. We don't tend to get huge spikes with either but you do get high latency.

Take these things into account. If latency is very, very important to you, I would argue, normally, 30 seconds and 30 milliseconds is not a big deal, but it's all dependent on your situation.

Single Queue

What about a single queue? We looked at some exchanges. What about a single queue? Without confirms, autoack, it's just all fire and forget. If we lose some messages, then no worries. We've got one publisher who keeps increasing its publisher rate higher, and higher, and higher until it gets to a target of 100,000 a second. We see that the smaller server kind of can't quite manage towards the end and the bigger server kind of manages to keep up. That's going to be down to the single-core performance of fourth gen versus sixth gen on the CPU because it's just a single queue.

The latency spikes at the end, for both, because they both get to a point where they’re struggling because when you reach the throughput limit of a queue - latency, there's no guarantees. But if we look at before it started struggling, latency is pretty good, 50th to 99.9th feel rather close together and they do increase as we increase the publishing rate.

Slowly increasing message size / Throughput

What about message size? In this test, we start with very small messages, 16 bytes. We double it, and double it, and double it. And we just see what happens to throughput. Up into a kilobyte by managing more or less stable throughput. And then, end up pushing around 80 meg's a second of message body. The total throughput will be higher, but that’s of a message body. If we zoom out a bit and continue the tests, doubling it until we get to a megabyte, we see throughput drops considerably for both. You might be thinking, “Oh, dear, RabbitMQ doesn't do very well with large messages, does it?”

Let's just have a look at the throughput of megabytes. We're actually pushing a gigabyte a second through a single queue. The only reason we can't push more is because we actually reach network saturation. In fact, the top one is more or less, it didn't quite saturate the network. The bottom one, you see that straight line. If you're ever doing stuff in cloud and you suddenly see a straight line, you've hit some limit. The limit here is the actual network limit of the incidence, which is 5 gigabits.

A single queue can, without replication, without anything, you know, you send bigger messages fast enough, a single queue can actually saturate your network. Let's not be so carefree with our messages. Let's actually have some security. We’ll have some publisher confirms to make sure they actually get delivered. We'll just keep the autoack and would you look at confirms?

Slowly increasing max in-flight limit

What about like virtually no latency? This is just a three-node cluster. Everything’s in the same availability zone, very low latency. What we're going to do is we're going to slowly increase our in-flight limit. That's like the number of unconfirmed messages that you can have at a time. So, your publisher’s sending messages. And then, the broker sending confirms. You just make sure you don't exceed some limit of the number of unconfirmed you have at a time.

We see that, as we allow our publisher to have more and more unconfirmed messages at a time, we see an increase in throughput. It peaks at around 15 to 20. But, if we had some latency-- this is round trip. We don't publish. And then, wait and publish. We've got like a stream of publishing and a stream of confirms coming in, but it's a round trip. Round trips are very sensitive to latency.

With just 5 milliseconds of latency, we actually see, we're seeing a direct benefit in throughput as we're increasing our in-flight limit. We keep increasing it and we keep seeing that direct jump in throughput. In fact, we have to go to-- in this case, with just 5 milliseconds, to about 300, 400 in-flight max. Now, if you take that to 10 milliseconds, or 20, or 50, you need to have a higher and higher in-flight max to become immune to the latency. You can become immune to the latency. You can actually set it high enough such that it doesn't penalize you.

Acknowledgments and prefetch

Now, on the consumer side, we also have the same mechanism. We have acknowledgments and we also have prefetch. Prefetch is the same as what I call the in-flight max, but on the consumer side. This is the number of un-acked messages the broker wilI let you have on that channel.

Ack every message

And so, we see that these two got a bit out of sync, these two tests. We see that we get benefit. This is with no latency, up to about a prefetch of about 5. We set the prefetch to 10, or 20, or 100, or even 1000, we don't really see any benefit after that, at least on this unreplicated queue.

If we had some latency, now we can start seeing that higher and higher prefetches are actually very beneficial. We actually start topping out at about 150, say, but that's acking every single message, so we're sending an ack for every single one.

Multiple flag

There's a better way. There's the multiple flag. We could send an ack 10 messages. Set the multiple flag and you've just acked 10 messages just in one operation.

So, we had in about 8000 a second, let's set the ack interval to 10% of our prefetch. If we have a prefetch of 100, then every 10th message, we’ll send an ack for the multiple flag. And so, we see that we get higher throughput. We're peaking at about 13. The prefetch, it's about approaching 200 with an ack interval of 20. This is with 5 milliseconds of latency. Again, if you had more latency, it'd be higher.

What is the ideal ratio?

What is the optimum? Well, let's use the multiple flag even more. Let's do 50% so that we’ve sent even less acks. Well, actually, we've dropped throughput now because by sending so few acks, we've kind of effectively dropped our prefetch a bit because RabbitMQ sends a bunch of messages. We wait. We wait. We wait. And then, when we say, “Hey, we've got a thousand.” And RabbitMQ’s like, “Whoa, a thousand. Right. I need to send you more” rather than doing it kind of in a more streaming way.

You can actually set it too high. If we've got, say, a prefetch of 200 and we start with 10%, so about ack interval of 20, then 40, then 60. That's pretty optimal. We're hitting our max throughput. When we hit about 40% to 50%, it drops. And then towards the end there, if our prefetch is 200 and I want to wait to get 210 messages, it's not going to go well. I'm just going to end up waiting forever and not consuming anything, so we end up dropping to 0 throughput. Be careful with that.

At least with this 5 milliseconds, 0 and 5, this one went through toxiproxy, so it could simulate latency. I don't know if toxiproxy ended up slowing things down. But, at least, in these two tests, 0 latency at the top, 5 milliseconds at the bottom with very high prefetch and in-flight. We made ourselves immune to the latency where, actually, the higher latency didn't actually cost us any throughput at all. Depending on your latency, you think about your optimum settings.

Lazy Queue

Queues, again. What about the lazy queue? Who here uses a lazy queue? A few. Okay. Cloud AMQP, do advise them because one of the most common problems they see is memory. And so, lazy queues are great because they just write to disk straight away and evict it from memory. And so, if you've got any problems with memory, then you can avoid a lot of those with lazy queues. What happens when you reach your memory limits? Your publishers get throttled and they can get throttled a lot which is very bad for performance. We really want to avoid running out of memory.

Running a few benchmarks. We got very similar for our increasing publisher rates, very similar latencies. We were able to push a gigabyte of messages. I mean, this queue is behaving just as well as a classic normal queue, at least in this benchmark.

Normal vs Lazy / Memory Usage

It has a very different memory profile. The first queue is not lazy. It is a normal queue. We publish 3 million messages which is quite a lot. Look at that, it dropped by 14 gigabytes of available memory. Then, I deleted the queue and it got it all back.

The lazy queue, look at that. A little spike. And then, another little spike at the end. Look at the difference in memory usage.

In this test, I published a million, I rest, and then consume a million. You see a little blip there and a little blip. We get a blip when the messages are kind of moving, but when it's at rest, it's flat.

With the normal queue, we get a massive drop. During the publishing, it recovers. It goes flat. But we're still a gigabyte less than we had before. And then, again, a little dip when we start consuming it.

This is with 10 million. Just to show, in the yellow there, that's the period of activity and we can see that it's more costly in memory to publish than it is when we consume for the normal queue, but it's just a little tiny blip there for the lazy queue.

Lazy queue’s, probably, you want fast disks because disks might be your bottleneck for lazy queues. But one of the most common problems with RabbitMQ is memory, so consider lazy queues. In fact, you see here, I actually had faster lazy than normal, because I published my messages a lot faster. I didn't actually end up looking into the why that was.

Single classic queue summary

You get single queue limit. You can saturate your network or your disk, depending on what type of key you're using. Publisher confirms and acks do reduce your throughput, but if you get your prefetch and your acks right then you can limit that drop. Consider lazy queues to avoid memory problems.

A Single Replicated Queue

What about replicated queues? Normally, we don't talk about replicated queues. With RabbitMQ, we talk about mirrored queues. Now, we've got two types of replicated queue. I will just be using that more generic term.

Let's have a look at a mirrored queue. It's just one queue on a three-node cluster. What’s its throughput and latencies?

We can go for different replication factors. On the end there, we have a five-node cluster. You actually see, for one node, throughput doesn't drop too much. It's actually not bad, going from two, to three, to five. Latencies in the 50th, and 75th are all very similar. At the top, there, we have our smaller server. The bottom is our bigger server. We go from 2, to 3, to 5 replication. Five would be one leader and four mirrors. It's not looking too bad.

What you'll see with mirrored queues is the tail latencies. It's got quite high tail latencies. That's where you're going to see your surprises when it comes to latency. It gets worse as we have a higher factor where you can see the blue line, which is the 95th percentile, coming in there. And we actually getting the 250 milliseconds.

What about our larger messages? We're able to push a gigabyte a second on a single node and replicate it. With the replication factor of 2, that's down to 250 to 200. It's quite a drop. To 3, we go down to 150, for the smaller one. And you see, it's flat, flat, flat when saturating our network in every single case, because even with a single node we were able to saturate it but the throughput which we reach, when we saturate the network, gets lower, and lower, and lower. You can saturate your network very easily.

Quorum Queues

What about quorum queues? What does that look like? Not bad. We're actually able to hit 30,000 messages a second, on the big one, for just one queue, with a replication factor of 3. And it goes down to 21,000. It's actually a bit faster than our mirrored queues.

The latency’s, slightly higher, here on the 50th and 75th percentile. But more interestingly, for me, the tail latencies are a lot lower. You will see this in benchmark after benchmark, quorum queues have much more stable and predictable latencies. That's something which is seen over and over again.

Message size

What about message size? Again, we pay the price for replication factor, 250 to 150 here, and lower again. Replication costs, basically.

Single replicated queue summary

Replicated queues, choose well. When you want to replicate, make sure that it's necessary. Most of you probably don't have huge throughputs, then you probably have nothing to lose from replicating. But if you do have high throughputs, then beware that the more you replicate, the higher the latencies and lower throughputs you'll see, but you do get some more stable latencies with quorum queues which is nice.

Multiple Classic Queues

What about multiple queues? So far, we've been looking at like one queue. We looked at a few multiple-queue scenarios with a consistent hash exchange. What about if we've just got a bunch of queues which are independent of each other and they're just hosted on the same broker? How does large amounts of queues do with RabbitMQ?

Well, we can see that the large cluster there carries on. We end up with 1000 queues with a target of 200 messages a second and it manages it. It reaches to 200,000 messages a second. It is starting to struggle. You see the dips. When it's not struggling, it will just be flat. It would just be stable. The more it struggles, the more you'll see this. You might see publisher and consumer rates diverge. That's when you know RabbitMQ’s struggling. The smaller server tanks a lot faster.

Optimum number of queues for total throughput?

What's the optimum number of queues? Here, we see that actually between, say, 5 and 50 queues is where it's kind of got the same amount of performance. And then, it starts going down. The more, and more, and more queues you add, the lower the total throughput that you'll get.

Confirms / Acks

What about confirms and acks? What about that? Again, we start adding queues, with a target of 200 messages per queue. We reach our limits very much earlier with publisher confirms and acks. You see the divergence of publisher and consumer rates. That's quite common when you put RabbitMQ under a lot of stress. You really want them to be together or your queue is going to be backing up. You see that they struggle a lot sooner.

One broker, multiple classic queues summary

So, 5 and 50 queues, we see that with large number of queues, our larger server, with more cores, did a lot better. It was able to use those extra cores quite well.

Cluster / Multiple Classic Queues

What about a cluster? Let’s say we’ve reached our limit on a single-- we want to scale out and we want to cluster. We just add nodes and we just get linear scaling. Nothing will go wrong.

Here, we’ve got three brokers. We’ve got the same tests. We just add queues, keep adding queues, 200 messages a second per queue. This time, you’ll notice that our smaller server, which tanks halfway through, made it all the way to the end, successfully. Publish confirms. Again, it dropped much sooner.

What about 5? Again, obviously, 5 handled the left-hand one . The left-hand one, if I didn't make that clear, is without publish confirms. The one on the right is with confirms and acks. And you see, actually, look at the difference on the confirms and acks one. We see a big difference. It really handles it a lot better with 5, so scaling out worked.

We know, many, many, many, many queues-- you reduce your total throughput. Let's try fewer queues with a higher rate per queue. A thousand messages a second, 500 queues. We reach a limit there with both. Note that they both struggle at the same time, 36 cores did not help there, but we're still seeing the benefit of 36 cores with the confirms.

With 5, again, we are getting no benefit from 5 brokers with 1000 queues without confirms. There is some limit. Maybe it's Erlang itself. I'm not sure, actually, but we reach a limit. But with the confirms, it's still keeps going. We'll see this again and again, when you use confirms, the more and more you scale out, the more you'll get.

What about 100 queues? Now, we’re getting to the sweet spot. We're reaching 400,000 messages a second. I don't know why in this test, and I did repeat it a few times, our smaller server actually did better than our larger server, except when it came to publishing confirms and acks. Yet, again, our larger server with more cores, did far better - far, far better. And also, it’s quite respectively 250,000 messages a second at around 40 queues.

5 brokers. Again, no real benefit to the extra nodes there when we're not using confirms. We do get a big benefit, again, when we're close to 350,000 messages a second with confirms and acks which is pretty good.

But I've been cheating. We talked a bit about data locality in the panel. That basically means that if our queue is on broker 2, then if we connect to that broker, we’ll get really good throughput. But if we just connect randomly, we end up roping a bunch of servers to serve a single queue. If we end up having to publish a connects to broker 1, that traffic will get routed to 2. And then, the consumer connects to 3. And so, now, all three brokers are serving one queue. What scalability benefits does that really give us? Let's have a look.

Back to the 1000 queue test. Check this out. One broker, Three brokers. One - three. One - three. This is when I didn't cheat. I just randomized the brokers through a load balancer. We didn't actually get any benefit from three brokers at all.

A cluster, multiple classic queues summary

What about with confirms and acks, because we always see a completely different profile when we use those? This time, look, we hit 100,000 before it all went wrong, before RabbitMQ got overwhelmed. If we compare, with one broker, 15,000 - 100,000. Even though we're not connecting to the local, we're still benefiting when we're using confirms. It's just RabbitMQ has to do a lot more. It's managing a lot more, doing a lot more. And so, we don't see the drain and the loss of throughput.

When we're not using confirms and acks, we don't always see a larger benefit of a larger server, when we're using it in a cluster. We probably won't see linear improvement in scale. Connecting to your local broker, you’ll get much better latency, much better throughput.

With confirms and acks, we still benefit. We still benefit even if you don't connect to the local one.

Consistent Hash Exchange

Based on the consistent hash exchange and a cluster this time. With the one broker, we were able to reach almost 350,000 messages a second. What about three? Well, we didn't improve. We didn't improve on our peak because we just seemed to have reached the limit that Erlang can handle on that cluster. The good news is, as we've scaled out our queues, kind of more or less maintained our throughput. With the single broker, it starts dipping, and dipping, and dipping, but we're able to like maintain the kind of total throughput the cluster is capable of much better.

What happens when you don't connect to the local? I cheated again. I'm sorry. This top one is when you just randomize the node that you connect to.

This here, on the left, is when I connected always to the local broker of the queue, for just the consumer. That's when I don't. This is something where the sharding plugin has a bit of an edge because the sharding plugin will actually assign you a queue that is local. That is actually something where the sharding plugin has an edge.

Confirms and acks

What do we got here? Confirms and acks. This is when we connect to the local broker with confirms and acks. As we add queues, we just keep getting the benefit. Again and again, when we're using the confirms, because there's a lot more work to be done, we always seem to be benefiting when we scale out.

Look at that. That was with one broker. This is with three. One, three. One, three. The smaller server definitely sees an improvement when we use in a cluster.

Sharding plugin

The sharding plugin. The top, that was what a single broker looked like. At the bottom, well, it's a bit different now because, with the sharding plugin, it will deploy a certain number of queues per node. If we say, one queue per node, we've got three nodes, then we'll have three queues. If we say three queues per node, we’ll end up with nine. We’ve got 3, 6, and 9 queues and we see that it remains pretty flat. We've reached some kind of limit, some wall, but we're able to maintain that as we scale out.

With the top broker. That's the number of queues. We’ve got one, two, three, up to 10. But, on this one, we've got a number per node. We’ve actually got 3, 6, 9, 12, all the way up to 30 queues. And so, on a cluster, we actually see that drop a lot later. It's really when we hit about 27 queues that we actually see a big drop.

Just to go back to the sharding plugin and consistent hash exchange, just take into account there are limits. There are optimum numbers of queues and measure with PerfTest.

Multiple replicated queues / Mirrored queue

What about replicated queues? The mirrored queue. Let's have a look. Throughput, with no rate limit, where we'll be using confirms and acks because there's no sense in not using them. You, obviously, care about your data. Without confirms and acks, you will daily lose messages.

What’s it capable of? When we’ve got a replication factor of 2, we're able to hit almost 50,000 messages. When we move that to a replication factor of 3, it goes down quite dramatically.

What about when we have a five-node cluster? Looking pretty good. Almost hitting 120,000 messages a second, with mirrored queues, which is not bad. And then, as we increase the replication factor, that goes down, and down, and down. In the end, we've got five nodes. All five are involved in every single message. We get no scaling but we do get safety. That's the tradeoff.

This is with more queues. We had fewer queues. This is with 500 queues. We see, again, that it's less efficient. When we have many more queues, we get lower throughput. It definitely prefers less queues than 500. But when we add more brokers, it is much better. We’re definitely seeing that scaling out a cluster, when we use mirrored queues, seems to help.

Quorum queues

Quorum queues. We saw that a single quorum queue tends to outperform a mirrored queue. But what about we add more and more of them? Here, we’ve got one. We’re seeing their very respectable performance. We had two, it doubles. Then, we got three and four. And then, we stopped seeing any benefit. We're not scaling out any further. You'll also notice that we're not getting any benefit from our 36 cores. We're paying all that money but we're not getting any benefit.

What do they both have in common? They've got the both the same disk. They’ve got an iO1 SSD with 10000 IOPS. Actually, were hitting the disk limit because quorum queues are writing everything to disk. And so, it really doesn't make sense to be having all those cores anymore. Unless, of course, we scale out our disks enough that we might be able to benefit again from the cores.

We got less queues here. We saw there seemed to be about four. It was hard to tell. Here, we had a queue, a queue, queue and we see it is around four queues that seems to reach the limit on these particulars. This is all just one example, so don't go away thinking that every single installation will have this same behavior.

Almost at the end here. What about five brokers? We've got a replication factor of 3. We're hitting around 60,000. That's a bit lower. We're about 75,000 for our mirrored queues, but that was 75,000 with our big server. Our smaller server managed about 40,000. We actually see that it's actually more cost effective with our smaller server because we actually get a higher throughput with the smaller server type, but we get lower throughput with the larger server.

And then, with a replication factor of 5, again, throughput has dropped even more We can see that, when we do a 10-queue test, it's about four or five queues. On the five, it's about three queues that we reach our limit.

Again, we've got 500 queues in the end here. We reach our limit at about 150. It does get rather flat. You’ll notice a little dip there. You might see another one here. Can anyone guess what that dip is about? It’s memory. Quorum queues, I didn't put any resource limits on these queues and, by default, it stores everything in memory.

What happens when your queue gets bigger? Your memory gets exhausted. Memory alarms kick in and your publishers get throttled. This is where it actually got throttled. And you see the big drop. And it got throttled on the other one as well. The bigger server that had 30-- or, I think, had 70 gigs of RAM, it did not get throttled at all.

Quorum queues, it depends. You can get high throughput, the mirrored. Potentially, you might be able to scale out mirrored more but it all depends on the server and the workload. I've managed to demonstrate that, depending on the server, I can actually make quorum queues perform better or worse than mirrored. 20 to 100 queues was the optimum for mirrored and 3 to 4 with quorum.

What do I want you to take away from this? We've seen a bunch of different things. The point is who could have predicted everything we saw then? Who could predict all the things? With all the different combinations and features, it's very, very hard to predict. You've got different hardware. You've got different everything.

And so, you really need to use PerfTest. Use the new observability features to work out how to best choose your server type, the configuration - everything, because there's a big difference in cost. If you over provision, you might be spending 3000 a month when, really, you could’ve done with maybe one c5.2xlarge for 200. You just don't know. You need to work that out and measure it. Don't forget, the awesome new observability features to help you measure.

Questions from the audience

Q: A very quick question. What was the message size in the last quorum tests?

A: It was 1 kilobyte.

Q: This is the very last question, I've been told. Jack, how can people reproduce similar tests for their workloads, have all of those nice charts and, hopefully, come to some conclusions and improve their infrastructure?

You can do everything I did there in PerfTest. Mine was slightly modified because I just wanted a few more metrics that PerfTest doesn't currently have, but we will be adding them soon, like the queue counts, and queue consumer counts and stuff like that. I'm not saying that you get that now on the new dashboards. You can pretty much do everything you need through the new Grafana dashboards. There's one specifically for Perftest. You can see everything perf test is doing. You can see everything your server is doing. It's just a matter of learning PerfTest. PerfTest sometimes is a bit hard to configure. You just need to spend a little time learning how the command line arguments - all the arguments work but it’s worth it.

[Applause]

Follow @cloudamqp

Feature complete: Uncovering the true cost different RabbitMQ features and configs

Feature complete: Uncovering the true cost different RabbitMQ features and configs

What is performance?

So many features…

Our Hardware Configurations

Exchanges

Throughput

Spot the exchange…

Fanout Exchange

Latency

Consistent Hash Exchange

Confirm’s and ack’s

Content Header Frame

Scaling Out a Queue / Sharding Plugin

Competing Consumers vs Scaled Out Queues

Latency

Single Queue

Slowly increasing message size / Throughput

Slowly increasing max in-flight limit

Acknowledgments and prefetch

Ack every message

Multiple flag

What is the ideal ratio?

Lazy Queue

Normal vs Lazy / Memory Usage

Single classic queue summary

A Single Replicated Queue

Quorum Queues

Message size

Single replicated queue summary

Multiple Classic Queues

Optimum number of queues for total throughput?

Confirms / Acks

One broker, multiple classic queues summary

Cluster / Multiple Classic Queues

A cluster, multiple classic queues summary

Consistent Hash Exchange

Confirms and acks

Sharding plugin

Multiple replicated queues / Mirrored queue

Quorum queues

Questions from the audience

Enjoy this article? Don't forget to share it with others. 😉

Elin Vinka

Free Ebook

CloudAMQP - industry leading RabbitMQ as a service

13,000+ users including these smart companies