In this talk from RabbitMQ Summit 2019 we listen to Gerhard Lazu and Michal Kuratczyk from Pivotal.
RabbitMQ exposes metrics and healthchecks that answer all questions. Future RabbitMQ versions will make it easy to visualise and understand what happens under the hood. Join us to learn about the future of RabbitMQ observability.
Gerhard Lazu ( Twitter, GitHub ) is attracted to systems that don't work as well as they could/should. The idea is to break everything apart, understand how everything works, then put it all back together better than the original.
Michal Kuratczyk ( Twitter, GitHub ) is the product manager for RabbitMQ for Kubernetes at Pivotal. Before he became a product manager, he was an engineer building and operating distributed systems and cloud applications. Michal has been a happy user of Prometheus and Grafana for a few years and helped numerous companies improve their monitoring using these tools.
Observe and understand RabbitMQ
Michal: This is Gerhard, engineer on the Core RabbitMQ team and the mastermind behind many of the things we're going to talk about.
Gerhard: I'd like to introduce Michal. He's the RabbitMQ for Kubernetes PM, so if you have any Kubernetes questions, he's the person to ask. The one thing I would disagree with you on, Michal, from that previous note, is that I'm the mastermind behind this. Michal is, because he gave me the idea of Prometheus and I took it really far. A couple of years after that happened, this is happening now, which I'm very excited about. Michal will do an intro and then I'll follow up with the rest.
Michal: Thank you. A lot of our slides are graphs, or dashboards, which may be hard to see on a relatively small screen here. If you want, you can go to gerhard.io. This presentation is available there, so you can follow along on your laptop if you prefer.
RabbitMQ management limitations
All right, so what was the problem with the previous solution? Until now, the management plugin has been the de facto tool to manage and monitor RabbitMQ. Unfortunately, it runs on the same Erlang infrastructure as RabbitMQ itself. It is so tightly coupled that, if you have a problem in your Rabbit, the management plugin probably has a problem as well.
Management not loading
For example, here is the well-known dashboard and you're trying to reload the page on a heavily loaded cluster. It takes roughly a minute just to load the page. That's obviously not good enough. You cannot debug a system like that. What's worse, each node is responsible for storing its own historical metrics, which means that if you restart a node, you lose that history. If you restart the whole cluster to solve a problem, you basically can no longer perform a post-mortem because you just deleted all the evidence.
Management API response
You can use an external system to monitor RabbitMQ, and that will solve the historical data problem, but the performance issue will probably still be there, because pretty much every external system relies on the management API to collect the metrics.

All right. After a minute, finally, we see some flow. Actually, we can't see it, because of this screen resolution issue [laughs]. The management plugin provides the UI as well as an API to collect the metrics.

Here, for example, we measured how long it takes to query RabbitMQ Exporter, a community plugin that has Prometheus support but relies on the management plugin to collect the metrics. As you can see, in this example it takes 36 seconds just to collect the metrics. That's not good enough. You can only collect the metrics about once a minute in this scenario. Although, if you look at the minimum time, it's just 82 milliseconds. If the cluster is not busy, it's actually very fast. The problem here is not the exporter, the community plugin. The problem is the management plugin that actually needs to return the metrics.
New built-in plugin
In 3.8, we have completely new Prometheus support, provided as a built-in plugin. To get started, all you have to do is upgrade to 3.8 and enable that plugin. That will add a new listener. You can see the Prometheus exporter at the end, and that will give you an endpoint that Prometheus can scrape to collect the metrics.
Enable built-in plugin
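As a quick sketch of those two steps (the plugin name and default port are from the RabbitMQ 3.8 docs; this obviously needs a running node, and your hostname will differ):

```shell
# Enable the built-in Prometheus plugin (RabbitMQ 3.8+)
rabbitmq-plugins enable rabbitmq_prometheus

# The plugin adds a new listener, on port 15692 by default.
# This is the per-node endpoint that Prometheus scrapes.
curl -s http://localhost:15692/metrics | head
```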
How does it compare in terms of performance? This is the same cluster, at the same time. You can still see the 36-second time for the community exporter, and just 300 milliseconds for this new exporter, which is obviously much, much better.
An important note, though: the old management plugin provides a single endpoint to collect the metrics for all nodes in the whole cluster. With this new plugin, you need to scrape the metrics from each node individually, which is a good thing, but you just need to remember it. That's why we see three different lines here.
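A minimal Prometheus scrape configuration for that per-node setup might look like the following sketch (the hostnames `rabbitmq-0/1/2` and the default port 15692 are assumptions for illustration):

```yaml
scrape_configs:
  - job_name: rabbitmq
    static_configs:
      # One target per node. Unlike the management API, each node
      # only reports its own metrics, so every node must be scraped.
      - targets:
          - rabbitmq-0:15692
          - rabbitmq-1:15692
          - rabbitmq-2:15692
```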
Now that we have a new, reliable source of data, we also needed a new way to visualize that data. Grafana is the perfect tool for the job. We created a bunch of dashboards that you can just import. This is the RabbitMQ-Overview dashboard, which shows roughly the same data you would see in the management UI, but you also get the whole power of Grafana and Prometheus. You can customize this dashboard. You can add your own graphs. You can set thresholds. You can define alerts and so on.
Spot queue imbalances
Because we collect all metrics on a per-node basis, it's very easy to spot imbalances within the cluster. For example, this graph shows the number of queue masters per node. As you can see, there are 10 masters on one node and no masters on the other two nodes. This is actually a very common problem because, by default, when you declare a queue, the node you are connected to becomes the master. So, if you had one client declaring the whole topology, you will often end up in a situation like that. There are some workarounds but, even if you initially avoid this issue, after running an upgrade or some other maintenance operation on the cluster you may end up in a situation like that.
RabbitMQ - queues rebalance all
For this, we now have a new command which you can just execute on any of the nodes to rebalance and reshuffle the queue masters between the nodes. If you look carefully, you can spot the moment where we executed this command on this cluster.
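As shown on the slide title, the command can be run on any node; roughly (command name per the 3.8 CLI, run against a live cluster):

```shell
# Rebalance and reshuffle queue masters across all cluster nodes
# (RabbitMQ 3.8+); can be executed on any node
rabbitmq-queues rebalance all

# See 'rabbitmq-queues help rebalance' for options such as
# restricting the rebalance to matching queues or vhosts
```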
Evenly spread queues
All right. RabbitMQ relies on Erlang for clustering and communication between the nodes. To really understand what's going on, you also need to monitor Erlang. This dashboard is exclusively focused on the Erlang distribution.
See node links, data sent to peer
For example, you can see whether the links between the nodes are healthy. You can see how much data is being transferred between each pair of nodes and much, much more.
Data in port driver buffer
For example, here we can see data in the port driver buffer, which likely doesn't tell you much. But, first of all, having graphs like that, if you see a spike like that around the time you had an issue, it's probably worth investigating. Maybe it's not the cause, maybe it's a symptom, but it's certainly worth understanding what happened here.
This particular metric is actually very important. It's something that we didn't expose previously. This is the amount of data that, in this case, is waiting on node 0 to be transferred to node 2. Before any new message, any new piece of data, can go between these nodes, 120 megabytes of data has to be transferred over the network first. We cannot expect low latency at this point. This is a very important metric.
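As a back-of-envelope illustration (my own arithmetic, not a figure from the talk): a full distribution buffer translates directly into added latency, because a new message has to wait behind the whole backlog:

```python
# Rough estimate of the extra latency a full distribution buffer adds.
# The 470 MB/s figure is the single-link limit quoted later in this
# talk for 10 kB payloads on OTP 22; real rates vary by workload.

def backlog_delay_seconds(buffered_mb: float, link_rate_mb_s: float) -> float:
    """Time before a newly enqueued message even starts transmitting."""
    return buffered_mb / link_rate_mb_s

delay = backlog_delay_seconds(buffered_mb=120, link_rate_mb_s=470)
print(f"{delay:.3f} s")  # roughly a quarter of a second, per hop
```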
You could say now that we have Rabbit metrics or Erlang metrics, that's what we need, right? Well, we are still missing the most important part. What the application can actually see, right? Users don't complain about data in the Erlang buffers, they complain about the application being slow.
Perf-Test dashboard - v1
Here we have a dashboard showing the latency and throughput for PerfTest, which is our internal application that we use for testing. You should instrument your own applications like that, which is not that hard. It's actually relatively easy. Even if you don't have that instrumentation, you can still deploy PerfTest to check the numbers that your users or developers are reporting. If they say, "I can only publish 1,000 messages a second", you can deploy PerfTest and see whether it's a problem with the cluster or maybe with the application, with the client.
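For example, the workload described earlier in this talk (100 msg/s, 10 KiB messages, one publisher, one consumer) could be reproduced with something like this (flag names as I recall them from PerfTest's help output; check `--help` on your version, and the URI is an example):

```shell
# Reproduce a reported workload with PerfTest, exposing its own
# latency/throughput metrics for Prometheus to scrape
bin/runjava com.rabbitmq.perf.PerfTest \
  --uri amqp://guest:guest@rabbitmq-0:5672 \
  --producers 1 --consumers 1 \
  --rate 100 --size 10000 \
  --metrics-prometheus
```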
With so much data, sooner or later, you will see a piece of data that you don't understand. There'll be a graph that has some weird spikes or weird curve and you would like to understand what's going on. Therefore, for many of the graphs, you will see this I icon that provides additional explanation of what this metric actually means. It will link you to the relevant parts of the documentation.
If that's not good enough, if you still have a problem, you still don't understand, you still can't solve the issue, Grafana has a great feature which allows you to share a snapshot of your dashboard. Instead of taking a screenshot that we need to then work with, you can actually expose a subset of your monitoring data to someone on the mailing list or with Pivotal. Then, we can actually work with the data and not a screenshot of a dashboard.
Upgrade to RabbitMQ 3.8
All right. First of all, upgrade to 3.8. There are many amazing features waiting for you. It's technically even possible to run the management UI without the metrics subsystem. It's not a very beautiful dashboard when you do that but we will deal with that later.
Enable and visualize new metrics
Please enable and visualize these metrics. That's the first step towards having factual data-driven conversations in the future.
Share metrics when asking for help
When you ask for help, please help us help you and share these metrics so that we can actually understand what's going on.
With that, I will hand over to Gerhard who will show us even more awesomeness.
How to understand
Gerhard: Thank you for that introduction and for warming the room up. I really appreciate it. Thank you very much.
We talked about the limitations within the previous RabbitMQ versions. We also heard about the new metric system in RabbitMQ 3.8. But, all of this would be for nothing, if there wasn't a higher goal. We're not trying to show off here. We're not trying to play with colors. We're trying to understand something. We're trying to get you to understand something and to have different conversations.
What happens in the Erlang Distribution?
One of the things that keeps coming up, or has been coming up over the years that I've been in the RabbitMQ community, is, "What is happening when we're using mirrored queues? What is happening at a cluster level? What happens between the nodes?" Something else which keeps coming up is, "What's happening with the quorum queues? How are they different? Can you tell us? Can you show us the benefits?" There is one very obvious benefit which I'll show you in the next couple of minutes.
Mirrored Classic Queue
This is a typical mirrored classic queue. It has one master and two mirrors. This runs in a three-node cluster and we're publishing 100 messages per second. Not high throughput, something fairly standard. The messages themselves are 10 KiB each. This is important. One publisher, one consumer, so a very simple setup, I would say. The thing to remember here is that there is a constant message flow, so we publish and consume messages constantly, and the throughput, both in and out, is 1 MiB per second.
You're used to seeing this, the RabbitMQ management UI. Okay, I can see my rate. Unfortunately, this is slightly cropped. At the bottom we can see the details: the master is running on RabbitMQ 0, the first slave is running on RabbitMQ 2, and the second slave is running on RabbitMQ 1. We can't see it on this slide, but trust me, that's what's happening. You can check this online to see it.
Why does Erlang distribution traffic matter?
You've already seen this dashboard. This is the Erlang distribution dashboard that shows you what is happening from an Erlang perspective - low level. All communication goes via the Erlang distribution. This is a view into the Erlang distribution. The thing which I would like to point out here is that these will show you the state of the various links between nodes.
Everything here looks healthy. All the links are up. There are six links. A RabbitMQ cluster is a complete graph which means that every node connects to every other node. Because we have three nodes, we will have six links.
But the thing which I would like to point out in this graph, other than that, and if you could see the top, is the data sent to peer nodes. Remember, we have 1 megabyte coming in and 1 megabyte going out. But at the cluster level, RabbitMQ 0 is sending 2 megabytes to RabbitMQ 2, which is the first slave. RabbitMQ 1 is sending 1 megabyte to RabbitMQ 0. RabbitMQ 2 is sending 1 megabyte to RabbitMQ 1. And RabbitMQ 0 is also sending 1 megabyte to RabbitMQ 1.
Why does Erlang distribution traffic matter?
What's all this traffic? Where is it coming from? We have master to mirror 1, 2 megabytes. Mirror 1 to mirror 2, 1 megabyte. Mirror 2 back to the master, 1 megabyte, which completes the ring. This is chain replication, and you can see its effects from a network perspective. We can also see that the master is publishing messages directly to mirror 2. The reason this happens is that it's not only your messages going around the cluster; the channel will also publish the same messages into the mirrors. Messages get published twice, in this case, so that if something fails you have another copy of the message around. What that means is that, for the 1 megabyte coming in and the 1 megabyte going out, you have 5 megabytes going around the cluster. That's a lot of data, okay.
Why does this matter? I mean, I just told you all these numbers, but why does it matter? Because there is an absolute limit to how much a single Erlang distribution link can push. And this limit, for 10-kilobyte payloads, is actually 470 megabytes per second. This is the absolute limit that a single link can push on OTP 22. If you're using a previous version it will be lower; it will be better on a later version. Now, that's interesting.
Erlang distribution link limit
So, we showed that graph, the one you've seen before, to the OTP team, the Erlang team. We told them, "Hey, we've discovered this. What do you think?" And they said, "Well, we knew there was a problem, but we didn't expect people to come across this limit. It's basically down to how much data we copy around. The way we copy data is not as efficient as it could be." Six hours later, Rickard Green did a PR which improves this. So that one graph helped the OTP team improve the Erlang distribution itself, how data is copied and transferred around. This limit is now higher.
Quorum queues. We've heard a lot about quorum queues. How are they different from mirrored classic queues? Same setup. Same thing happening in the cluster. This is the difference: we have only 1 megabyte going from the leader to one follower, and 1 megabyte from the same leader to the other follower. That's it.
Quorum Queue 2.5x less pressure
In summary, the quorum queue is putting 2.5x less pressure on the Erlang distribution because of the much better replication algorithm that quorum queues use. This is one way in which we can see that. From a network perspective, you will be transferring less between nodes using quorum queues while doing exactly the same thing that mirrored classic queues are doing.
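Putting the per-link numbers from these two slides into code (just restating the talk's figures, not a new measurement) makes the 2.5x claim easy to check:

```python
# Inter-node traffic for the same 1 MiB/s in, 1 MiB/s out workload,
# using the per-link figures from the dashboards (MiB/s).

mirrored_classic = {
    ("master", "mirror-1"): 2,    # channel publish + chain replication
    ("mirror-1", "mirror-2"): 1,  # chain replication
    ("mirror-2", "master"): 1,    # completes the ring
    ("master", "mirror-2"): 1,    # channel publishing into the mirror
}

quorum = {
    ("leader", "follower-1"): 1,
    ("leader", "follower-2"): 1,
}

classic_total = sum(mirrored_classic.values())  # 5 MiB/s on the wire
quorum_total = sum(quorum.values())             # 2 MiB/s on the wire
print(classic_total / quorum_total)             # 2.5x less pressure
```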
Help me understand Quorum Queue internals
We've seen the Erlang distribution. Maybe it was a bit more than you wanted to see. A bit too deep dive-ish but that's okay.
Now, we look at the RabbitMQ quorum queues, the internals. Again, to most of you, this may not mean a lot now but as you start using them, I think that you will want to know what's happening under the hood. Maybe someone in your team will have experience with Raft, or Erlang, or different things that can see this and they can say, “Oh, hang on. Actually, I can tell you what's going on.”
Log entries committed/s
We can see the log entries that are being committed. We've heard Carl talk about the log. We've heard how that works.
Here, we can see how many log entries are being committed per second, with a nice explanation of what that means and what is good and what is bad. You have a real-time view into the log operations on a node-by-node basis. If a node lags behind, or if a node is ahead, it's good to know. Not only that, but another thing which will come up again: low latency was mentioned. How long does it take for a log operation to be written to disk? This is it. We can see a rough distribution of the log writes and how long they take, with an explanation.
Here, at the bottom, we can see something very interesting. I won't cover all the dashboards. Here, we can see members which have logs and how they get truncated, so we can see like those lines going up and down. But, we can also see here a member for which the log operations keep going up. This is something worth watching out for because, eventually, you will hit a limit.
So, in this case, maybe there's no consumers and you have log operations being appended. Or, maybe you can have a consumer which keeps rejecting messages and the log keeps going up, and up, and up. It can't be truncated. This is what it looks like. Many reasons but it's worth knowing when this happens.
Hey, Rabbit! Where is my memory?
I think this is possibly one of the funniest questions to answer, from my perspective, and I almost gave a talk at this conference about this. There's a talk here, I won't have time, so I'll have to go quickly through this. But this keeps coming up. “Hey, Rabbit. Why did you eat all my memory? What is going on? I want to understand. I want to see what's going on.”
Erlang memory allocators
There's a dashboard that we built for this. It's called Erlang memory allocators. This is what it looks like. I wish we could fit everything here, but we can fit enough to point out the allocators by type. The binary allocator is right here in the corner; it's the green line. The binary allocator is, on average, using the most memory. All the messages flowing through your cluster, through your RabbitMQ, will be using this allocator to allocate and deallocate things.
We can see a breakdown of all the different allocators. We can look at specific allocators. What's being used. What's not being used. How it's being released to memory.
To an Erlang expert, this is gold. Why? Because they can really understand what is happening and they can dig deeper. To most of you, it may not make a lot of sense but it's very helpful when you need it.
Okay, we have eheap, the allocator Erlang processes use for their heaps, which can use a lot of memory. Again, this is fairly stable, so this is a good RabbitMQ cluster and a good RabbitMQ node.
We can see also ETS. ETS is Erlang Term Storage, all in memory. The RabbitMQ metrics are actually stored here. If you still use the RabbitMQ management and you query it, a lot of data will show up here. Especially, if you have lots of objects. The new system doesn't use this, so it will put minimal pressure on ETS. The other thing which puts pressure on the ETS is quorum queues. That's what gets used for the logs. So, this is one to watch.
Discover other dashboards
You can discover more dashboards. You can go to Grafana.com. We have an org, RabbitMQ, which has already published a couple. You can check them out. You can also make your own dashboards. A lot of the cutting-edge stuff that we haven't published yet, or are in the process of publishing, is in the RabbitMQ Prometheus repo. There is a make target you can run that generates an import-friendly dashboard, which you can import into your Grafana.
More CLI power to you
Now that you have a sense of all the new things you can start understanding, and all the new conversations we can start having, let's discuss observing RabbitMQ from a different perspective: the CLI. I think most of you are familiar with htop. It's a tool I really like to use. It's a bit like top, but slightly more colorful. It gives you a better idea of what's happening on your system. In this case, we can see memory, CPU usage, different threads. It's really helpful for getting a quick idea of what's happening on your system.
Observe Erlang VM / RabbitMQ v3.8.0
In RabbitMQ 3.8.0, there's rabbitmq-diagnostics observer. If you run this command, you get the equivalent of htop, but for the RabbitMQ node. This was actually backported as far as 3.7.15, so you may already have it. But what we suggest is to go to 3.8.0, because there are a lot of good improvements in there.
Now, unfortunately, we can't see all of it, but it will... in a minute... there we go. This is what it looks like when you run the command, for a single Erlang VM. It's very similar, ncurses-based. For example, here we can see all the Erlang processes running on the node, sorted by memory. Very useful. As well as schedulers and many, many other interesting things.
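A sketch of the invocation (the observer itself is interactive and needs a running node; the node name below is an example):

```shell
# htop-style, ncurses view into a running RabbitMQ node
# (shipped in 3.8.0, backported as far as 3.7.15)
rabbitmq-diagnostics observer

# target a specific node in a cluster
rabbitmq-diagnostics observer -n rabbit@rabbitmq-0
```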
We can also sort processes on a single node by the number of reductions, to see how hard they work. We can also go in and, for example, inspect a specific process. "Hey, what's happening in that quorum queue process?" All in the CLI. Very easy, just type some commands and we can inspect state. Again, for most of you this may be overkill, but for those that need it, it's good that you can have it so easily.
The other thing is we can, for example, look at ETS tables. What's stored? What’s taking the most space? All in your terminal. You don't even need RabbitMQ Management. You don't even need to go to UI. It's all in the terminal.
See RabbitMQ events
Another thing which many people are excited about, and I'll skip to the next slide because this is an interesting one, is where you can see events coming from RabbitMQ. There's a new CLI command that you can run and you can stream events that are happening in RabbitMQ - all the connections or disconnections, different stats. It's a good way of understanding what's happening in RabbitMQ, at the CLI. These are events so you can stream them wherever you want. If there is a system that's good for streaming events, send them there. Analyze them. Dice them. Slice them. This is very powerful.
But that's not all. There are so many other commands. There's a whole section on diagnostics and observability in rabbitmq-diagnostics. You can tail logs. You can see runtime thread stats. There are so many things you can see in the CLI, without needing any external system. Just RabbitMQ itself and a terminal.
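A few of the commands from that diagnostics section (names as in the 3.8 CLI; check `rabbitmq-diagnostics help` on your version, since all of these need a running node):

```shell
# stream internal events (connections, channels, policy changes, ...)
rabbitmq-diagnostics consume_event_stream

# tail the node's log from the terminal
rabbitmq-diagnostics log_tail
rabbitmq-diagnostics log_tail_stream

# runtime (Erlang VM) thread activity statistics
rabbitmq-diagnostics runtime_thread_stats
```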
What happens next?
How much time do I have? A lot of time? Excellent. I think we'll do a demo. Yes. I wasn't planning for this. I like this.
Anyway, let's talk about what happens next. We have a couple of things in mind. The first, which was already mentioned a couple of times: we would like to see more metrics. We already expose a lot of metrics, but some users are used to, for example, the RabbitMQ exporter, the community plugin that works with RabbitMQ to expose metrics in the Prometheus format, and it exposes a lot more metrics. There are certain metrics that people would like to see that we don't have yet in RabbitMQ Prometheus.
Expose more metrics
We would like to expose more metrics. While we do that, we would also like to address high cardinality. What that means is that RabbitMQ can expose a lot of metrics. This is something Rabbit itself has struggled with for many years: there's a lot of data, so what do we do with it? A lot of metrics will take up memory and disk.
If you want to slice and dice your data in all the fun and interesting ways, you will have a lot of metrics. How do you handle that? So, you want metrics. And, if you have metrics, there's a problem that there's too many metrics. This is like a catch-22, almost like you need them and you don't have them. It's very complicated, but we're trying to do the best we can and we're trying to address some of the limitations while still offering good metrics, accurate metrics, detailed metrics.
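As an aside, later RabbitMQ releases expose a configuration knob for exactly this trade-off, aggregated metrics by default versus detailed per-object ones (setting name from the rabbitmq_prometheus docs for later 3.8.x releases; verify it against your version):

```ini
# rabbitmq.conf - trade metric detail for cardinality:
# per-object metrics are more detailed but far more numerous
prometheus.return_per_object_metrics = true
```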
Build new dashboards
We'll be building more dashboards, because we know we're only just getting started with this. A lot of users have been asking about queues: "Okay, but how can I see my queue state?" That's something worth having, for sure. We would also like, and there's already a PR waiting for this, to understand what the Erlang microstate is doing under the hood. That's a very good one to have, to understand where the CPU time is going, from an Erlang perspective. When your RabbitMQ is spinning its wheels and you don't know why, these are the metrics you need to have.
See logs together with metrics
Very important, can we see the logs? It's okay to have metrics but can we also see the logs? Yes, we can and we would like to integrate logs with metrics. There's a system that we can use. It's also part of this landscape. It works well with Prometheus. It works well with Grafana. We can get all the metrics in the same place and we can see logs alongside metrics. Very helpful when we're trying to dig down into issues.
Try open telemetry
There is this thing called OpenTelemetry. It's part of the CNCF (Cloud Native Computing Foundation) landscape. There are some other projects in that landscape you may have heard of: Kubernetes is one of them; Prometheus is another. OpenTelemetry is one such project, focused on exposing metrics and logs.
Last 3 Takeaways
Now that we got towards the end, what's worth remembering from all this?
Have different conversations
The first thing worth remembering is to start having different conversations. It's not so much about metrics. It's about having the data required to have different conversations. I mean, people would tell us that something is wrong, and all we'd get is a log snippet or a screenshot. It's really difficult to help those people because we just don't have enough detail. With these metrics, and the tools needed to understand what is happening under the hood, we can have different conversations. We can actually help people, rather than telling them to upgrade, which is what happens a lot of the time, because that will fix some issues. It may fix your issue as well, it may not, but that's the best we can suggest. Everything else takes so much time, and we don't have that much time to dig and ask for logs, and metrics, and screenshots. It can take weeks to get to the bottom of anything, and it's very, very wasteful of everybody's time.
Aim to understand and improve
The goal is to understand and to improve. The goal is not to say, "Look at my metrics. I've arrived. I'm exposing every single metric." The goal is to understand what is happening under the hood. One of the traps we could fall into is exposing too many metrics and overwhelming you. That's why there isn't one dashboard we could give you that would be good enough; it would either have too little or too much. That's a real risk while we do this, and that's why we try to be conscious of, "What are we trying to show in this dashboard? What story are we trying to tell? What dashboards are we missing?" rather than trying to cram many, many things into one. It's just impossible to do that.
Also, you can't just get there. It's not a matter of, “Okay. I'm 100% observable.” There's no such thing. Also, there's no such thing as, “I'm more observable than system X or product X.”
We will improve RabbitMQ by doing this. We have already improved not only RabbitMQ but Erlang too. We've seen that pull request which is a direct result of a dashboard. We improved Erlang OTP by having the right graph, the one graph that shows what the problem is. Things are already improving. Things are getting better. And it's not just us, it's everybody around us.
Ask questions in the webinar
Now, I thought we wouldn't have time for questions, but apparently we do. Anyway, if you have deeper questions, or better questions that require longer answers, you can ask them in the webinar. On the 12th of December, we will go through this one more time, but from a different perspective, talking about operators and developers, and answering specific questions on how the new metrics system and the new observability tooling help answer those questions.
You can come and join. You can follow RabbitMQ on Twitter to see when the webinar will be out, sign up for it, and so on and so forth. And you can ask any questions you have, the longer ones, in that webinar.
Questions from the audience
Q: A little bit off topic but, maybe you can expand a little bit on the log aggregation part?
A: Okay. For log aggregation, we are going to look at Loki. Loki is a tool from Grafana Labs, part of the Grafana tool set. You have Grafana, which visualizes metrics from different data sources. One of the data sources you can have is Loki. Loki is a self-hosted product. You send all the logs to Loki, and then one of the data sources in Grafana would be Loki. Loki is all about showing you logs.
Other services you can use for logging are, for example, Stackdriver, which is also about logs. I think there's also Elastic. And, possibly, Splunk. The nice thing about Grafana is that it doesn't lock you into a specific data source or a specific provider. You can mix and match them, and choose whatever makes sense.
Right now, we're only using Prometheus, the open-source product which scrapes all the metrics. All the dashboards that we have today only use the Prometheus data source to show the graphs. Loki would be used, in my opinion, for the logs. Open source, self-hosted, same as Prometheus.
Q: Will those new Grafana dashboards work for the old Prometheus exporter plugin as well?
A: No is the short answer. The longer answer is they could, so we could adapt them.
Just to give you a bit of history. When we started looking at Prometheus support, we started by working on the existing plugin, the Prometheus RabbitMQ exporter.

We got to a point where we couldn't improve it any more without making some big, big changes, to the point where it would be a different plugin. So we thought, "How about taking the best bits and doing them in RabbitMQ itself, so that you don't have to download this other plugin, set it up, and then have dashboards that are more difficult to work with?" Rather than doing that, we built something that's a slimmer version and has dashboards that go with it.

And because we control both the plugin and the dashboards, it's much easier to keep them in sync. Why? Because if it was someone else's plugin, we would need to do a PR, so it would be a bit more difficult.
Now, will they work with the regular RabbitMQ exporter (which, by the way, is another plugin)? Same answer: no. But we would like to improve our plugin, the RabbitMQ Prometheus plugin, to expose metrics similar to those the most popular RabbitMQ exporter exposes, as a transition path. That would make sense.
Q: With classic queues, we had some statistics earlier; there's a lot less network pressure with quorum queues. I haven't really seen that posted as a feature of quorum queues, that you can get more data through them. Is there a reason why that's not a headline feature?
A: You can. You can get more data through them. I think, in our tests, quorum queues are roughly twice as fast as mirrored classic queues. But the reason we don't post these statistics is that it's all workload-specific.
So, twice as fast for what workload? Do you have a constant stream of messages? Are they all the same size? How many producers? How many consumers? Are the producers and consumers going as fast as they can? There are so many variables that it would be very difficult to state a default. But, on average, quorum queues are faster. They're doing less work. They also tend to use a bit more memory. Some people don't like that because RabbitMQ and memory, the relationship hasn't always been the best. It's getting so much better now, and it will only get better. But quorum queues do use slightly more memory.
Now, between 3.8.0 and 3.8.1, quorum queues use 25% less memory, as you saw in the keynote. That's just a sign of the improvements which we hope will come. Over time, as they mature, because they were only launched about a month ago, they will become more performant, more efficient, more mature, and better for a lot of things.
There are still workloads for which you should not use quorum queues. The good thing is we have a good documentation page which explains them, and when you shouldn't use quorum queues. They're not a silver bullet.
Q: I was just wondering. I'm sure I remember some user was using Influx, and there was a way of using Telegraf to scrape? I was just curious, because I know a lot of people use Influx and don't use Prometheus.
A: Thank you, Jack, for reminding me of something very important. The reason we chose Prometheus is not that it's a particular vendor, or a technology that we particularly like. That was not the main reason.
The main reason was that the Prometheus metrics format is becoming the OpenMetrics format. The different monitoring solutions are standardizing on what metrics should look like. It's called OpenMetrics. Go and check it out; it's part of the CNCF landscape. What it means is that everybody can use the same format, whether it's Influx, whether it's DataDog, whether it's Prometheus. It doesn't matter who you are, the format is standard, just as HTTP is standard and AMQP 0.9.1 is standard. It's the format in which we expose metrics, and we don't care which tool you use.
So, the Prometheus format is more than just for Prometheus. You can use Telegraf, connect it to your RabbitMQ Prometheus endpoint, and then get your metrics into InfluxDB. Same thing for DataDog. Whatever you're using, software-as-a-service or open source, you can do that. It's a format which is becoming a standard in the Cloud Native Computing Foundation landscape.
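For example, Telegraf's Prometheus input plugin can point straight at the RabbitMQ endpoint and forward the metrics to InfluxDB (plugin and option names from Telegraf's documentation as I remember them; the URLs are examples):

```toml
# telegraf.conf - scrape the RabbitMQ 3.8 Prometheus endpoint
# and forward the metrics to InfluxDB
[[inputs.prometheus]]
  urls = ["http://rabbitmq-0:15692/metrics"]

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "rabbitmq"
```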