In this panel debate from RabbitMQ Summit 2019 we listen to Dormain, Jack, Carl, Michael and Ayanda.

It's time to talk about the current state and future for the most widely deployed open source message broker in the world. Selected guests take part in a panel debate, including questions from the audience.

Moderator:

Dormain Drewitz. Senior Director at Pivotal. (Twitter)

Presentation of panel:

Jack Vanlightly. Software Engineer at Pivotal. Member of the RabbitMQ core team. (Twitter, GitHub, Website)
Carl Hörberg. CEO of CloudAMQP / 84codes. (Twitter, GitHub)
Michael Klishin. Software Engineer at Pivotal. Member of the RabbitMQ core team. (Twitter, GitHub)
Ayanda Dube. Software Engineer at Erlang Solutions. (Twitter, GitHub)

Panel debate: RabbitMQ and its future

Introductions

Dormain Drewitz: My name’s Dormain Drewitz. I'm with Pivotal. I've got a couple questions here to sort of get things going. And then, if we've got some time, hopefully, we can open it up to the room.

Jack Vanlightly: I recently joined the RabbitMQ team, the core team. I started with Rabbit about four years ago just as a developer, but got really into Rabbit. Then branched out into Kafka, so I became a RabbitMQ and Kafka specialist in consulting. And then, doing community testing in Rabbit and got to know the team. Suddenly, they decided, “Oh, he’s worthy of joining the team”. I’m not so sure but we’ll see [laughter].

Carl Hörberg: I'm Carl. I've been working with RabbitMQ for 10 years. Seven years ago, I created CloudAMQP.

Michael Klishin: I'm Michael Klishin. I'm a developer on the core team. I have been contributing to the first client libraries since 2009 - 2010, something like that, and RabbitMQ itself since 2013.

Ayanda Dube: I'm with Erlang Solutions. I've been working with RabbitMQ for four years now. I started off working with just-- well, a team from Erlang Solutions was working with Pivotal. So, I used to work with Michael a lot, and Diana. I'm now team leader at Erlang Solutions for RabbitMQ.

What makes RabbitMQ still relevant today?

Dormain: All right. We're going to start off with kind of a big, broad question about, you know, there's a lot of talk about cloud native architectures and technologies these days. So, what makes RabbitMQ still relevant today? And what makes it cloud native or not?

I'll pick on someone if no one jumps in.

Michael: All right. Well, I'm sorry, but I don't really understand what cloud native means. Pardon my ignorance. But, if we assume that it means (a) automation friendly, (b) available in all kinds of cloud, for lack of a better word, infrastructure product, you cover more than anything. Observable. Again, not everyone agrees on the definition but-- and yes, there is some work to do around scaling the various workloads. Yeah, we improved a lot of things since even 3.6. I think, in three out of four areas, it's there. There's also an ecosystem of projects, all kinds of tools Java, Reactor, Python libraries. So, there is an ecosystem on top of that.

For a project, which is almost 13 years old, I think that's pretty cloud native. But that's only my definition.

Ayanda: Yeah. I think, based on the technology that's used to implement Rabbit. So, looking at the Erlang distribution protocol, I think when you look at the cloud, some of the services which cloud providers promise, basically, ease of scale, spinning up a RabbitMQ cluster is literally seamless. It’s very easy. I think that’s the power of Erlang. So yeah, I think the backbone of Rabbit still makes it very relevant today in terms of cloud environments.

On use cases, serverless, microservices architecture

Dormain: Anyone want to chime in on use cases, serverless, microservices architectures?

Jack: Yeah, I mean, Rabbit is just extremely flexible. There are other messaging systems out there but there's none that are as flexible as RabbitMQ. You can get ones that are better at scaling. But, as we’ve seen from the CloudAMQP talk, most people don't need a million messages a second. They’ve got far fewer. And they actually benefit a lot more from the flexibility of RabbitMQ because you can go with Kafka but suddenly you've got a lot more complexity to deal with. Whereas, with RabbitMQ, you get such great decoupling between your publishers and consumers. It makes a lot of your architectural decisions simpler. It makes changing your architecture simpler. And, with CloudAMQP, you've got it in the cloud so you can take it with you in any cloud migration you go to. There are loads of automation solutions. It's coming to Kubernetes, so it’s still extremely relevant.

Carl: Yeah, I mean, we have a couple of thousands of customers with different use cases. It's everything from IoT, implementations, it's to financial firms. It's so varied, the amount of cases we're seeing. No one is doing the same with RabbitMQ. It supports so many different use cases. Everything from high throughput, to many connections, to very complex routing of the messages. It's much more versatile than many other message queuing options.

What are you most excited about in 3.8?

Dormain: We heard an update this morning about 3.8. It’s been available for about a month. Of course, I’m sure everyone here got that high on their radar and they've taken a look at it.

Each of you, if you had to pick one thing in 3.8 that you're most excited about - and there may be some overlap here, that's okay, you don't need to all come up with something unique, what are you most excited about that you're already playing with or looking forward to using more? Jack, we'll just start with you and go down the line.

Jack: Quorum queues, or single active consumer, or the new observability features. I mean, quorum queues are really cool. We’ve had mirrored-- or HA queues for a while now. Most people have been bitten by them at some point. Your replication protocol shouldn’t actually make it less safe than using reliable disks or something. So, I think, the quorum queues are a big improvement.

I think it's early days for quorum queues. This is the first iteration and it's just going to get faster, better use of memory, and it’s also opened the door. Now, we've got a Raft-based system for new queue types offering other replication sort of patterns.

So, I think quorum queues because it's just the beginning. We're going to have more queue types probably in the future. That’ll be mine.

Carl: One thing I'm excited about is that Quorum Queues is much more network efficient, like Gerhard talked about in his talk, I think it was.

Today, we're paying a lot of money for the intra-cluster communication. One message is published to one node and has to be copied to all the others, and back and forth. It's a lot of traffic, back and forth. Quorum queues solve that neatly.

And then, the other thing is the new Prometheus-based observability, which is much more lightweight. Not having to poll the management interface will also save us a lot of resources on our clusters.

Michael: Right, so I'm definitely going with monitoring, even though I appreciate a lot of work went into implementing Raft efficiently. And then, adopting it in multiple places. I saw that firsthand.

One thing that was mentioned in Gerhard’s talk is that, once we introduced all those metrics and a way to easily deploy new clusters with Prometheus and Grafana, and share those things, the rate of experimentation and change just blew up, all the way to contributing to Erlang. But, there are certain things that I did not mention in my keynote. For example, in just a month, we improved-- rather, we reduced peak quorum queue memory usage by 25%. How did this happen? It's not because someone just woke up one day and was like, “Hey, I have this brilliant idea.” It's because Jack had a lot of load tests that had monitoring hooked up and we noticed certain things. Carl managed to correlate some things. Narrow it down. Try a few things. Boom, you have a 25% improvement. If we didn't have any metrics, we wouldn't have this improvement. It's hard to overestimate the power of that, in my opinion.

Dormain: Strong case, Michael. Ayanda, how about you?

Ayanda: Okay, speaking from the Erlang and Elixir community, I think, looking at the Quorum Queues use case, I think we've now got a solid development framework which is based on Raft, the Ra Library. I think a lot of developers can now take that and basically use that to build their own distributed systems.

In terms of the use case, they already have the Quorum Queues which is already in production in some systems, right. So yeah, I think from the community, that's a very big contribution from people.

Upgrade process

Dormain: Okay. Well, no one asked me. But since no one said feature flags, I'll put in a vote for feature flags.

Okay. Let's talk a little bit about that upgrade process because as we saw, with the data from the CloudAMQP team, two thirds of that sample are on 3.7. But, that means that there's still a third of folks out there on older versions, maybe, as you said, Anders, “If it's not broken, why touch it?” But, there have been a lot of improvements in the latest releases and all this new stuff in 3.8.

When looking at the upgrade process, what would you highlight for folks, either that you kind of know from past upgrade processes or things that are starting to change with the new features? What would you highlight for folks to help improve that process for them?

Jack: Well, one would be, if you’re running a cluster, make sure you provision it properly. If you want to use feature flags in the future and have a rolling upgrade of your service, and you can't take the loss of one node because you're already at capacity, then it's not going to go very well. So, always make sure your cluster is provisioned well enough to cope with the loss of a node.
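The provisioning point can be checked with simple arithmetic before a rolling upgrade. The sketch below is only an illustration: the function name, the load figures, and the assumptions that every node has the same capacity and that load spreads evenly over the remaining nodes are all hypothetical simplifications, not something stated by the panel.

```python
def survives_node_loss(node_loads, node_capacity):
    """Return True if the remaining nodes can absorb the load of one lost node.

    node_loads: current load per node, in any consistent unit (e.g. msgs/s).
    node_capacity: what a single node can sustain, in the same unit.
    Assumes identical nodes and an even redistribution of load.
    """
    if len(node_loads) < 2:
        return False  # a single node has nowhere to shed its load
    total = sum(node_loads)
    remaining_nodes = len(node_loads) - 1
    return total / remaining_nodes <= node_capacity


# Three nodes at 6,000 msgs/s each: 18,000 / 2 = 9,000 per remaining node.
print(survives_node_loss([6000, 6000, 6000], node_capacity=10000))
# Three nodes at 9,000 msgs/s each: 27,000 / 2 = 13,500 per remaining node.
print(survives_node_loss([9000, 9000, 9000], node_capacity=10000))
```

A real check would use measured metrics (memory, disk, CPU, connections) rather than a single throughput number.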

I guess, Carl, you've probably got the most experience with upgrades with all your thousands of customers.

Carl: Oh, yeah. We see that a lot.

It's a great improvement with the feature flags that we can now do rolling upgrades between minor versions. Still can't do it between Erlang versions - but that’s not so critical.

Historically, we always recommend the customers to be sure to do a Blue/Green deployment, to have another cluster that they move over, to use a queue federation to move the messages to the new cluster with the new version. That's the easiest. It’s often the most reliable way to do it. Then, they can easily roll back and all that.

But our service, of course, provides automatic upgrades and we can do rolling upgrades. But, when you are at capacity, and people don't realize that sometimes, then that causes some big problems. They're already using way too many resources, then a node is taken down, and then the load moves to the other nodes - the remaining nodes. And then, they get overloaded. And then, it's just a mess. And then, the queue synchronization and all that. Hopefully, that's better now with quorum queues.

Michael: Yeah, I don't really have much to add. By the way, I completely respect the point of view that if something isn't broken, then you don't have to touch it. Just keep in mind that there is this thing called security vulnerabilities. If you don't touch a system that is vulnerable, well, you're running a certain risk. I mean, if you're a financial institution, that's probably also a regulatory risk.

In my opinion, with all of these improvements which, again, by the way, took a lot of engineering effort, you can always experiment with a Blue/Green upgrade, even on a loaded system, because you're bringing up something new. You switch half of your traffic over. If something goes wrong, you still have your original system to go back to. In my opinion, that's a very low risk option.

Again, I understand that it was difficult, maybe a few years ago, where the degree of automation, various open source data services were not necessarily at the same level. These days, a new cluster is like one API call away. So, you can definitely try it. But, of course, we’d love people to adopt 3.8 sooner. I think there is substantial improvement to make that appealing.

Ayanda: Okay. I think, based on most of the customers that we worked with, one of the most common requests is, “How do we upgrade with minimal downtime?” So, I think looking at the work that's coming in 3.8, being able to do upgrades with mixed version nodes. I think that's a step in the right direction. Looking at what 3.7 brought in with the rolling upgrades where nodes don't time out, based on the ordering--

Michael: Random restart order.

Ayanda: Yeah, random restart order. So, I think, looking at what each major version is bringing in, in terms of answering that question of improvements around upgrading with minimal downtime, yeah, I think that's a really good addition in 3.8.

Performance

Dormain: Okay. So, for our next question, we're going to sort of shift and focus in on a couple of particular hotspot areas. Performance is one. What are some of the considerations or approaches that each of you take when thinking about performance and how to optimize for it? Let's start with you, Ayanda, and kind of work our way back.

Ayanda: Okay. So, I think, around performance, referring to the talk from CloudAMQP as well. If you look at how users make use of a Rabbit node or a cluster, we've noticed that it’s becoming a discipline in itself, being able to design and optimize the usage of a RabbitMQ cluster in the right way and according to best practices for optimal performance. We've come up with some design approaches. So, designing for memory, designing for disk usage.

Yeah, so we’ve, I guess, we probably need to blog about some of these things.

Dormain: I'd love that.

Ayanda: Yeah, we’ll write some blogs around that. It's becoming, I guess, a field of focus on its own.

Dormain: Michael?

Michael: I'm not going to give a lot of recommendations, in part because Jack is so dedicated to that, and in part because I don't know what your workload is like.

On our team, we believe that any performance optimization work should start with metrics because even people in our team cannot just take a look at a cluster and immediately say what's going on unless it's like a single queue cluster which is a very infamous workload. RabbitMQ was not designed for single queue workloads, so certain benchmarks bottlenecked in a single core.

Once you understand that, a node can use multiple cores reasonably well, usually, if you have more connections and queues. There is a doc guide on runtime tuning. You can optimize things that way.

As far as going multi-node, it's all about queue distribution and data locality. Data locality is something that is hard because we implement protocols that already exist. They are, at least, not too vendor specific. It's a good question, how do we improve data locality while still implementing protocols that a lot of vendors implement? I don't know how many of you have been on a consortium or anything but reaching consensus and changing things sometimes can be very hard. It takes a long time or can even be impossible. And it's not enjoyable, the process. So, I think that's an open-ended question. What can we do to improve data locality in clients and at least some protocols given that evolving protocols is very painful?

And lastly, from quorum queues to internal communication-- well, we keep eliminating certain bottlenecks that shouldn't be there. They're just implementation details. But, all of that is only possible because we have some data to work with and so should you.

Well, if someone comes to me and asks, “Hey, how do I make my system-- how do I improve throughput here?” My first-- like I have 10 questions, basically. Tell me what your system does, and why, and what are the metrics. I cannot help you, otherwise. I think it should be telling that if people, who develop RabbitMQ, will ask for data. First and foremost, maybe your team members should take the same approach.

Carl: We used to have a lot of performance problems but, in the last couple of years, RabbitMQ has improved a lot. Earlier, it was much easier to run out of resources on the RabbitMQ cluster. Nowadays, it’s mostly that they're running out of memory. They have millions of un-acked messages, or maybe too many connections, or with the classic queues, it’s easy to run out of memory in certain scenarios.

Data locality has been a problem. It hasn't been very efficient at it. The clients don't know where to connect, or they connect to any node and it might not be the leader, so the traffic has to go through a lot of nodes and there is a lot of traffic duplication, which can be costly in public clouds. Right now, there's not much to do about that because the clients are not aware of where the queue leaders are.

But, as we've seen, most of our clients are not using high throughput - publishing a lot or consuming a lot. Normally, we don't see the cluster run out of CPU. But if they're using certain extensions like deadlettering and they’re doing funny things like deadlettering a whole queue at the same time for some kind of scheduling mechanism and then they dump 1 million messages into another queue, then that can be a problem too because then RabbitMQ does not have the normal credit based system, or at least did not use to. So then, they can easily overload the cluster. It’s often kind of hard to see that you're doing that or where the problem is. But then, you see it in the inboxes of the Erlang processes.

Jack: It's quite difficult to predict performance of RabbitMQ. Part of the problem is the number of features. We've got features that you combine with other features. You've got a bunch of exchanges that you combine with different types of queue. And you can have different message sizes. You can have hard drives, SSDs. You can have some network latency or none.

I mean, there are more configurations and feature combinations than atoms in the universe, probably. In the end, it's just a combinatorial explosion of things. So, my recommendation is to use Perf Test and try things out because you might think, “Well, I've got one node now. I'll add another two and surely it will go faster.” That's not always actually the case. It might go faster, but it's not necessarily going to be a linear improvement. And so, you can spend a lot of money adding more servers and faster disks and whatnot, but if you're not addressing the actual bottleneck for your particular workload, then you can be throwing money away and not get the performance you expect.

So, the best thing is to use some kind of load test, like Perf Test. Measure your specific workload and try to kind of customize based on that. And there are loads and loads of optimizations that you can do all over the place.

Michael: One comment on the feature combination explosion. That's one part of why we have a queue type abstraction. You can expect more queue types in the future to cover certain use cases, but also make them narrower. Quorum queues already don't try to do everything that classic queues can, in part, because we would have to give up some of the gains. But also, certain features, they just don't make any sense in the context of Rabbit.

You can expect more focused queues like that. The smaller the surface is, the easier it is to optimize. In fact, maybe that’s what you start with, like you want to optimize for a certain case. So that's our new strategy, if you will. Having one queue type with half a dozen properties.

There are limits to how far you can optimize that. Some combinations, honestly, don't make sense. Like, if your queue is both durable and exclusive, what exactly are you asking for?

So, it will, I think, naturally get better or at least easier to make a decision like, “What queue type do I need?” as we ship more queue types.

RabbitMQ clients

Dormain: Let's switch over and talk a little bit about RabbitMQ clients. What can we do to make clients behave better? Let's start with you, Carl.

Carl: Well, one thing we see many clients or customers struggle with is when a node goes down. Normally, many clients have automatic reconnect but some do not. What to do when you get that exception? What is a good reconnect strategy? What to do with the messages that are coming, that you want to publish?
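One commonly used answer to the reconnect question is exponential backoff with jitter. The following is a minimal sketch under stated assumptions: `connect` is a hypothetical zero-argument callable standing in for whatever a real client library exposes, and the delay values are arbitrary. It is not how any particular RabbitMQ client implements recovery.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Upper bound, in seconds, for the wait before retry `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def connect_with_retry(connect, max_attempts=8, sleep=time.sleep, rng=random.random):
    """Call `connect()` until it succeeds or `max_attempts` is reached.

    `connect` is assumed to return a connection or raise ConnectionError.
    `sleep` and `rng` are injectable so the behaviour can be tested.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Jitter: wait a random fraction of the capped exponential delay so
            # that many clients do not all reconnect at the same instant.
            sleep(backoff_delay(attempt) * rng())
```

The jitter matters most right after a node restart, when every client of that node would otherwise retry in lockstep.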

So, what we've seen other clients sometimes do, in other message queue systems, is that they have like an internal queue in the client, and then offline, right, they even have offline support to write it to the disk. And then, when the application comes up again, it reads from the disk and starts publishing it, if the server then is up or the cluster then is up. That's one area of improvement we could see in some clients, maybe.

Michael: We call that feature client side write-ahead-log. What happens is, if a node fails, yes, you can recover. Well, the clients that I maintain, they can recover. Everything else, it's not my problem [laughter]. But there is a window of time between recovery that you still have some events happening in your system. Your publishers need to be able to handle that.

It’s something like Project Reactor. It will do that for you. But Project Reactor only exists for Java. There is no Python version. So, what do you do? So, one design decision is indeed you can introduce basically a buffer in your client, which accumulates messages, and then when connection restores, it just publishes that.

There are several issues with that. I don't know if we ever announced publicly. One is, you have to be mindful of how much memory you have because, at a certain rate with larger messages, you will quickly run out of it.

“Is it okay to drop data?” is another question. Is it okay to republish some messages? Not every operation is important.
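The buffering idea and its trade-offs can be illustrated with a small in-memory sketch. Everything here is hypothetical: `BufferedPublisher`, the `publish` callable, and the drop-oldest policy are illustrative choices, the limit is counted in messages rather than bytes, and a message whose publish failed midway may be sent again after recovery.

```python
from collections import deque


class BufferedPublisher:
    """Buffers messages in memory while the connection is down."""

    def __init__(self, publish, max_messages=10_000):
        self._publish = publish                       # assumed to raise ConnectionError when down
        self._pending = deque(maxlen=max_messages)    # a full deque discards its oldest item
        self.connected = True
        self.dropped = 0                              # how many messages were lost to the limit

    def send(self, message):
        if self.connected:
            try:
                self._publish(message)
                return
            except ConnectionError:
                self.connected = False
        if len(self._pending) == self._pending.maxlen:
            self.dropped += 1                         # the append below evicts the oldest message
        self._pending.append(message)

    def on_reconnect(self):
        """Flush buffered messages in order; stop if the connection fails again."""
        self.connected = True
        while self._pending:
            try:
                self._publish(self._pending[0])
            except ConnectionError:
                self.connected = False
                return
            self._pending.popleft()
```

Whether dropping the oldest message, dropping the newest, or blocking the caller is acceptable depends on the application, which is exactly the question raised above.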

Lastly, speaking of cloud native, in a lot of modern environments, the use of local disk is either impossible or highly discouraged. What Carl suggested, and I completely agree that it's a good idea, would simply not work in this brand-new world of cloud native everything. So, what do we do? It sounds like we need three features to address this. And we have to do it for every client. So that's more inadequate features than you have people on our team, unfortunately. I think that's the main bottleneck but we will do it soon enough, starting with I don’t know. You use a lot of Ruby, right? Well, we can experiment with that.

Ayanda: I think the main gap, which we have seen with Rabbit clients, is that for most users there’s a knowledge gap in understanding the AMQP protocol fully in order to use it according to best practice with a RabbitMQ instance or a cluster. I think an improvement that can be done on clients is probably some more automation for users, around things like connection pooling because TCP connections contribute a lot to the memory usage on the server. If users don't have to always think about, “How can we efficiently create new connections, reuse connections?” That could help.
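As a rough illustration of connection reuse, a client could keep a small, fixed set of long-lived connections and lend them out. The sketch below is an assumption-laden example: `ConnectionPool` and the `factory` callable are hypothetical, and a production pool would also need health checks and replacement of broken connections.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hands out a fixed number of long-lived connections for reuse."""

    def __init__(self, factory, size=2):
        # `factory` is any zero-argument callable that opens one connection.
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)                # return it to the pool instead of closing it
```

With this shape, application code writes `with pool.connection() as conn:` and the number of TCP connections on the server stays bounded by `size`.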

I think automation around, I guess, setting up the internals - the internal fabric of a node. Which type of exchange, for example, and the bindings and queues, how to set that up. We’ve done some of this work, with some customers, create an automation layer for them. And all they want is just do a single function call and, behind the scenes, everything is set up for them. I think if we can have things like this, out-of-the-box, that could be useful.
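The "single function call" idea can be sketched as a declarative description of exchanges, queues and bindings that one helper applies. This is only an illustration: `setup_topology` and the dictionary layout are hypothetical, the channel is treated as any object offering `exchange_declare`, `queue_declare` and `queue_bind` (names that follow the AMQP 0-9-1 operations), and a recording stand-in is used here instead of a real broker connection.

```python
def setup_topology(channel, topology):
    """Declare exchanges, queues and bindings described by `topology`."""
    for name, kind in topology.get("exchanges", {}).items():
        channel.exchange_declare(exchange=name, exchange_type=kind, durable=True)
    for name in topology.get("queues", []):
        channel.queue_declare(queue=name, durable=True)
    for queue_name, exchange, routing_key in topology.get("bindings", []):
        channel.queue_bind(queue=queue_name, exchange=exchange, routing_key=routing_key)


class RecordingChannel:
    """Stand-in channel that records calls, so the sketch runs without a broker."""

    def __init__(self):
        self.calls = []

    def exchange_declare(self, **kwargs):
        self.calls.append(("exchange_declare", kwargs))

    def queue_declare(self, **kwargs):
        self.calls.append(("queue_declare", kwargs))

    def queue_bind(self, **kwargs):
        self.calls.append(("queue_bind", kwargs))


# Hypothetical topology: one topic exchange, one queue, one binding.
orders = {
    "exchanges": {"orders": "topic"},
    "queues": ["orders.created"],
    "bindings": [("orders.created", "orders", "order.created.#")],
}
channel = RecordingChannel()
setup_topology(channel, orders)
print([name for name, _ in channel.calls])
```

Declaring exchanges and queues before bindings matters, since a binding refers to both.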

Dormain: Jack, did you have any comments?

Jack: Yeah, I just think that most developers don't need to know about AMQP. They shouldn’t be forced to really know how it works. Most developers are very busy. They just won't learn it anyway. I think some libraries could probably be a bit more simplified, a bit more high level.

I'm playing with some of the Python clients and trying to set up a nice asynchronous publisher that can-- you know, all the callbacks can be a bit complex. I think there's probably room for improvement for some of the high-level libraries to abstract AMQP away a bit more and just simplify it for developers. I mean, that's for your average case.

Obviously, there's going to be cases where you want to leverage the protocol and understand it for better performance, but I think most people don't need that level so.

RabbitMQ community

Dormain: Okay, so let's step back and look at kind of the RabbitMQ community at large. There are some things that work really well. There's probably some areas that things could work better. You can look at things by way of vanity metrics or you can dig a little bit deeper. But how would each of you kind of rate or assess the health of the RabbitMQ community today?

We'll start with Michael this time.

Michael: Sure. So, two metrics that I use, because I'm constantly exposed to them, are mailing list activity and pull requests on GitHub. As far as I can tell, both are steadily growing - not very fast. Again, it's a 13-year-old project.

There are more people on the mailing list who are not on our team and still fairly active. That is a positive sign to me. That said, I think that Google Groups, the mechanism, so to speak, we use, a lot of people are not very happy with it. For one, it's a Google product with an absolutely awful search. Go try to find a thread from five years ago. Even if you know all the keywords, it just doesn't show up. I don't know what's going on. Maybe they don't store the data for that long. But Google Groups search is absolutely awful and I have to use it when I have to dig up a link to an existing discussion, almost every day.

Another problem is that Google services are blocked for over 1 billion people, so we will probably introduce a separate sort of forum. The Google Group is going to stay because it does have a lot of content in discussions, but we will have something else with search which is not embarrassing and that, hopefully, won't be blocked for a pretty substantial fraction of the Earth’s population.

And speaking of pull requests, I don't know, it varies greatly between projects. Some get pull requests every day. Others, relatively rarely. But I see a positive trend, personally. Hopefully, that's because it's easy enough to contribute.

Ayanda: I think, from our side at Erlang Solutions, I guess, that will come from our sales team and business development. I've got some. I guess, maybe we need to publicize some of those numbers at some point. But there are a lot of inquiries around RabbitMQ. Most of these, I mean, if you ask them, “Are they on the mailing list?” Most of these are not even on the mailing list. So, even outside the scope of the metrics which Michael and his team use, there’s still a whole lot more users who are getting in touch through different channels with Erlang Solutions.

Dormain: Jack? Carl?

Jack: I mean, from working as a consultant, going into a place, invariably, they've got a messaging system in there, somewhere, used to a lesser or greater extent. The big names are RabbitMQ, Kafka and the cloud offerings, sort of. RabbitMQ is one of the big names and it still is. I mean, it's used widely. So, just based on usage, I would say it is.

Carl: I mean, we see continuous growth as more and more people want bigger clusters, and more customers, and new clients from all over the place. It's definitely a popular product. Getting so, more and more.

We have an Apache Kafka offering too but it's not seeing as much growth as RabbitMQ, which can be for different reasons, but that's what we see from our perspective.