In this talk from RabbitMQ Summit 2019 we listen to Michael Klishin and Diana Parra Corbacho from Pivotal sharing news about the new version of RabbitMQ.
Hear from the RabbitMQ core engineering team about what the near-term updates to 3.8 will include: making RabbitMQ more protocol agnostic, easier to manage and scale, along with changes to the schema database.
Keynote: An update from the RabbitMQ team (version 3.8)
The agenda for this presentation is the present state and future of RabbitMQ 3.8, a special mention of the next-generation RabbitMQ 3.9, and finally an update on our 4.0 project called Mnevis.
The State of RabbitMQ 3.8
Today, RabbitMQ 3.8 offers a nice set of features with improved operations. The main addition is a new type of replicated queue, known as quorum queues. With this version, we expect to address some of the issues users have run into with classic mirrored queues. Quorum queues are not a drop-in replacement for mirrored queues, however; the two queue types will co-exist.
In this context, we have also done quite a bit of work to simplify upgrades. It is now possible to run both versions simultaneously, so you can transition seamlessly from 3.7 to 3.8 without taking down the entire cluster.
The most notable improvement, however, is metrics aggregation and visualization. The new built-in Prometheus plug-in allows you to export all of the metrics to Prometheus and visualize them with Grafana. This has a really tiny impact on the cluster, far smaller than the management UI, and it is more powerful and configurable. If you still like the management UI, you can use Prometheus while keeping management enabled.
While doing all of this work on metrics, we have also added a few new ones. This release also adds OAuth 2.0 support for authentication, not only for clients connecting to the broker but also for the management UI and API.
As in basically every release, we have worked more on the command line tools, and we now have revised health checks that can be used to monitor system health. There is also a new feature available for most queue types, both classic and quorum: single active consumer. Several consumers can be attached to a queue, but only one of them consumes messages at any point in time. The rest of the consumers wait, and one of them takes over when the active consumer dies.
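The takeover behaviour described above can be sketched in plain Python (this is an illustrative model, not RabbitMQ internals; in RabbitMQ itself the feature is enabled with the `x-single-active-consumer` queue argument):

```python
# Minimal sketch of single active consumer semantics: consumers attach
# in order, only the first is active, and the next in line takes over
# when the active one goes away.
from collections import deque

class SingleActiveConsumerQueue:
    def __init__(self):
        self.consumers = deque()  # attachment order is preserved

    def attach(self, consumer):
        self.consumers.append(consumer)

    @property
    def active(self):
        # The head of the line is the only consumer receiving messages.
        return self.consumers[0] if self.consumers else None

    def consumer_died(self, consumer):
        # Remove the dead consumer; the next attached one becomes active.
        self.consumers.remove(consumer)

q = SingleActiveConsumerQueue()
q.attach("c1"); q.attach("c2"); q.attach("c3")
print(q.active)        # c1 consumes; c2 and c3 wait
q.consumer_died("c1")
print(q.active)        # c2 takes over
```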
Classic Mirrored Queues and Quorum Queues
Why did we implement quorum queues? If you have used mirrored queues before, you know that they are not very resilient to network failures: the way they recover, or fail to recover, is not very predictable. They are also not very efficient, because in the internal implementation all the replicas of a queue are connected in a ring topology, which is essentially a linear algorithm. Every time you publish a message, you have to wait for it to travel around the whole ring before it can be confirmed.
Quorum queues, on the other hand, are based on the Raft consensus algorithm. The way they recover from failures is predictable and well defined. The way a leader (previously called a master) is elected is also well defined and takes very little time. If a follower (previously called a mirror) goes down, recovery is quick and the amount of data transferred is small. While a follower is syncing, the queue remains available. This was not the case with mirrored queues: while a mirror was syncing, the queue was unavailable, sometimes for quite a long time if it had a big backlog. There is no ring topology anymore; replication of messages between replicas happens in parallel, which is much faster.
We have already achieved a higher throughput than with mirrored queues in this first implementation. On the other hand, quorum queues use more memory, because the data is kept both in memory and on disk, so there are some trade-offs. The reliability properties of the protocol also make it better suited to features such as multi-datacenter replication, which is something we may see in the future.
How is the Quorum Queues Different?
Quorum queues guarantee consistency and recover from failures in a more predictable manner. They have a better data safety model: all data is always written to disk, and every single message is fsynced before it is confirmed. The trade-off is higher latency compared to mirrored queues, because quorum queues choose consistency over availability. If you want your data to be safe, this is undoubtedly the option to choose.
To achieve this consistency, a majority of the nodes in the cluster must be available, so we recommend a cluster of at least three nodes. The features quorum queues provide are very similar to those of mirrored queues, but not everything is available. For example, transient or exclusive quorum queues don't make sense: you choose quorum queues precisely because you want your data to be consistent all the time.
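The majority requirement also explains the three-node recommendation. A small sketch of the arithmetic:

```python
# A quorum queue can confirm writes and elect a leader only while a
# majority of its replicas are reachable. Three nodes is the smallest
# cluster that still has a majority after losing one node.
def majority(cluster_size: int) -> int:
    return cluster_size // 2 + 1

def tolerated_failures(cluster_size: int) -> int:
    return cluster_size - majority(cluster_size)

for n in (1, 2, 3, 5):
    print(n, majority(n), tolerated_failures(n))
# 1 node  -> majority 1, tolerates 0 failures
# 2 nodes -> majority 2, tolerates 0 failures (no better than one node!)
# 3 nodes -> majority 2, tolerates 1 failure
# 5 nodes -> majority 3, tolerates 2 failures
```

Note that a two-node cluster tolerates no failures at all, which is why one or two nodes are not useful for quorum queues.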
Some of these features are already implemented, such as dead letter exchanges and queue length limits; others, like TTLs, are not there yet. We have also added a new feature that is not available in classic queues: poison message handling. When a message has been returned too many times without being acknowledged, a policy (delivery-limit) lets you decide to dead-letter or drop the message, so that it doesn't keep polluting your system.
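As a sketch, such a policy could be set from the CLI like this (the policy name and queue-name pattern here are made up for illustration):

```shell
# Dead-letter or drop a message after 5 redeliveries, instead of
# letting it poison consumers forever. Applies to queues whose
# names start with "qq.".
rabbitmqctl set_policy poison-handling "^qq\." \
  '{"delivery-limit": 5}' --apply-to queues
```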
Quorum Queues Use Cases
As we have seen, there are differences between mirrored queues and quorum queues, and the use cases for quorum queues are a bit more specific. You will choose quorum queues when you want durable queues with persistent, replicated messages and require strong consistency for your data. They require a cluster of at least three nodes; a single-node or two-node cluster will not do.
Quorum Queues: When Not to Use
If you can accept higher latency, quorum queues are a good choice; if you need a low-latency system, it is better to go for mirrored queues. This consistency also requires the applications to do their part: they need to use publisher confirms, consumer acknowledgements, and mandatory publishing. If they don't, there is no reason to use quorum queues, as we cannot provide these guarantees.
Quorum queues are also not meant to hold very long backlogs or to serve as long-term storage, because of the way the system guarantees consistency: it keeps a log of all operations. When messages are stored for a long time, we have to keep all the data from that point in time, even for messages that have already been acknowledged by a consumer and are out of the system.
If you want transient queues, you don't need quorum queues either, and the same goes if you need very low latency, as discussed above. If your system doesn't have a lot of memory, quorum queues may not be the best option, but you can still use a configuration option to limit the amount of memory used for the queue state. The default configuration keeps everything in memory, but that can be limited. You should try it and test it for your workload.
The next major feature in RabbitMQ 3.8 is simplified upgrades. We introduced a subsystem called feature flags, which nodes use to communicate their capabilities to their peers. For example, if you cluster a 3.7 node together with a 3.8 node, all nodes will behave as closely as possible to 3.7. Once all of the cluster nodes are upgraded and the operator manually enables certain feature flags, the new 3.8 features become available on all the nodes.
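In practice, the operator step looks roughly like this (a sketch; `quorum_queue` is one of the 3.8 feature flags):

```shell
# See which feature flags exist and their state on this node:
rabbitmqctl list_feature_flags

# Enable a specific flag once every node in the cluster runs 3.8.
# Enabling is one-way: a node with the flag on can no longer
# cluster with a 3.7 node.
rabbitmqctl enable_feature_flag quorum_queue
```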
As stated, this feature is meant to simplify upgrades. We do not expect you to run mixed-version clusters for days or months, but for several hours; how long you run one is now up to the operator. Blue-green deployments are still there if you want to use that approach, but rolling upgrades are reasonably easy now (we hope), so most people will use them by default. Unfortunately, this does not mean that at no point in the future will we introduce a breaking change that requires a cluster-wide shutdown: one of the complications in distributed systems is that sometimes you need to change the behavior of all nodes at once. In that case, you still have the blue-green deployment option.
Built-in Prometheus and Grafana Support
This happens to be the favorite feature of some of our team members: built-in support for Prometheus and a set of Grafana dashboards to go with it. It's a plug-in that needs to be enabled; nothing to configure, really. All you need to do is point your Prometheus instance at it and install some Grafana dashboards from Grafana.com, and you should be good to go.
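Getting started looks roughly like this (hostnames are placeholders; the plug-in serves metrics on port 15692 by default):

```shell
# Enable the built-in Prometheus plug-in on each node:
rabbitmq-plugins enable rabbitmq_prometheus

# The node now exposes metrics in the Prometheus text format.
# Point your Prometheus scrape config at this endpoint and import
# the RabbitMQ dashboards from Grafana.com:
curl -s http://localhost:15692/metrics | head
```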
This is basically the counterpart of the management UI's overview page. It has more or less the same data, except now you have relevant data for all the nodes in one place. Previously, the overview page had some cluster-wide information, but to see certain other metrics you had to open multiple tabs with node-specific stats. These Grafana dashboards give you a cluster overview, and it is much easier to see when, say, one of the nodes has a disproportionate number of client connections, and things like that.
Why do we want to use Prometheus and Grafana? The management plug-in that we have today couples your monitoring solution with the system being monitored, which is not necessarily optimal. This way, we decouple those two aspects. The Prometheus plug-in is significantly smaller and simpler, so it has lower overhead: it has to do a lot less work since it performs no aggregation. Metrics can be stored for any period of time that Prometheus and your disks allow for. The user interface is much more powerful, and you can share your dashboards: if you want to share some metrics with a different team, you don't have to mess with screenshots and all of that.
Who has access to the metrics is orthogonal to RabbitMQ permissions. Since each node serves only its own metrics and doesn't do cluster-wide aggregation, failures of other nodes do not introduce timeouts for scrape requests: a node only has to serve its own data, so the probability of a timeout is significantly lower.
We decided not to stop there, as there are people on our team who really love metrics, so they added more, for example around memory management and internal communication stats. Those things were previously visible only if you knew where to look in the management UI or which relatively obscure tools to use. We also have Raft metrics, and if you would like to see more metrics, please talk to a member of the core team.
The memory management dashboard shows all the various heap types, their breakdown per node, allocation and deallocation rates, and all kinds of other interesting and important details, such as what the kernel sees versus what the runtime sees. You can just install this for free.
This is a set of panels related to Erlang distribution. You can see traffic rates between nodes and whether they have any backlog in the internal communication buffers, as well as which nodes were connected to which other nodes over time, and which were trying to connect, for example. This provides an insight into the system state that simply was not possible before. In fact, last month we saw some early users on the mailing list find irregularities in their networking setup exactly this way; we credit that insight to this set of panels.
This continues with even more metrics, and you can see that some of them are in red. If they spike above a certain level, which you can configure, it means the system may need the operator's attention. When things are green, your systems are hopefully in a good state.
If you really want to study Raft's behavior, especially under load, this is the set of dashboards our team uses to evaluate system behavior under stress tests.
Areas of future RabbitMQ 3.8
With version 3.8 released, you can expect more features to arrive before we ship 3.9: even though 3.8 is out, we still haven't finished working on it, and it is not only about bug fixes. We are working on RabbitMQ for Kubernetes from Pivotal, and, as you know, we are doing quite a few quorum queue improvements; these are new features, so things are still maturing. Between the 3.8.0 and 3.8.1 releases, we already managed to lower the memory footprint by as much as 25 percent for some workloads, and CPU usage is about 10 percent lower for some use cases.
We have also released a new CLI command in our latest patch release that you can use to re-balance queue leaders across the cluster. After a rolling upgrade, all your leaders can end up on the same RabbitMQ node; this command redistributes them across the cluster so your load is more balanced.
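The rebalancing step can be sketched as follows (the exact invocation may vary slightly between patch releases):

```shell
# Redistribute queue leaders evenly across cluster nodes, e.g. after a
# rolling upgrade has concentrated them all on one node:
rabbitmq-queues rebalance all
```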
RabbitMQ 3.9 Plans (subject to change)
We would like to have a completely protocol-agnostic core, which would let us support more protocols in the future. Currently, the core is a little bit tied to the original protocol we implemented. Sometimes that introduces overhead or subtle correctness issues, which honestly shouldn't be there.
Among the popular plug-ins, the management UI badly needs a rewrite. It will focus more on management, now that metrics are handled by Prometheus and Grafana. There is also the delayed message exchange plug-in, which is very popular. We would like to make it distributed, as it is a natural fit for Raft given the way it is used. That means you would have a single set of delayed messages, and publishing would not be node-specific. This feature should be available in version 3.9.
RabbitMQ Enterprise Extensions (coming 2020)
RabbitMQ Enterprise Extensions is a set of extensions for open source RabbitMQ. Note that the RabbitMQ license is not going to change. This set of extensions will be commercially licensed and targeted at companies that in some cases have elevated regulatory requirements, involving areas such as where data can and cannot be stored. At some point, such capabilities will be included in RabbitMQ for Kubernetes by Pivotal.
Some of the more advanced extensions include WAN-friendly replication of schema (users, permissions, queues), as requested by many users; it is easy to extend this idea to multiple datacenters. Additional features include optimization of internal traffic, for instance using the Zstandard compression algorithm originally developed by Facebook, which is unbelievably good!
Using more than one link to transfer data is also an area of research, because a single TCP connection will eventually hit its limits and cap communication throughput.
An update on Mnevis
How does the next-gen schema storage compare?
Last year, we worked on a research project called Mnevis. So what is Mnevis, and why do we need it? It is a schema data store. Today we use Mnesia, which was developed to support certain workloads in telecom applications. Honestly, that's not how people run RabbitMQ in most systems today, so some of Mnesia's opinionated design decisions are not the best fit.
What if we were to keep the Mnesia API, but replace its recovery and data merging strategy with the way Raft works, and have a single operation log with the benefits that brings, such as being able to extend to multiple datacenter environments and to back up incrementally? Well, that would be pretty cool, right? And Mnevis is trying to achieve exactly this!
It would also slightly improve on Mnesia's scalability. Mnesia is a reasonable data store, at least for RabbitMQ's needs, but again, we believe it can be optimized. Lastly, Mnesia's development is tied to Erlang/OTP development, which follows yearly releases. Getting changes into its codebase can be a tedious process, and with our team shipping several patch releases every month, we certainly want to overcome that; with Mnesia, that isn't quite possible.
The state of Mnevis
We have a functional prototype, an experimental RabbitMQ branch that uses it, and a couple of open questions. One is that RabbitMQ has transient entities, such as exclusive queues.
As a side note, the way RabbitMQ uses Raft today and the way Mnevis would use it, with concurrency control and transactions, are different workloads, which is another open question. Other than that, Mnevis is doing reasonably well, and we are targeting it for 4.0.