In this talk from RabbitMQ Summit 2019 we listen to Gavin Roy from AWeber Communications.
After 10 years of heavy use of RabbitMQ in production, common problems, patterns, and solutions have emerged. In this talk we will cover architecture, configuration and operational management, monitoring, and maybe a disaster story or two.
Gavin M. Roy (Twitter, GitHub) is the author of RabbitMQ in Depth. An Erlang enthusiast as well as a member of the Python Software Foundation, he has been involved with many open source communities since the mid-'90s.
Practical advice for the care and feeding of RabbitMQ
It’s like she said: I'm Gavin. I'm the VP of Architecture for AWeber Communications. We're an ESP (email service provider) located in the States. I've been using RabbitMQ for just over 10 years or so, and I'm the author of RabbitMQ in Depth. If you're so inclined, there's a nice little code my publisher gave out for 40% off the book. I highly recommend you check it out. It's a great way to learn RabbitMQ from the bottom up, which is somewhat the opposite of the approach most books take, but I feel like it's a really good way to learn why RabbitMQ does what it does.
Before I get into the real part of the presentation, I wanted to talk about perspective, because my talk today, by and large, has a lot to do with opinions. I was speaking with a coworker (I have a reputation at work for being somewhat knowledgeable about RabbitMQ), and he told me he had a really important RabbitMQ question for me: “Is this the front of the rabbit or the back of the rabbit? Is this the fluffy cottontail, or is it the nose?” I had never really thought about that before. Here's a logo I'd been looking at for 10 years, and I hadn't really thought about which side of the rabbit it was. I always just assumed it was the face. So, front or back? Are you going to be able to move on past that? For me, it was like, “Wow.” It gave me a little bit to think about.
All generalizations are wrong
In that same vein, I really like the fact that Obi-Wan admitted to being a Sith in the movie: “Only a Sith deals in absolutes.” I make a lot of absolute statements. I make a lot of “Don't do this. Don't do that.” Everything is relative. In that vein, all generalizations are wrong. I'm going to make lots of them. Keep that in mind.
Story 1: Our App Is Slower with RabbitMQ
I'm going to start with a story from when we had just started using RabbitMQ in production, under heavy load, having moved from ActiveMQ. This was in 2009, the year that brought us that awesome game called Farmville which, fortunately, I never got sucked into, but it was pretty popular at the time. ActiveMQ had been doing well for us, except for the fact that it had lots of memory leaks and we had to restart it in production fairly regularly, which is not the kind of thing you want from a message broker.
Migrating away from ActiveMQ
Our app was a PHP web front-end talking to RabbitMQ. If you're not familiar with how that works: you've got the awesome Internet Explorer web browser of the time talking to Apache with mod_php, and every web server process or connection was a different back-end connection. PHP, at the time, worked by starting the program, executing it, stopping it, and clearing out the memory, which meant that every single connection to RabbitMQ was short-lived. It would connect and then disconnect.
PHP to RabbitMQ
The problem with that is that there are seven stages to negotiating an AMQP connection. We learned really early on that at our load and our scale (over 100 web servers, a really high level of concurrency, 100,000 requests per second or so) our app was slower using RabbitMQ. It was because we didn't fundamentally understand the nature of how our app was talking to RabbitMQ; we made some assumptions and they weren't correct. We had moved from using STOMP with ActiveMQ, and STOMP is a very simplistic protocol, similar to HTTP: make a socket connection, send a bunch of stuff, disconnect. With AMQP you connect, negotiate your connection, and then publish your message and disconnect. The overhead of that initial negotiation added milliseconds of latency to an environment where milliseconds of latency were a concern for us.
Ultimately, what we ended up doing was writing a middleware to sit between our Apache backends and RabbitMQ: an HTTP service, in Python, with long-lived connections. It lowered the load on the RabbitMQ cluster while still handling the large amount of publishing that we were doing.
We came to a few conclusions: that high connection turnover with RabbitMQ was bad, that we needed to stop using PHP (well, that one really wasn't true), and that we needed an HTTP-to-AMQP proxy. You'll even find that CloudAMQP offers a similar feature built into their product these days, for PHP and other environments that have short-lived connections, because connection turnover really does impact the performance of your application.
Story 2: Reinventing the Wheel
That brings me to my next story, which is about reinventing the wheel. Same group, same organization, still 2009. I miss those people. I was in the middle of figuring out how to move from our existing implementation, where we were sending messages and needed to know what kind of message we were sending so that we could handle it correctly on the receiving end. By convention in ActiveMQ, we were publishing to a specific place and consuming from a specific place.
AMQP Message Properties
But now, in RabbitMQ, we had all these other options. We've got this topology that we can create with complex routing scenarios, where we can publish the same message and have it go off to different places for different purposes. So how do we categorize a message and provide information about it in a way that somewhat future-proofs our potential usage?
And so, the VP of Engineering and I got into a room, along with our VP of Architecture at the time, and we had a vigorous debate about our message payload strategy. What came out of that was something like this: they were strongly advocating putting everything in the message body (what kind of message it was, when we sent it, what application was sending it, and additional information), and then having an attribute that was, “and here's the actual message as well.” Which is fine, except that AMQP has that kind of metadata built into it.
RabbitMQ messages have these things called properties. I was saying, “Well, we're using this tool; in for a penny, in for a pound. Let's use the contract as it's defined.” Doing something like this, to me, was like going back to SOAP over HTTP: taking all of the contract out of the protocol we're using and putting it in the payload, and then needing some system by which we correlate the payload with how we decode it in our programs, instead of using what's built into the protocol.
At the time, I was advocating, “Hey, here are the properties that we have. We've got type; let's use it. You want to know when the message was sent? Look, we've got a timestamp, guys. Come on, let's use this the right way. And we may not always be sending JSON across the wire, so let's make sure we're using content-type and content-encoding to describe the body.” Again, all built into the contract.
Finally, something I've come to really appreciate, and probably one of the more understated message properties: app-id, which by convention says which application is publishing your message. I've found, over the years, that being able to use these properties to figure out information about the messages on my message bus is really helpful. That may not be obvious at first blush, and it may not be obvious to developers while they're writing the application. But one of the really keen things about the flexibility in RabbitMQ is that you can start using your messages, over time, in ways you didn't originally think about. So, if that's the case, why not future-proof and provide as much context as you can in the built-in data structure that describes the message?
Our conclusion there, of course, was to keep metadata out of the message body, because that's not really where it belongs in an AMQP message. It belongs in the properties. So let's use those.
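To make that concrete, here's a minimal sketch of the idea in Python. The field names mirror the AMQP Basic.Properties fields discussed above; the dataclass and helper function are illustrative inventions of mine, and a real publisher would pass these through its client library's properties object (for example, pika's BasicProperties) rather than a custom class.

```python
import json
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class MessageProperties:
    """Illustrative mirror of the AMQP Basic.Properties fields above."""
    content_type: Optional[str] = None      # how the body is serialized
    content_encoding: Optional[str] = None  # e.g. the character encoding
    type: Optional[str] = None              # what kind of message this is
    app_id: Optional[str] = None            # which application published it
    timestamp: Optional[int] = None         # when it was sent (epoch seconds)

def make_properties(message_type: str, app_id: str) -> MessageProperties:
    # Keep the metadata in the properties, not mixed into the payload.
    return MessageProperties(
        content_type="application/json",
        content_encoding="utf-8",
        type=message_type,
        app_id=app_id,
        timestamp=int(time.time()),
    )

# The body stays pure data; no envelope fields mixed in.
body = json.dumps({"email": "user@example.com", "action": "subscribe"})
props = make_properties("subscription.created", "web-frontend")
```

The payoff is the one the talk describes: any consumer, or any future tool inspecting the bus, can learn what a message is, who sent it, and when, without having to parse the body at all.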
They didn't agree with me. Somewhere along the line, I think we ended up in a compromise. I've used that principle, and tried to apply it, going forward. Again, this was 2009, a long time ago.
Down the Rabbit hole: Pager Rotation Fatigue
Instead of moving into another story, I want to go down a rabbit hole real quick. From an operational perspective, running RabbitMQ, it's hard to get the things you need to care about right. It's really easy to get everything set up, get your applications in place, and then put monitoring in place so that you know when things break and you can respond to them and fix them.
Naïve monitoring patterns
If you take a naïve approach to that, it's really easy to create paging fatigue for your on-call resources. Maybe there's normal fluctuation in your environment and, if you don't take a holistic approach to your paging, you're getting paged at 2:00 in the morning over conditions like, “Gee, I've got two more messages in the queue than I expected.” That's not good.
What I've found in multiple environments, with people monitoring RabbitMQ, is that they basically have two patterns of use. I think of this as the beginner stage of “how do I make sure that everything running in RabbitMQ is running the way that I think it is?” The two checks are: “Are my queue depths larger than they're supposed to be?”, which I think is generally a good thing to check for, and “Do I have the number of consumers that I expect to be there?”
Those are pretty basic things to check for, right? The problem is that those checks are naïve. Say I have a queue with a hundred consumers on it, and I'm in an environment like Kubernetes, where resources may go away and come back, or AWS, where they take resources away from me and then give them back. I want to program for resilience, not for consistent state. So why should I monitor for consistent state? If I'm expecting a hundred consumers, do I really want an alert threshold that says, “If I have less than 100 consumers, wake somebody up at 2:00 in the morning”? Probably not. But a threshold at either a fraction of what it's supposed to be, or at zero, probably makes a little more sense. Do I really want an alert that says, “Hey, if you've got a hundred messages in the queue, somebody should know about it, because you should always have 0 messages in the queue, because we're optimized for throughput and keep our queue depths at 0”? I'd say that's probably not the best way.
Suggested Monitoring Patterns
One of the patterns that I like to suggest is: alert when a queue has 0 consumers. That tells you something is really wrong. If you're in an environment where you're going for the paradigm of pets versus cattle (or pets versus chickens, as I like to say when I'm in India), you cannot guarantee that you're always going to have exactly the number of consumer applications that you want, but you can make sure that when something's really wrong, somebody's being woken up.
Instead of taking the naïve approach of “my queue depth is at a thousand messages, so alert and wake somebody up,” what you really care about is: are my consumers consuming my messages? Maybe I've got a thousand messages, but if I'm running along at a decent clip and, because this is a queue, I've just had a spike of work, we don't need to wake anybody up. But if there's a trend where a thousand messages have been slowly accumulating over the last hour, and the velocity of consuming messages has gone down, then that's something we may want to look into, because maybe there's an operational issue causing our throughput to be slower than expected. That's a really important way to avoid pager fatigue.
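As a sketch of that alerting logic (the function name, thresholds, and sampling scheme are all made up for illustration; the real inputs would come from your monitoring system or the management API), the paging decision might look something like this:

```python
from typing import Sequence

def should_page(consumer_count: int, depth_samples: Sequence[int],
                expected_consumers: int = 100) -> bool:
    """Page only when something is really wrong, not on normal fluctuation.

    depth_samples is a series of recent queue-depth readings, oldest first.
    """
    if consumer_count == 0:
        return True  # nothing is consuming at all
    if consumer_count < expected_consumers // 4:
        return True  # lost most of our capacity, not just a few instances
    # A spike that is draining is fine; a steady, monotonic climb is not.
    if len(depth_samples) >= 3 and all(
            a < b for a, b in zip(depth_samples, depth_samples[1:])):
        return True
    return False
```

A spike of a thousand messages that is draining (`[1000, 400, 50]`) stays quiet, while a queue that keeps accumulating (`[100, 400, 900]`) wakes someone up.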
One thing that I think gets passed over a lot is looking for problems in your ecosystem. Your consumer applications, when they're handling messages, either acknowledge them (“I've processed this message, so let's move on to the next one”) or say, “There was a problem with this message, so I'm going to negatively acknowledge it and kick it back, or reject it.” Watching for those rejections is an important thing to do, because if your applications are written to use acks and rejections as indications of success and failure, the rejection rate can point to a failure rate that may be worth looking into. But it can't stop there. Dead-lettering, in combination with that, will let you accumulate those rejected messages on another queue, where you can interrogate them and figure out what went wrong. Beyond looking at logs, you can look at the raw data and try to recreate the conditions that caused the negative acknowledgement in the first place.
Finally, if you're running an HA cluster, and part of your normal operational process is restarts for upgrades and the like, one thing you'll find is that if you're not completely distributing your HA queues across your entire cluster (and even if you are), you can end up with an unbalanced distribution of primary ownership of those queues, with many of them homed on one particular node. Or maybe, on a three-node cluster, you've got ha-mode set to exactly and ha-params set to 2, so two nodes mirror each queue; you'll find that, over time, you end up with an uneven distribution of queue ownership across the secondary nodes as well. So keep an eye out for that. Maybe it's not something you alert on, but have a dashboard that shows the distribution, because an imbalance will create a hotspot in your environment: CPU, memory, network, and disk utilization will all be higher on the nodes where the hotspotting is occurring than on the other nodes.
Story 3: “RabbitMQ is Slow”
So, my third story. If you can't tell, I'm going through a bunch of stories. A fun thing: I started at AWeber in 2014, went into the environment, and they were already using RabbitMQ. They were even using some of my software to interact with RabbitMQ, which was really cool: going into a new environment and seeing that they're already using things that I wrote. Neat.
For me, in the ways that I use RabbitMQ, that doesn't work for our business. I would imagine that for most of you that's the case as well.
Now, this was 2014. To take you back to what was happening in 2014, we got cool Android watches and cool Apple watches for the very first time. It's hard to think that that was only, what, five years ago or so that these new devices came into play.
What I walked into was an environment where developers were disgruntled. They had preconceived ideas about what they should be doing. They didn't necessarily make the technology choices they were using; they were told they should do things this way or that way. And, like many times when programmers think their way of doing things is better than the way things are currently being done, they were predisposed to set the environment up for failure without actually looking into what they should be doing.
In this case, they had a three-node RabbitMQ cluster, with a set of publishers throwing data into it and consumers taking data out of it. A pretty simple pattern: HA queues with ha-mode: all. What happens with ha-mode: all is that the publisher publishes into the exchange, the message is routed into a queue, and the state of that queue has to be synchronized across all the other nodes. So I publish a message, it comes in, I've got a consumer over here, and every message in is synchronized and every message out is synchronized. And, of course, that creates a bottleneck.
ha-mode: all && delivery-mode: 2
In addition to that, they were publishing with delivery-mode: 2. Delivery-mode: 2, for those who aren't aware, is the mode that tells RabbitMQ to persist the message to disk. So not only were they coordinating across all three nodes, they were also saying, “Write these messages to disk when they come in.” These were fairly big messages, with big data payloads, used within our system for figuring out trends in email delivery and reputation for our customers. So every single message that came in would get synchronized across the cluster, written to disk everywhere, and go out the other side.
Not a big deal, you'd think; that's built into RabbitMQ. But take it a step further and we find that these were virtual machines. Not only were they virtual machines, they were small virtual machines relative to the workload being put on them.
In this scenario, we've got a Rabbit cluster. Conceptually, I could run all three nodes on one hypervisor, but because we were really concerned about not losing messages, about high availability, and about how many 9's we have to meet for our uptime, each node was on a different hypervisor, and we're coordinating across all of that. So I started asking questions: “Did you think about the throughput requirements of your application before you started? Did you make sure we're provisioning the right type of virtual machine as far as number of cores, amount of memory, amount of I/O capacity?” Well, no. The ops guys take care of that stuff. They just set up the cluster, and we started publishing to it.
As it came to light while investigating the whole scenario: we didn't need to ensure that every message was written to disk. With HA mode, we've got three servers ensuring that we're not going to lose any messages. The whole point of writing to disk was durability, so the message would persist across restarts. Well, if you don't restart all three nodes in the cluster at the same time, you don't have an issue, right? It turned out that our primary culprit was the amount of I/O required to persist messages of that size to disk, to do it three times, and to deal with all of that.
Ultimately, our conclusion was that the VMs were way underpowered for what they were doing. We were expecting RabbitMQ to act like a database server without providing it the same type of infrastructure we would provide a database server. We also had these things running on oversaturated hypervisors: lots of VMs on each hypervisor, all competing for resources.
Somehow, the engineers were thinking, “Well, this is Rabbit's fault. It's slow. It's terrible.” And it turns out: no. You were putting it on an overly congested piece of infrastructure, and that's where your speed issues came from, in combination with the demand you were putting on it by having it do additional I/O on every node. Turning off delivery-mode: 2, bumping up the resources, and moving things around within our operational environment pretty quickly got us to a point where all of the objections fell away and everything was running well. Yay.
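For reference, a related tuning knob from that era is worth knowing: instead of ha-mode: all, a policy can mirror each queue to a subset of nodes. The policy name and queue pattern below are illustrative, not our production values; this uses the classic mirrored-queue settings that were current at the time.

```shell
# Mirror each matching queue to exactly 2 of the 3 nodes rather than all
# of them, reducing per-message synchronization work. Publishers would
# separately stop setting delivery-mode: 2 so messages are no longer
# persisted to disk on every mirror.
rabbitmqctl set_policy --apply-to queues ha-two "^" \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}'
```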
Down the Rabbit hole: Networking & AMQP
I want to take you down another rabbit hole here and talk about networking. This is probably not what you were expecting to hear about at a RabbitMQ talk, but I feel that having a holistic understanding of your application stack, your operational environment, and the various levers you can move to affect performance and throughput is really important, even in areas that you yourself don't necessarily interact with on a regular basis, or at all, because there's that ops group over there that deals with this kind of stuff.
How many people in this room are familiar with the OSI model? A good percentage of you. That's great.
The OSI model is just a way to describe our networking stack and talk about the various layers that impact how data gets moved from one place to another. For most of what we do, we care about the Ethernet frame layer, the transport layer at Layer 4, which is TCP, and our application layer, which is AMQP; it could also be HTTP for the RabbitMQ management interface and API, or STOMP, or MQTT.
One of the things I like to point out about the stack and the model is that we're just dealing with data structures on the wire, multiple nested layers of them.
At the Ethernet frame layer, you can see we basically have a data structure that says, “Take this data from this address to this address. Here's the body and here's the checksum.” Pretty simplistic, but you can see that each of these layers is, in essence, the same kind of thing. On top of Ethernet we add IP, which has all sorts of fun information about the payload we want to send: Where is it going? Where is it coming from? And how big is it? (What's inside is also signaled by the EtherType, the third field of the Ethernet frame.)
And then, on top of that, we put TCP, which gives us control and additional information, but we always see these embedded payloads all the way down to the AMQP frame at the bottom. It makes sense, right?
Default MTU Frame
If we then start to think about AMQP, RabbitMQ is sending messages back and forth, and we take it all the way down to that first layer, the Ethernet frame: the default MTU in most networking environments is 1500 bytes. That only gives you 94.93% frame-size efficiency, meaning that for the amount of data we're putting into that frame versus the overhead of the data structures around it, it's only about 95% efficient.
Default Payload Size
When it comes down to it, after the IP and TCP headers, that means we only have 1460 bytes available in the payload of each TCP segment to send our RabbitMQ frames, which could be part of a message, a command, etc.
At least 3 Frames per AMQP Message
Then consider that there are at least three frames for every RabbitMQ message we publish: the RPC command (“I'm going to send you a message; here's the exchange to publish it to; here's the routing key to use”), a frame with the content header (which includes how big the body of the message is, plus all of the message properties associated with it), and then one or more body frames containing your message. A body frame could carry the first 131KB of data if you're publishing big messages, or the whole message in a single frame if it's smaller.
And now we start thinking about how to divide all that up into 1500-byte frames with 1460 bytes of usable payload, right? You start doing math around what's happening: How many packets per second am I going to be sending, as a high-velocity publisher, to support the size of data that I have? And that backs you into what kind of infrastructure you need to be thinking about with regard to your networking stack, your networking devices, your network cards, all that kind of fun stuff, within your messaging solution.
Jumbo MTU Frame
And that finally brings me to jumbo frames. 1460 bytes is probably okay for a lot of the RPC-style messages going back and forth with RabbitMQ. But if you have messages that are larger than, let's say, a kilobyte, you probably want to be thinking about how you can optimize the networking side of things.
If you have the ability to use jumbo frames, that gets you to 99.14% frame-size efficiency, in terms of the Ethernet overhead for sending the messages versus the amount of payload carried. These kinds of things make important and subtle differences in your application performance. It's not all just Layer 7, as it were.
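Those efficiency numbers are easy to reproduce. The sketch below assumes the overhead being counted is the IPv4 + TCP headers (40 bytes, no options) plus the full Ethernet-level cost per frame (14-byte header, 4-byte FCS, 8-byte preamble, 12-byte inter-frame gap); under those assumptions the arithmetic lands exactly on the 94.93% and 99.14% figures quoted in the talk.

```python
def frame_efficiency(mtu: int) -> float:
    """Usable TCP payload as a fraction of total bytes on the wire."""
    tcp_payload = mtu - 40   # strip IPv4 + TCP headers (20 bytes each)
    wire_bytes = mtu + 38    # add Ethernet header/FCS/preamble/gap
    return tcp_payload / wire_bytes

print(f"{frame_efficiency(1500):.2%}")  # standard MTU -> 94.93%
print(f"{frame_efficiency(9000):.2%}")  # jumbo frames -> 99.14%
```

Going from 1500 to 9000 bytes also means roughly one sixth as many packets per second for the same throughput, which is where the infrastructure math in the previous section comes from.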
Story 4: Inflexibility Strikes
This is a fun one. It's an interesting debate, let's say, about how you should be managing the runtime configuration of Rabbit. When I say runtime configuration, I'm talking about your exchanges, your queues, your bindings, maybe your virtual hosts, your users, policies, permissions - all that kind of fun stuff.
Again, we're talking about 2014. We’ve got this awesome selfie from Mars.
While the awesome people at NASA JPL, and everybody else involved in the project of sending machinery to another planet to investigate and take pictures, were doing that, I was dealing with a technological debate that was much less important in the grand scheme of things.
Upgrading from an old RabbitMQ version
We were going through the process of upgrading from an old RabbitMQ version. And when I say old: this was 2014, and I want to say it was a version from 2010, something like that. You know, set it up, forget about it while it's running. We were moving to whatever the current version was at the time.
That should be a fairly straightforward process, you would think, right? Whether you're doing Blue/Green or A/B, setting up multiple clusters and moving traffic around, ideally, with some pre-planning and configuration changes, it should be pretty easy.
Opinionated Consumer Applications
Unfortunately, during that process, we ran into opinionated consumer applications. In essence, on startup, these applications were declaring the exchange, declaring the queues, and binding them. Stated like that, it's pretty innocuous. It's not a big deal for those things to happen; it's pretty normal. When you look at the RabbitMQ tutorials, that's one of the very first things they show: declare your exchange, declare your queue, bind them, now do your thing. Great.
But what happens when somebody decides, at the time they're writing their code, that they want a direct exchange, and then three years later we decide, “Well, we really should be using topic exchanges, because they give us better flexibility in what we're doing”? Oh, and by the way, the person who made the decision is no longer with the company, didn't document anything, and left us with code running in production that we had no idea was hard-coded with expectations of how the topology in RabbitMQ would be set up.
When it comes down to it, all a consumer application really needs to know is where to connect and which queue to tell RabbitMQ it wants messages from. Ideally, those are configured things, not things that are hard-coded.
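As a trivial sketch of what “configured, not hard-coded” means in practice (the environment-variable names and defaults here are mine, purely for illustration):

```python
import os

def consumer_settings() -> dict:
    """Read connection and queue details from the environment instead of
    hard-coding them next to exchange/queue declarations in the app."""
    return {
        "amqp_url": os.environ.get("AMQP_URL", "amqp://localhost:5672/%2f"),
        "queue": os.environ.get("CONSUME_QUEUE", "events"),
    }
```

When the topology changes, ops updates a config value and restarts the consumer; no code ships, and no application re-declares an exchange it no longer owns.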
As we went through this process, we got like, I want to say, 90% of the applications moved over really easily. Configuration management's a wonderful thing. Being able to go through and change applications quickly allowed us to get everything going. That's when we ran into problems.
This fine gentleman, Dan, who I worked with at the time - he and I went into a room and spent about a day fixing applications with hard-coded expectations of what was running in our RabbitMQ environment and how it should be set up. You can tell he looked really happy about that, right? That's about how I felt as well. It was a real face-palm moment, and somewhat difficult to work through.
But when we think about the motivation behind what the developers were doing, they weren't wrong and they weren't bad. They were trying to create automated, repeatable configuration. That's a really important thing, and a really good way to ensure that your environment is consistent. They wanted a reliable, consistent source for that topology, to make sure the applications keep working, and something that could be used for disaster recovery. You want to make sure the environment is set up correctly for your application to do the things it's supposed to do.
Finally, for them, it meant very easy client application configuration, because if the client application is the thing doing the configuration, you don't have to worry about the other side. The application's doing it all for you.
From my perspective, though, that was taking RabbitMQ and only using part of its functionality. There's a lot more to RabbitMQ than just the AMQP protocol and the ability for applications to connect and declare things. You've got the management UI. You have definitions you can create that RabbitMQ loads when it starts, which set up the topology for you.
So, really, the question became: how do we take those applications with their hard-coded expectations, still meet all the criteria of automated, repeatable, consistent configuration across our environments, but take the control away from the application itself while making sure the contract for its expectations is met?
Bi-Directional Git Sync
What we ended up settling on was a bi-directional, Git-based sync process. In our internal GitLab instance, we have a repository for each of our RabbitMQ clusters. That repository holds a JSON file containing the topology of the cluster, as you would get from an export via the management UI or the API. Well, kind of.
The idea is that if a developer wants to add something new to the topology, I'm not going to force my preference for controlling everything through the management UI on them. They can make a merge request, because it was really important to the engineering team at the time that everything go through a code review process, even if it's not code. So: have a Git repo, do merge requests, and make sure everything happens the way we want. We pull that down onto the cluster into a local Git repo, use scripting in our configuration management to automate applying those definitions to RabbitMQ, and then pull the current state of RabbitMQ back out into the Git repo. If anything is different, we commit it and push it back to GitLab. It was a nice compromise because it satisfied my concern: I didn't want to be constrained.
One of the things I've found over time is that, whether it's an operational issue, an upgrade, or whatever I need to do, I like the flexibility of going into the management UI and making changes in our production environment to accommodate the event, without being constrained. Maybe I need to offload all of the traffic on one node in a cluster to another, so I set up temporary queues where I've explicitly said, “Put this queue over on this node,” then move all the messages there and move the client applications over. I like being able to do that.
This was a nice compromise, but one of the problems is that the RabbitMQ API that exports the definitions as JSON is non-deterministic in its sort order. If you just do a straight export of the definitions from RabbitMQ's API and put it into Git, the file will change every time you run the export, even if nothing in it has actually changed. So we threw together a really simple Python application that does a deterministic sort of that file, so that we only make commits when actual content changes.
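A minimal sketch of that deterministic sort is below. This is not AWeber's actual tool; the top-level key names and sort fields follow the general shape of the management API's definitions export, and are best-effort assumptions on my part.

```python
import json

# Top-level keys in a definitions export whose values are lists of
# objects; sorting each list by a stable key (and canonicalizing dict
# key order) makes repeated exports byte-for-byte identical.
LIST_SORT_KEYS = {
    "queues": ("vhost", "name"),
    "exchanges": ("vhost", "name"),
    "bindings": ("vhost", "source", "destination", "routing_key"),
    "users": ("name",),
    "vhosts": ("name",),
    "permissions": ("vhost", "user"),
    "policies": ("vhost", "name"),
}

def normalize_definitions(raw: str) -> str:
    """Return a canonical JSON rendering of a definitions export."""
    defs = json.loads(raw)
    for key, fields in LIST_SORT_KEYS.items():
        if key in defs:
            defs[key].sort(key=lambda o: tuple(o.get(f, "") for f in fields))
    # sort_keys fixes dict ordering; indent keeps Git diffs readable
    return json.dumps(defs, sort_keys=True, indent=2)
```

Run every export through this before committing, and two exports that differ only in ordering normalize to identical text, so Git only records a change when the topology actually changed.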
Our conclusion there is don't use config management for configuring RabbitMQ’s runtime state. You'll see this in things like Chef, Ansible, or others where you'll have objects. You know, declare my exchange and here are the properties. Declare my queue and here are the properties. We've avoided that and we've tried to make it to where it is a somewhat democratic process as to how that stuff gets managed. It’s still in Git. It’s still applied via configuration management but it's not done in code. Now, again, all generalizations are wrong. If that doesn't work for your organization, it doesn't work for your organization. It worked well for me.
Publishers and consumers should not declare exchanges. I'm going to stick with that one across any environment because exchanges, queues, and bindings are runtime state that reflects your operational environment. It's a separation you'd think of in the same way as a load balancer or maybe a router: yes, that configuration should exist, but do your applications really need to care about it? Probably not.
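One way an application can fail fast without declaring anything itself is to simply check that its queue exists, for example with a GET against the management API's per-queue endpoint. A small sketch of building that URL; the host, vhost, and queue name are illustrative:

```python
"""Sketch: check for a queue's existence instead of declaring it.
The management API exposes GET /api/queues/{vhost}/{name}; note the
default vhost "/" must be percent-encoded as %2F."""
from urllib.parse import quote


def queue_url(base: str, vhost: str, queue: str) -> str:
    """Management-API URL for a single queue."""
    return f"{base}/api/queues/{quote(vhost, safe='')}/{quote(queue, safe='')}"


assert queue_url("http://localhost:15672", "/", "emails") == \
    "http://localhost:15672/api/queues/%2F/emails"
```

A 404 from that endpoint tells the consumer the operators haven't created its queue yet, without the application ever mutating the topology.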
And finally, there are applications that do need to care about it. For example, if you're using RabbitMQ in an RPC paradigm where you've got, let's say, a web service that is publishing RPC requests over RabbitMQ using the reply-to header, it's going to need a place for the replies to come back to, so it should declare an exclusive temporary queue for itself to receive its replies. That's not the kind of thing that you want to keep in a Git repo.
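That RPC pattern might look like the sketch below, assuming the pika client library: the caller asks the broker for a server-named, exclusive queue and points `reply_to` at it. The exchange and routing key are hypothetical.

```python
"""Sketch: an RPC caller declaring its own exclusive reply queue - runtime
state that belongs to the application, not to the Git-managed topology."""

# Declare arguments for the reply queue: "" lets the broker pick a name,
# exclusive ties it to this connection, auto_delete removes it on disconnect.
REPLY_QUEUE_ARGS = {"queue": "", "exclusive": True, "auto_delete": True}


def rpc_call(body: bytes):
    """Declare our reply queue and publish a request pointing at it.
    (Not invoked here; assumes the pika client and a local broker.)"""
    import pika  # assumed client library

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    result = channel.queue_declare(**REPLY_QUEUE_ARGS)
    reply_queue = result.method.queue  # broker-generated name (amq.gen-...)
    channel.basic_publish(
        exchange="rpc",           # hypothetical RPC exchange
        routing_key="do-work",    # hypothetical routing key
        properties=pika.BasicProperties(reply_to=reply_queue),
        body=body,
    )
```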
Down the rabbit hole: Blue/Green Upgrades
Getting closer, I promise. We're going to go down the rabbit hole one more time and talk about Blue/Green upgrades, because this was another common theme that I found across environments. It's getting a lot easier, but it can be hard to upgrade RabbitMQ in production. How do you do it in a non-disruptive way? If I'm going from, let's say, RabbitMQ 3.6 running on Erlang 19 and I want to be really cool and go to RabbitMQ 3.8 running on Erlang 21, I can't just swap out nodes one at a time. Now, if I'm staying on the same Erlang version and I'm on the later versions of RabbitMQ 3.7 - 3.7.15 forward, I think - and I want to go to 3.8, or I want to go to 3.8.1, I actually can now make those types of changes. But, in the past, it's been pretty difficult.
And so, basically, there's a really simple pattern to follow using built-in tooling within RabbitMQ - the shovel - to set up your two clusters. We're using Git for our topology, right? So we know all of the definitions we need to set up our new cluster. Let's use configuration management to stand up our new cluster - our new virtual machines, or a new deployment in Kubernetes, or whatever your topology looks like. And then we've got our publishers and our consumers on our Blue cluster. We simply move our consumers over to the Green cluster and we use shovels to bridge the two clusters together.
One of the reasons why we start with consumers is that the shovel is acting as a consumer on the queue on one side and just moving messages right across to the same queue name on the other cluster. In essence, we can ensure our application keeps working without interruption, other than the restart or reconnect required for our consumer application to connect to the other side.
In the next step, we move our publisher. Our shovels are still going. Any messages that are in our old cluster are still there, being moved across by the shovel, but now we've moved our publishers across and our application is still working. We wait until there's no activity in the Blue Cluster and we get rid of the shovels. Pretty easy.
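The bridging step above can use RabbitMQ's dynamic shovels, which are created as runtime parameters through the management API (PUT /api/parameters/shovel/{vhost}/{name}). A sketch of building the request body; the cluster URIs and queue name here are made up for illustration:

```python
"""Sketch: request body for a dynamic shovel that drains a queue from the
blue cluster into the same-named queue on the green cluster."""
import json


def shovel_body(src_uri: str, dest_uri: str, queue: str) -> str:
    """JSON body for PUT /api/parameters/shovel/{vhost}/{name}."""
    return json.dumps({
        "value": {
            "src-uri": src_uri,       # e.g. amqp://user:pass@blue-host
            "src-queue": queue,
            "dest-uri": dest_uri,     # e.g. amqp://user:pass@green-host
            "dest-queue": queue,      # same queue name on the new cluster
        }
    })


body = json.loads(shovel_body("amqp://blue", "amqp://green", "emails"))
assert body["value"]["src-queue"] == body["value"]["dest-queue"] == "emails"
```

Once the blue cluster is idle, deleting the shovel parameter tears the bridge down again.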
Key Takeaways
Key takeaways that I have for this. Always use the contract AMQP provides. That's the properties, the exchanges, bindings, routing keys. Don't naively monitor queues for messages ready or consumer counts. Monitor for the absence of consumers and queue depth in relation to velocity. Make sure your infrastructure supports your workload decisions. That's a really important one. And, in most cases, you don't need delivery-mode: 2. Don't declare your exchanges and queues in your applications. Use Blue/Green deployments.
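The monitoring takeaway can be made concrete: alert on the absence of consumers, or on queue depth relative to consumption velocity, rather than a naive depth threshold. The sketch below reads the shape returned by the management API's GET /api/queues (those field names are from that API); the drain-time threshold is invented.

```python
"""Sketch: alert on missing consumers or on backlog relative to velocity,
not on raw queue depth."""


def should_alert(queue: dict, max_drain_seconds: float = 300.0) -> bool:
    consumers = queue.get("consumers", 0)
    ready = queue.get("messages_ready", 0)
    # ack rate: rough messages/second actually being consumed
    rate = queue.get("message_stats", {}) \
                .get("ack_details", {}).get("rate", 0.0)
    if ready and not consumers:
        return True   # backlog with nobody consuming
    if rate <= 0:
        return False  # no backlog to worry about, or nothing measurable yet
    return ready / rate > max_drain_seconds  # backlog won't drain in time


assert should_alert({"messages_ready": 10, "consumers": 0})
assert not should_alert({"messages_ready": 10, "consumers": 2,
                         "message_stats": {"ack_details": {"rate": 5.0}}})
```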
Questions from the audience
Q: What monitoring tools do you use?
A: We're in a bit of a half-and-half state. A lot of our stuff is a combination of Diamond and Nagios. These days, we've been moving towards Telegraf, using Grafana for our front end and for paging, and a mix of InfluxDB, Prometheus, and Graphite.
Q: You mentioned not having a need to use persistent messaging.
Yeah, you mentioned not having a need to use persistent messaging. My question was going to be: presumably, that means you keep the system up all the time and you never have to turn it off. I guess, with the Blue/Green deployments, is that a way that you can run 24/7 without having the problem of shutting down and losing messages?
A: Yes. That, combined with HA queues in clusters because, if you have a cluster and you're ensuring that the messages are replicated across all nodes in the cluster, you have a far lower probability of losing any messages.
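The HA-queue replication in that answer is driven by policies. One way to express such a policy, for example as the body of a management-API PUT /api/policies/{vhost}/{name} call, is sketched here; the queue-name pattern is illustrative:

```python
"""Sketch: a policy definition mirroring classic HA queues across all
nodes in the cluster."""
import json


def ha_policy_body(pattern: str = ".*") -> str:
    """Policy body: mirror queues matching `pattern` to all nodes and
    synchronize new mirrors automatically."""
    return json.dumps({
        "pattern": pattern,
        "definition": {"ha-mode": "all", "ha-sync-mode": "automatic"},
        "apply-to": "queues",
    })


policy = json.loads(ha_policy_body("^important\\."))
assert policy["definition"]["ha-mode"] == "all"
```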
Q: A quick question on how you do the Blue/Green.
A: In our environment, we use Consul as our service discovery layer. Each cluster has the same generic name - it may be cluster-a.service.production.consul - but it also has a tag associated with it, which is blue or green. So, in our configuration, we could target and say, “Go to the Green Cluster instead of the Blue Cluster.” With the way that our applications are written, they basically will make that switch automatically when we make the key change in Consul that says, “Go here instead of there.”
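As an illustration of that lookup, a client could resolve the currently active cluster through Consul's health endpoint, filtering by the blue/green tag. The service name comes from the answer above; the port is Consul's default, and the rest is a sketch rather than AWeber's actual code:

```python
"""Sketch: resolve the active RabbitMQ cluster via Consul's HTTP health
API, selecting only passing instances that carry the blue/green tag."""


def service_url(consul_base: str, service: str, color: str) -> str:
    """Consul health-API query for passing instances of `service`
    carrying the given tag ("blue" or "green")."""
    return f"{consul_base}/v1/health/service/{service}?tag={color}&passing=true"


assert service_url("http://localhost:8500", "cluster-a", "green") == \
    "http://localhost:8500/v1/health/service/cluster-a?tag=green&passing=true"
```

Flipping the tag that clients ask for is then the whole cutover from blue to green.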
Q: What is an optimal way to balance out the queues in an HA cluster?
A: To balance the queues - there are a couple of applications out there that do that; I'm totally blanking on at least one of the names. I wrote a Python script to do it - I think it's called rmq-cluster-rebalance - but there are a number of tools like this. They basically use the built-in policies within RabbitMQ to move the primary queue from one node to another, taking into account the number of queues on each node to figure out the optimal strategy.