In this talk from RabbitMQ Summit 2019 we listen to Scott Corrigan from Nastel.
In order to detect anomalies and prevent runtime issues that can compromise the delivery of business services that depend on RabbitMQ, effective monitoring, alerting and analytics tools can make a big difference. But are all of the available tools alike? Attend this session to discover what Nastel has to offer.
Scott Corrigan (Twitter, GitHub) manages Nastel’s technology related activities in the field, including sales engineering and services delivery. In his role, Scott leads a worldwide team that supports customers, sales and partners. With over 20 years of experience in the IT industry, Scott has architected software solutions and managed deployments to enable a large number of corporate customers to succeed in their business objectives.
Uncommon monitoring and analytics for RabbitMQ
This presentation I'm going to give you today is about monitoring and analytics of RabbitMQ. The creators of RabbitMQ have always recognized the importance of monitoring for an enterprise-grade messaging system: monitoring that helps us detect runtime issues or anomalies before end users are affected and before the delivery of business services is compromised by things like unavailability, exhausted resources, or abnormal behavior. RabbitMQ comes with a management interface and APIs that expose a large number of metrics.
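As a concrete illustration, those metrics can be pulled programmatically from the management plugin's HTTP API. Here is a minimal sketch in Python; the `/api/nodes` endpoint and the default `guest` credentials are standard for a local broker with the management plugin enabled, but the sample payload below uses illustrative values, not real broker output:

```python
import json
import urllib.request

def fetch_nodes(base_url="http://localhost:15672", user="guest", password="guest"):
    """Poll the management plugin's /api/nodes endpoint (requires a live broker)."""
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, base_url, user, password)
    opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))
    with opener.open(f"{base_url}/api/nodes") as resp:
        return json.load(resp)

def summarize(node):
    """Pick out a few health-relevant fields from one node's metrics."""
    return {
        "name": node["name"],
        "mem_used_pct": 100.0 * node["mem_used"] / node["mem_limit"],
        "fd_used_pct": 100.0 * node["fd_used"] / node["fd_total"],
    }

# Sample payload shaped like one entry of the /api/nodes response
# (illustrative values, not captured from a real broker).
sample_node = {"name": "rabbit@host1", "mem_used": 512 * 2**20,
               "mem_limit": 4 * 2**30, "fd_used": 120, "fd_total": 1048576}
print(summarize(sample_node))
```

Any external monitoring system, Nastel's included, is ultimately built on data like this pulled from the broker's standard interfaces.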
Importance of an analytics engine
There are a lot of good reasons for having an external monitoring system. One of them is the decoupling of the monitoring system from the system being monitored, which is, of course, very important. That offers the benefit of lower overhead and also gives you access to metrics that are outside the realm of RabbitMQ - the operating system, the network, or other pieces of your application infrastructure, for example. It aims to give you more complete insight into the state of your system, along with the ability to push notifications and alerts to the people tasked with problem triage, which is an important thing.
Time series analytics
Beyond monitoring, there's the aspect of analytics, which is something we'll also talk about today. Whereas monitoring is principally focused on the state of the system at any given point in time, analytics is more about time series data - analyzing and visualizing time series data, which enables you to see issues that can only be identified over longer intervals of time. By examining this kind of data, you can get a more comprehensive assessment of situations that can affect the business performance of your system.
Importance of an analytics engine
You notice the title “uncommon monitoring and analytics for RabbitMQ” and you might be thinking, “Well, what is it that makes a solution for monitoring and analytics of RabbitMQ uncommon?” Well, the first thing I want to draw your attention to is the presence of an analytics engine. This analytics engine offers extreme scalability and it's capable of processing literally millions of events and metrics per second. It enables you to detect patterns and trends in the data from various sources like middleware, logs, or applications.
Nastel’s Solution Offering for RabbitMQ
There are two parts of our solution offering for RabbitMQ that are predicated on the presence of this analytics engine. There's AutoPilot, which is the real-time monitoring part of our solution. That's about what's happening now: collecting metrics and events from RabbitMQ and other parts of the application infrastructure to identify situations that can have an impact on the availability and performance of RabbitMQ and the applications that depend upon it. It's about notifying the people who can do something about these situations in a timely manner. In some cases, it's also about invoking remedial actions automatically - you can invoke actions that resolve some of the situations you detect with AutoPilot.
XRay is the other part of the same solution offering, focused on the analytics of time series data. Even though it's important to know what's happening right now, it's also very important to get insights from the data collected over time to understand how and why things are happening. This time series analytics function is really a downstream function with regard to AutoPilot: XRay leverages metrics and events that come from AutoPilot and also from RabbitMQ directly. It also offers an important capability, which is tracing - the tracing of applications as they run in the RabbitMQ environment.
Importance of an Analytics Engine
This analytics engine can be represented visually as a kind of pipeline, where you see raw data streaming in on the left side of this graphic and, as an output, dashboards that provide visualizations of what's happening within RabbitMQ.
Now, this raw data is collected from middleware and also from other sources within the application infrastructure. What's really interesting is what happens inside this pipeline. The first thing we do is select the data points to be monitored from a catalog of live runtime events. As a second step, we specify simple rules that we use to interpret changes in the value or the state of the data points we've collected. Now, you’ll notice I used the word specify. That's very deliberate on my part. I say specify because there's no programming or scripting involved. You simply indicate how these data points should be interpreted based on their current values.
A simple example: let's say you've discovered, through this tooling, that the amount of memory being used by a RabbitMQ virtual host has exceeded, say, 15% of machine memory. In a situation like that, you might want to declare a situation such as a warning or an error.
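A rule like that boils down to a tiny severity classifier. Here's a sketch in Python; the 15% warning threshold comes from the example above, while the 25% error threshold is an arbitrary illustrative value, not a Nastel default:

```python
def classify_memory(mem_used_bytes, machine_mem_bytes,
                    warn_pct=15.0, error_pct=25.0):
    """Map a memory reading to a severity, the way a monitoring rule would.

    The 15% warning threshold is from the talk's example; the 25% error
    threshold is an arbitrary illustrative value.
    """
    pct = 100.0 * mem_used_bytes / machine_mem_bytes
    if pct >= error_pct:
        return "error"
    if pct >= warn_pct:
        return "warning"
    return "ok"

print(classify_memory(5 * 2**30, 32 * 2**30))  # ~15.6% of a 32 GiB machine → "warning"
```

The point of the "no scripting" claim is that in AutoPilot you declare thresholds like these through the policy editor rather than writing code like this yourself.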
And then, finally, in the third step of this process, based on these conditions - or situations, as we like to call them - we invoke an alert, which means sending an email or an SMS to someone. But also, probably most importantly, doing things like forwarding events to, for example, an enterprise manager or a service desk. And in some instances, invoking an external program or a script that can actually resolve an issue you've encountered.
Data Collection “Decoupled” from Data Intelligence
All monitoring systems, regardless of where they come from or which vendor, provide some mechanism for collecting data, and they all provide some mechanism for interpreting the data they collect. However, when you look under the covers of most systems, what you discover is that these two functions are very tightly coupled. The architecture of these tools involves facilities that not only collect the data but also provide the monitoring logic, so to speak. The downside of that kind of architecture is that the monitoring of each thing becomes very specific, which means that once you've mastered the art of monitoring one part of your application infrastructure, you really have to start from scratch and learn how to monitor the next piece all over again.
AutoPilot is quite different because the function of data collection is completely decoupled from the function of data intelligence. These are two separate parts. There's a broad set of plugins, as you can see here, provided to help you collect and route data from various sources to what we call an active data grid. All of the data intelligence is native to the analytics engine; it's not part of the plugins, which are responsible only for collecting.
Metrics and Events from Multiple Sources
Here, you see an example of what data collection looks like within AutoPilot. You see data from RabbitMQ, shown to the left here - all nodes and clusters across the enterprise - along with data from application servers, from an enterprise service bus, and even from business applications. All of this data is collected in real time, using the standard APIs and interfaces provided by each one of these sources, because that's the function of the plugin.
Continual Statistical Analytics in Real Time
Now, AutoPilot itself neither knows nor cares where the data comes from, or even what it's about. These are simply data points from the application infrastructure. Any one of these data points may be interesting or important to you, depending on the kinds of situations you want to monitor, keep track of, and be alerted about. So, while AutoPilot is busy collecting this raw data in real time, and the analytics engine is making decisions and raising alerts based on the rules defined by the user, there's something else happening at the same time. With each sampling of data that AutoPilot takes, the analytics engine is running continual statistical analytics.
These analytics, they’re run in the background, on a dedicated server, far away from RabbitMQ processing so that we don't interfere with what's happening on your messaging platform.
You might wonder, “Why is it important to have these statistical analytics?” It's important because it enables us to detect trends and patterns in the data that we might want to use when specifying rules - to identify situations of interest that may be hiding in the data, situations that are not readily noticeable, things you wouldn't otherwise see. By detecting these patterns and trends, AutoPilot can be used as a kind of early warning system for degraded performance, unusual activity, or anomalies. The analytics we run range from the simplest to the most sophisticated: exponential moving average, standard deviation, historical rate of change, and many others.
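Two of the techniques named here are easy to sketch. The following toy Python version computes an exponential moving average and flags samples that deviate from a rolling window by more than a few standard deviations; the window size, smoothing factor, and threshold are illustrative choices, not Nastel's actual parameters:

```python
import statistics

def ema(values, alpha=0.3):
    """Exponential moving average of a metric stream (alpha is illustrative)."""
    avg = values[0]
    out = [avg]
    for v in values[1:]:
        avg = alpha * v + (1 - alpha) * avg
        out.append(avg)
    return out

def anomalies(values, window=10, threshold=3.0):
    """Flag indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mean = statistics.fmean(recent)
        sd = statistics.stdev(recent)
        if sd > 0 and abs(values[i] - mean) > threshold * sd:
            flagged.append(i)
    return flagged

# A steady publish rate (messages/s) with one sudden spike at index 15.
rate = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100,
        100, 101, 99, 100, 102, 400, 100, 101, 99, 100]
print("anomalies at:", anomalies(rate))  # flags index 15
```

Derived metrics like these are what turn a raw metric stream into the "early warning system" described above.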
Real-time Monitoring, Notifications and Alerting
Another thing about AutoPilot is that it's a completely policy-based system. It's based on simple rules that you specify without any scripting or programming, as I mentioned earlier. It provides a policy editor that you use as an authoring tool. You've got this wizard-based approach that allows you to use a point-and-click, drag-and-drop kind of paradigm that involves those three simple steps: select the data points that you're interested in, describe or define the rules that you evaluate conditions with, and then, finally, invoke alerts and, optionally, actions also.
The data points that are involved in these monitoring policies, they come from one or more sources. They can come from the RabbitMQ environment, from logs, from the operating system, and also those derived metrics that come from the statistical analytics that the analytics engine provides.
These policy-based dashboards notify us about the severity of detected conditions through a familiar traffic-light pattern: green, yellow, red. The automated actions you can invoke are, in fact, based on alert severity. They can include anything from external programs and scripts to third-party tools that you use to alleviate the situations you discover.
Pre-built real-time monitoring for RabbitMQ
With the RabbitMQ plugin, we also provide a set of pre-built policies that deliver what I would call out-of-the-box monitoring for RabbitMQ, so you don't have to build everything from scratch. It helps you optimize the performance of RabbitMQ with a sort of all-at-once view of the key metrics that come from the RabbitMQ environment. You can also create your own monitoring dashboards with AutoPilot, using those pre-built dashboards as a starting point.
The pre-built dashboards are designed to handle the most common situations that a RabbitMQ administrator needs to monitor and cares about, and you can extend them yourself. You can use these pre-built dashboards to detect situations like, let's say, a buildup in the number of concurrent connections, unusually high churn statistics, or high or low message processing rates - virtually anything that's visible through the common set of RabbitMQ metrics and events.
Monitoring Configuration Changes, Log Monitoring & Aggregation
Now, in addition to the runtime performance metrics, there are other subjects you might want to consider, such as configuration changes and the data you can get out of the RabbitMQ logs. RabbitMQ can be configured through a number of mechanisms: configuration files, environment variables, and policies as well. Queues in RabbitMQ, for example, have mandatory properties such as durable or exclusive, and also optional properties, called x-arguments, that control things like queue length or time to live (TTL). These properties are supplied by client applications when they declare queues, or through policies and operator policies, and those policies can take effect at any time.
Using AutoPilot, you can enforce configuration standards for RabbitMQ - for example, raising an alert when a property of a queue or an exchange differs from an expected value. You can also monitor log file events. For example, we ship the product with a log file plugin that allows you to define what we call event profiles, so you can cover a practical use case such as submitting an alert whenever an error or warning message appears in a RabbitMQ log.
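Enforcing a configuration standard amounts to diffing an object's actual properties against an expected set. A minimal Python sketch follows; `x-message-ttl` and `x-max-length` are real RabbitMQ queue arguments, but the specific values and the idea of a fixed "expected" dict are illustrative:

```python
def config_drift(actual, expected):
    """Return (property, expected, actual) mismatches for one queue or
    exchange - the kind of check a configuration-standard rule performs."""
    drift = []
    for prop, want in expected.items():
        have = actual.get(prop)
        if have != want:
            drift.append((prop, want, have))
    return drift

# Expected standard for a hypothetical order queue (illustrative values).
expected = {"durable": True, "x-message-ttl": 60000, "x-max-length": 10000}
# Actual declared properties, e.g. as reported by the management API.
actual = {"durable": True, "x-message-ttl": 30000, "x-max-length": 10000}

for prop, want, have in config_drift(actual, expected):
    print(f"ALERT: {prop} expected {want}, found {have}")
```

Because client applications can declare x-arguments themselves and policies can change at runtime, drift checks like this are what catch a queue that quietly stopped matching the standard.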
So far, I've been talking all about analyzing runtime metrics, detecting anomalies, delivering alerts, resolving problems - all about runtime problems. Well, the high volume of interactions in the application environment, especially when you're using a real-time monitoring system like AutoPilot, creates a constant stream of messaging traffic, monitoring events, and alerts. All of these interactions produce a very large amount of data. It's probably what most people refer to as big data.
To get insights from that big data - from the application and middleware environment, you need to analyze not only what's happening right now but also have some information about how well the environment is managed over time. This is really the remit of XRay. As I mentioned before, this is a downstream function with regard to what we do in AutoPilot. While AutoPilot is taking care of the here and now, the real-time events, it's also constantly streaming data to Nastel's XRay big data repository where the information is stored. It's indexed and available for further analysis and for research. It's XRay that provides what we call the long-term data analysis for application owners, architects, and other people to enable better decision making.
Here’s an example of a Nastel XRay dashboard for RabbitMQ where we see consolidated information that comes from RabbitMQ, from application metrics, and real-time events that are streamed from both RabbitMQ and also from the AutoPilot platform to produce analytics that you can look at here.
The objective of this platform is to give users better insight into application and middleware issues in a way that's intuitive and easy to understand. They can get better visibility of what's happening during a selected time interval. You’ll notice, in the upper left-hand portion of the screen, there's a time interval that you can specify. You can say, “I want to see what's been going on over the last 45 minutes, or today, or this morning, or between two dates and timestamps,” for example. The benefit is that the user doesn't have to figure out what has happened during a given time period using various logs or disparate monitoring tools, since all the needed analytics are shown on a single pane of glass, in one place.
To build dashboards like this, we provide an English-like, very simple, specification-type query language that analyzes the machine data stored in the repository. There's also a convenient wizard that helps you work with these queries and generate them.
The dashboards we're showing here, and the queries they're based on, are used to analyze behavior, relationships, and patterns in the time series data using simple logic. You can get answers without having, let's say, a PhD in data science.
Time Series Analytics: RabbitMQ Memory Use Over Time
Now, this is a practical illustration of a case that involves time series analytics for RabbitMQ, showing, for example, memory being consumed on a given node. What you'll notice in the graphic is that this kind of information is relatively difficult to detect with real-time monitoring systems because the data evolves very gradually over time. Between this morning and this evening, for example, we can see a slow creep in memory consumption that happens over several minutes or several hours. It's more easily detectable with time series data than with real-time analytics.
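A slow creep like that can be surfaced from stored time series data with something as simple as a least-squares trend line. Here's a sketch assuming evenly spaced hourly samples of node memory use in MB; the sample values are made up to show a ~2 MB/hour leak hiding in noise:

```python
def slope(values):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Hourly memory samples in MB: a slow upward drift hiding in noise.
samples = [500, 503, 501, 506, 507, 509, 512, 511, 515, 518]
mb_per_hour = slope(samples)
print(f"trend: {mb_per_hour:.2f} MB/hour")
```

A real-time threshold would stay quiet here because no single sample looks alarming, but the positive slope over hours is exactly the signal a time series view exposes.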
Timeline of Performance Events
Here, we're showing an example of a timeline of performance events: a stacked chart that shows, within each column, the frequency and severity of performance events observed within RabbitMQ. Mind you, all this data was detected in real time by AutoPilot in the first phase, so to speak.
You might be saying, “Well, if all of this information has already been dealt with, alerted upon, and given to the people responsible for problem triage, why would it be interesting to look at it in a time series analytics fashion?” The reason is that, very frequently, the people dealing with issues find they're actually treating symptoms and not the root cause. By seeing recurring events and recurring problems, and having some analysis of that type of behavior, they get a more insightful view of what's actually going on in the RabbitMQ environment, and they can deal with it more appropriately. These are just two of the many examples of things you can do with Nastel’s XRay.
Content Header Frame
Now, another area where we offer significant benefit is message flow tracing and profiling. Enterprise applications today use messaging middleware like RabbitMQ to implement application integration. In any enterprise workflow, there's sometimes inconsistent performance. I'm sure you're all familiar with situations where, for example, colleagues say, “We did not handle a certain order, or did not receive a certain invoice, or didn't make a certain shipment related to a message.” In many cases, you need to find out more about how these message workflows run. Are there performance issues? How often do they occur? Which particular orders, invoices, or workflows are affected? These kinds of problems are challenging, and the data used for diagnosing workflows in most organizations is usually inadequate or just not pertinent.
A lot of companies use APM (application performance monitoring) technologies to try to find out what's happened with application workflows. Those tools are very much specialized in application issues, whereas here we're looking at the middleware layer - the messaging layer - to find out what's actually happening with RabbitMQ, so that we don't have a blind spot in our analysis of what's happening at the application layer.
Nastel XRay : RabbitMQ Message Flow Tracing
For message flow tracing, we leverage the RabbitMQ firehose tracer, which publishes a trace event message onto a specific topic exchange every time a message is published or delivered. These trace events contain metadata about the original message as well as the message payload itself, and they arrive in a JSON format.
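Concretely, the firehose publishes these trace events to the `amq.rabbitmq.trace` topic exchange, using routing keys of the form `publish.<exchange name>` for messages entering an exchange and `deliver.<queue name>` for messages leaving a queue (it's enabled per vhost with `rabbitmqctl trace_on`). A minimal sketch of classifying those routing keys, the first step any consumer of the trace stream performs:

```python
def parse_trace_key(routing_key):
    """Split a firehose trace routing key into (event_type, target).

    The firehose uses 'publish.<exchange name>' for messages entering an
    exchange and 'deliver.<queue name>' for messages leaving a queue.
    Splitting on the first dot keeps dotted exchange/queue names intact.
    """
    event, _, target = routing_key.partition(".")
    if event not in ("publish", "deliver"):
        raise ValueError(f"not a firehose routing key: {routing_key!r}")
    return event, target

print(parse_trace_key("publish.orders"))     # ('publish', 'orders')
print(parse_trace_key("deliver.invoice-q"))  # ('deliver', 'invoice-q')
```

A real consumer would bind a queue to `amq.rabbitmq.trace` with a `#` binding key and apply this classification to each delivery; the exchange and queue names above are, of course, hypothetical.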
Message Flow Tracing for RabbitMQ
What XRay does is stream these trace event messages to its big data repository, where they're indexed and correlated for analysis and search. The XRay web interface enables users to view these message flows through a variety of graphical charts with useful information, like the topology of the message flow, as you see here, with drill-downs, the elapsed time of the flow, and the timeline of workflows from start to end. Users can also drill down from any one of these views to see the details of any hop - any message event that happens during the course of a workflow.
RabbitMQ Workflows : Message Flow Objectives and Missed SLA’s
One other example I'll share with you - and this is actually a drill-down from a message flow. Here, we see a certain number of workflows that have been executed over time, where each dot in the graphic, in the upper left-hand corner of the screen, represents one iteration of a workflow. Now, all the workflows executed here consisted of the same number of events - eight events each time. It's the same standard business process, the same workflow doing the same thing.
Well, you’ll notice, as you look at this, there are two vertical axes shown here. One is the number of events, the event count; the other is the elapsed time. As you can see, a certain number of workflows are actually exceeding their service level objectives. With this tooling, we capture this information and hand it back to the user in a graphical way that's very easy to understand, very intuitive. It helps you identify the specific workflows that have not met their service level objectives. That's just one example of the many types of charts you can use for message tracing on RabbitMQ with Nastel's XRay.
Questions from the audience
Q: When you use the Firehose, does that mean that you leave that on constantly to continue doing the analytics or do you just leave it on for a period of time?
A: You can actually do both. You can leave the firehose on, in which case we grab the message events directly from the topic exchange where they’re published. You can also direct them to a log file, and you can stop and start tracing, for example, using the command line interface for RabbitMQ.
What we're running in the background, on a dedicated server far away from RabbitMQ, is a Java program that reads that information directly off those queues, streams it to our data repository, and makes it available. There are a number of customers that do what they call campaigns, where there's a spontaneous use of the firehose tracer at certain moments of the day or when they have suspicions about problematic workflows.
There are still other companies that say, “We consider every piece of data valuable and we need to stream everything. We need to be aware of everything that's happened.” The ones that are very keen on zero message loss, in fact, leave the tracer on.
Q: On your log collection, how do you do that?
A: It's probably more akin to log scraping. Wherever the RabbitMQ log files or the server log are hosted, we install a local agent on that platform that scrapes the content of the log and, based on event profiles that you declare, sends a warning or an alert condition to the AutoPilot environment based on what it's read from the log file. It's a lightweight agent - barely 2 MB - that runs continually on the server where the log file is hosted.
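In essence, an event profile is a set of patterns mapped to severities. Here's a minimal Python sketch of that idea; the log line shapes mimic RabbitMQ's `[error]` / `[warning]` level tags, but the specific lines and the `EVENT_PROFILE` structure are illustrative, not Nastel's actual format:

```python
import re

# Hypothetical event profile: regex pattern -> severity to raise.
# The patterns match RabbitMQ-style "[error]" / "[warning]" level tags;
# the structure itself is an illustrative stand-in for a real profile.
EVENT_PROFILE = [
    (re.compile(r"\[error\]"), "error"),
    (re.compile(r"\[warning\]"), "warning"),
]

def scan_line(line):
    """Return the severity to alert with, or None if the line is benign."""
    for pattern, severity in EVENT_PROFILE:
        if pattern.search(line):
            return severity
    return None

# Illustrative log lines, not captured from a real broker.
log = [
    "2019-11-04 12:00:01 [info] <0.684.0> connection accepted",
    "2019-11-04 12:00:05 [warning] <0.702.0> memory resource limit alarm set",
    "2019-11-04 12:00:09 [error] <0.733.0> channel error on connection",
]
alerts = []
for line in log:
    sev = scan_line(line)
    if sev:
        alerts.append((sev, line))
print(alerts)
```

The agent described above does this continuously against the live log file, forwarding each match to AutoPilot rather than printing it.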
Q: How does this integrate with the new stuff we had this morning on 3.8, with the Grafana and Prometheus stuff?
A: Good question, yeah. It is true that version 3.8 integrates support for Prometheus and Grafana. I think there are a number of differences. First, naturally, the time series analytics is something very specific to our solution; the Prometheus and Grafana capability built into 3.8 is very much about real-time monitoring and analytics. That's one thing. There's also the time required to build these things.
I think one of the most important aspects of both the XRay and the AutoPilot product is that they're purely declarative. There is not a lot of preparation, or rolling up your sleeves, or hard work required to build dashboards like these.
Now, if you can master the simple query language - and anyone familiar with SQL is going to find this a thousand times easier - you can master XRay. There’s also a wizard that helps you with that.
With regard to Nastel's AutoPilot, that simplified three-step procedure for specifying the conditions that raise alerts clearly makes it a very simple user paradigm for monitoring and alerting.
I think there are really important differentiators in terms of time-to-value that these products offer. By comparison, if you were to test the Prometheus and Grafana solution and compare it with this one, I think you'd be greatly surprised by how little time is required to build solutions with this tool.