Decentralized observability with OpenTelemetry, Part 1
For years, we’ve monitored our own fleet of LavinMQ and RabbitMQ clusters to ensure stability. It worked well, but as we grew, our central system started to struggle. We realized our static architecture had hit a wall. Here’s why we migrated to OpenTelemetry for a decentralized future.
Why we needed change
Our observability system serves two key purposes: providing us with fleet monitoring and troubleshooting capabilities, and offering customers metrics, logs, and alarms through integrations with their preferred observability platforms.
RabbitMQ's management API has long done double duty as both a control plane and a metrics delivery system. This coupling causes performance issues when RabbitMQ is under high load, making metric collection unreliable.
In November 2019, RabbitMQ 3.8 introduced Prometheus support. The Prometheus plugin offered clear advantages: lower resource overhead, reliable metric delivery even under load, and a wider array of metrics than the management API provided. This industry-standard approach decoupled metric collection from the management API, eliminating a key architectural weakness.
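To show what this looks like in practice, here is a minimal sketch of a plain Prometheus scrape configuration for the plugin. The target hostname is a placeholder; by default the rabbitmq_prometheus plugin serves metrics at /metrics on port 15692.

```yaml
# Minimal Prometheus scrape config for the rabbitmq_prometheus plugin.
# The plugin exposes metrics at /metrics on port 15692 by default;
# the target hostname below is a placeholder.
scrape_configs:
  - job_name: rabbitmq
    static_configs:
      - targets: ["broker-01.example.com:15692"]
```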
We could have continued using the management API, but the introduction of Prometheus support presented an opportunity to leverage these advantages and address fundamental architectural issues in our own observability system. Since we needed to rebuild our Prometheus metrics system anyway, we decided to rethink the entire architecture.
Problems with the legacy system
Our centralized system scraped metrics from all servers and sent them to a LavinMQ broker. From there, various services processed the metrics for alarms, integrations, and storage in InfluxDB. This architecture had several limitations:
Scaling challenges
Scaling the system proved difficult. As our fleet grew, we constantly needed to scale up our scraping and processing services. The number of metrics per broker varies dramatically. A broker with many objects, such as queues and exchanges, produces far more metrics than one with fewer objects. This forced us to over-provision our services for peak loads, leaving significant idle resources most of the time.
We even had to limit metrics data to 2 MB and cap queue metrics at 6,000 per server because handling peaks would have cost too much on our infrastructure platform.
Latency issues
Our centralized architecture created unnecessary latency. Customer metrics traveled from their broker region to our central processing service and then back to their observability backend, often in the same region as their broker. This round-trip significantly delayed metric delivery.
Integration limitations
Adding support for new observability vendors required significant effort. We had to rebuild metrics for InfluxDB and each customer integration, making it difficult to expand our integration offerings.
Architecture requirements
The migration to Prometheus metrics presented an opportunity to redesign the system. We established clear requirements:
- Better resource distribution and optimization to handle metric fluctuations
- Reduced latency by keeping data closer to its source and destination
- A Prometheus-native architecture for easier extensibility and vendor independence
Our solution
We evaluated several approaches and settled on two key technologies: Grafana Mimir for storage and the OpenTelemetry Collector for data collection.
Grafana Mimir is a scalable time-series database designed for long-term Prometheus storage. It offered us a hosted solution we could quickly evaluate, with an open-source core that allows self-hosting if needed. Native PromQL support and Prometheus HTTP API compatibility made it a natural fit for our Prometheus-native approach.
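Because Mimir speaks the Prometheus HTTP API, Prometheus-compatible clients can query it as if it were an ordinary Prometheus server. As a rough sketch, a provisioned Grafana data source could point straight at it; the URL below is a placeholder, and the /prometheus read-path prefix reflects Mimir's typical default rather than our actual setup.

```yaml
# Illustrative Grafana data source provisioning file pointing at Mimir.
# URL and access details are placeholders, for illustration only.
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus          # Mimir is queried like a regular Prometheus server
    access: proxy
    url: https://mimir.example.com/prometheus
```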
For collection, we chose the OpenTelemetry Collector, an open-source agent that collects, processes, and exports observability data. We discovered that many observability vendors offer their own collector distributions, but these are typically limited to their own backends. Since we needed an agent that could export data to multiple destinations, we decided to build our own distribution.
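The OpenTelemetry project provides a builder tool (ocb) for assembling a distribution from exactly the components you need. The manifest below is a minimal sketch of what such a build definition can look like; the distribution name, output path, component choices, and version pins are illustrative, not our actual build.

```yaml
# Illustrative OpenTelemetry Collector Builder (ocb) manifest.
# Versions are placeholders; pin them to the collector release you target.
dist:
  name: custom-otelcol
  description: Custom OpenTelemetry Collector distribution
  output_path: ./dist

receivers:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.109.0
processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.109.0
exporters:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter v0.109.0
```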
The OpenTelemetry Collector architecture consists of receivers, which collect data from sources; processors, which filter, enrich, and transform that data; and exporters, which send it to backends. This modular design allowed us to support multiple observability platforms without rebuilding data pipelines for each one.
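As a concrete illustration of that pipeline model (the endpoints and names below are placeholders, not our production configuration), a collector config that scrapes a broker's Prometheus endpoint and fans the metrics out to Mimir and a customer's backend could look roughly like this:

```yaml
# Sketch of an OpenTelemetry Collector pipeline: one receiver, one processor,
# and two exporters feeding different backends. Endpoints are placeholders.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: rabbitmq
          static_configs:
            - targets: ["localhost:15692"]

processors:
  batch: {}                   # batch metrics before export to reduce request overhead

exporters:
  prometheusremotewrite/mimir:
    endpoint: https://mimir.example.com/api/v1/push
  prometheusremotewrite/customer:
    endpoint: https://customer-observability.example.com/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite/mimir, prometheusremotewrite/customer]
```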
Most importantly, both technologies embrace open standards. This eliminates vendor lock-in and makes it easy to switch tools if needed. Adding new integrations is straightforward since most observability platforms support Prometheus metrics.
Summary
Our legacy centralized monitoring system created scaling challenges, latency issues, and integration limitations that held us back. The introduction of Prometheus support in RabbitMQ gave us the opportunity to redesign our metrics architecture around open standards. We chose Grafana Mimir for storage and built a custom OpenTelemetry Collector distribution for data collection.
In Part 2 of this series, we cover how we extended this approach to our logging infrastructure. Part 3 dives into the full system architecture, dynamic configuration, and the results we achieved.