From Mnesia to Khepri, Part 2: How Khepri Affects You

In part 1 of this blog series, we gave an overview of Khepri and the motivation behind its creation. This part covers what to expect when, or if, you switch to Khepri. In the final part of this blog series, we will cover how to migrate from Mnesia to Khepri on CloudAMQP.

When running RabbitMQ >= 3.13.0 with Khepri disabled, you should see no change in the behaviour of RabbitMQ. With Khepri enabled, however, there are significant changes you need to be aware of:

  • Classic mirrored queues are not usable with Khepri – this is because classic mirrored queues heavily depend on Mnesia.
  • Because Khepri is Raft-based, when there is a network partition, only the nodes on the majority side can read from and write to the Khepri database and expect consistent results. Consequently, when using Khepri, a majority of nodes must be available for a RabbitMQ cluster to be fully functional.
  • Because the Khepri database will be used for all kinds of metadata, RabbitMQ nodes that can't write to the database (nodes on the minority side of a partition) will be unable to perform some operations, such as declaring or modifying queues, exchanges, users and virtual hosts, amongst others.
  • The network partition recovery strategies in RabbitMQ will no longer be required when Khepri is enabled.

Be sure to plan for the above caveats before migrating to Khepri. For example, for high availability, you must switch from classic mirrored queues to quorum queues or streams before migrating to Khepri.
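
If you want to experiment with Khepri on a test cluster before committing, it is turned on via a feature flag. Below is a minimal sketch, assuming rabbitmqctl is available and every node in the cluster runs RabbitMQ 3.13.0 or later – keep in mind that feature flags cannot be disabled once enabled:

# List feature flags and their current state; khepri_db starts out "disabled"
rabbitmqctl list_feature_flags

# Enable Khepri as the metadata store (migrates existing metadata; irreversible)
rabbitmqctl enable_feature_flag khepri_db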

Furthermore, we were curious to see whether there are significant performance differences between Mnesia and Khepri, so we ran some benchmarks. Specifically, we tested:

  • Throughput: To see whether switching to Khepri would impact the number of messages sent/received per second.
  • Queue Churn: To get a sense of how fast RabbitMQ can create/delete queues with Mnesia and then Khepri.
  • Bind Churn: To get a sense of how fast RabbitMQ can create queue bindings with Mnesia and then Khepri.

Benchmark methodology and tooling

Now, let’s look at the server configuration we used to run the benchmarks for throughput, queue churn and bind churn.

Test bed

For the throughput, queue churn and bind churn benchmarks, we used three AWS EC2 instances with EBS GP3 drives – precisely, one t4g.medium (2 vCPU, 4 GiB RAM) and two m6g.large (2 vCPU, 8 GiB RAM).

  • The t4g.medium EC2 instance is where we ran lavinmqperf, our load generator. Alternatively, you could use PerfTest.
  • The other two EC2 instances have RabbitMQ installed – one with Mnesia and the other with Khepri.
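
To double-check which metadata store a given node is using, you can inspect its feature flags; a minimal sketch, assuming shell access to each broker node:

# khepri_db reported as "enabled" means the node uses Khepri; "disabled" means Mnesia
rabbitmqctl list_feature_flags | grep khepri_db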

We ran the load generator and the RabbitMQ instances on separate machines simply to reduce the effect of resource sharing. The primary function of the load generator, lavinmqperf, is to spin up makeshift producers and consumers in the case of the throughput test. It is also responsible for creating queues and bindings in the queue churn and bind churn benchmarks.

Throughput results

In the throughput test, we had our load generator spin up 2 producers and 4 consumers, publish 4-kilobyte messages, and forward them to the broker running RabbitMQ with Mnesia. We then repeated the same scenario against the broker running RabbitMQ with Khepri.

The snippet below shows what the command looks like:

lavinmqperf throughput -z 120 -x 2 -y 4 -s 4000 --uri=amqp://dummy-rabbitmq-server-uri-with-khepri-or-mnesia

In the command above, -z 120 tells lavinmqperf to run the throughput test for 120 seconds, -x 2 and -y 4 spin up 2 producers and 4 consumers respectively, and -s 4000 sets the message size to 4000 bytes. The result of our experiment is summarised in the chart below.

In the chart above, we see Khepri having a slightly higher throughput than Mnesia.
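
For readers who prefer PerfTest, the RabbitMQ team's load generator mentioned earlier, a roughly equivalent throughput run would look like the sketch below. This is a sketch only: it assumes PerfTest is installed as perf-test, and flag spellings may differ between versions:

perf-test -x 2 -y 4 -s 4000 -z 120 --uri amqp://dummy-rabbitmq-server-uri-with-khepri-or-mnesia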

Queue and bind churn

As hinted earlier, the test bed for the queue and bind churn benchmarks is the same – lavinmqperf on the first machine, RabbitMQ with Mnesia on the second machine, and RabbitMQ with Khepri on the third machine.

Queue churn results

We ran the queue-churn benchmark with the command:

lavinmqperf queue-churn --uri=amqp://dummy-rabbitmq-server-uri-with-khepri-or-mnesia

Running the command above will typically return a result in the format below:

create/delete transient queue 130.13  (  7.68ms) (± 6.26%)  187B/op        fastest
create/delete durable queue   126.42  (  7.91ms) (± 6.52%)  327B/op   1.03× slower

As shown in the snippet above, lavinmqperf reports separate benchmarks for transient and durable queues. To give us a sense of what each value in the snippet above means:

  • 130.13 is the average number of queues created and deleted per second
  • 7.68ms is the average time to create/delete a queue
  • ± 6.26% is the variation in the measured times
  • 187B/op is the number of bytes allocated per operation
  • “fastest” / “slower” compares the result for transient queues with durable queues
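
For intuition, the churn this benchmark exercises boils down to a tight declare/delete loop against the broker. Below is a hypothetical sketch of the same idea using rabbitmqadmin – not what lavinmqperf does internally, but the same AMQP operations that hit the metadata store:

# Repeatedly declare and delete a durable queue
for i in $(seq 1 100); do
  rabbitmqadmin declare queue name=churn-test durable=true
  rabbitmqadmin delete queue name=churn-test
done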

We tested both Mnesia and Khepri using the queue-churn benchmark. The graph below shows the average number of queues created and deleted per second from our testing.

The graph above shows that Mnesia performs better than Khepri with transient queues, while Khepri outperforms Mnesia with durable queues.

Bind churn results

We also ran the bind-churn benchmark with the command:

lavinmqperf bind-churn --uri=amqp://dummy-rabbitmq-server-uri-with-khepri-or-mnesia

Running the command above will typically return a result in the format below:

bind non-durable queue 417.27  (  2.40ms) (± 1.99%)  192B/op        fastest
bind durable queue     416.19  (  2.40ms) (± 4.20%)  175B/op   1.00× slower

As shown in the snippet above, lavinmqperf reports separate benchmarks for non-durable and durable queues. To give us a sense of what each value in the snippet above means:

  • 417.27 is the average number of queue bindings created per second
  • 2.40ms is the average time taken to create a queue binding
  • ± 1.99% is the variation in the measured times
  • 192B/op is the number of bytes allocated per operation
  • “fastest” / “slower” compares the result for non-durable queues with durable queues
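
As with queue churn, here is a hypothetical rabbitmqadmin sketch of what binding churn amounts to – declaring a queue once, then creating many bindings against it:

# Declare a queue once, then repeatedly bind it to an exchange under new routing keys
rabbitmqadmin declare queue name=bind-test durable=true
for i in $(seq 1 100); do
  rabbitmqadmin declare binding source=amq.direct destination=bind-test routing_key=rk-$i
done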

We ran the bind-churn benchmark for both Mnesia and Khepri; the result of our testing is summarised in the graph below, which focuses on the average number of queue bindings created per second.

In the graph above, we see Mnesia outperforming Khepri significantly with non-durable queues, while showing results similar to Khepri's with durable queues.

In general, Mnesia performs better than Khepri with transient entities, while Khepri matches or outperforms Mnesia with durable ones, as seen in the queue churn and bind churn results.

Conclusion

Now that you know about Khepri and how it could affect you when you switch, you might be wondering how to migrate from Mnesia to Khepri on CloudAMQP. That will be covered in the next and final blog post in this series.

We’d be happy to hear from you! Please leave your suggestions, questions, or feedback in the comment section or contact us at contact@cloudamqp.com.
