From Mnesia to Khepri, Part 1: Khepri Overview and Motivation

Generally speaking, RabbitMQ persists two categories of data:

  • The server's metadata: queue and exchange definitions, bindings, users, vhosts, etc.
  • The actual messages

Since its inception, RabbitMQ has used the Mnesia DB for metadata persistence. However, this will soon change: Khepri will replace Mnesia in RabbitMQ 4.0 and is currently available in RabbitMQ 3.13.0 for testing.

This blog series explores the new metadata store coming to RabbitMQ. This part gives an overview of Khepri and the motivation behind its creation; the next part covers how Khepri will affect you, along with some benchmarks. To begin…

What is Khepri?

Khepri is a database designed for use in Erlang and Elixir environments. It stores data in a tree structure, where each piece of data lives in a tree node.

Each node consists of a node ID, an optional payload, and properties. Additionally, each node has a path (khepri_path:path()) leading to it. This path is usually a list of node IDs from the root of the tree to the target node. For example, the snippet below describes the path from the root node to an oak node:

Path = [stock, wood, <<"oak">>]
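
Khepri also understands a Unix-like string form for the same path. A hedged sketch (khepri_path:from_string/1 is our reading of the Khepri docs; treat it as an assumption):

%% Both forms denote the same path; components prefixed with `:` are
%% atoms, while unprefixed components are binaries:
Path = [stock, wood, <<"oak">>],
Path = khepri_path:from_string("/:stock/:wood/oak").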

Working with Khepri begins with initializing a Khepri store:

{ok, StoreId} = khepri:start("path/to/data_dir", my_store)

As shown in the snippet above, initializing a Khepri store returns a store ID. Here we passed a store ID of our choosing (the atom my_store) to the start() function, but this is optional. When omitted, Khepri uses the default store ID, the atom khepri.
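
For illustration, a minimal sketch of the no-argument form, which falls back to that default:

%% Starting Khepri without arguments uses a default data directory and
%% returns the default store ID, the atom khepri:
{ok, khepri} = khepri:start().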

Once you’ve initialized a Khepri store, you can then perform basic operations on that store like get, put, delete, etc.

ok = khepri:put(StoreId, [stock, wood, <<"oak">>], 150),
{ok, 150} = khepri:get(StoreId, [stock, wood, <<"oak">>]),
ok = khepri:delete(StoreId, [stock, wood, <<"oak">>]).
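
As a quick sanity check after the delete, a small sketch using khepri:exists/2 (assuming, per our reading of the Khepri docs, that it returns a plain boolean):

%% The oak node was deleted above, so it no longer exists in the tree:
false = khepri:exists(StoreId, [stock, wood, <<"oak">>]).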

Khepri aims to provide persistent and replicated data storage. To achieve that, it relies on Ra, an Erlang implementation of the Raft consensus algorithm. But what is Ra?

While Ra is a topic worthy of a blog post of its own, we will try to summarize its fundamentals in a few paragraphs.

Ra: An introduction

We are going to simplify things here.

By being a Raft implementation, Ra allows us to implement persistent and replicated state machines. In the context of Ra, a state machine is simply a process (a Khepri store or a quorum queue, for example) with some current state: the process can receive an input, like a command, apply it to its current state, and eventually produce a new state. Intentionally simplified, you can think of this state machine as a very simple function:

def state_machine(command, current_state):
    # apply the command to the current state to produce the new state
    return current_state + command
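
In Erlang terms, the same idea is just a fold of commands over a state; a minimal sketch of ours, not Ra's actual ra_machine behaviour:

%% Applying a command to the current state yields a new state. Replaying
%% the same log of commands always rebuilds the same state:
apply_command({add, N}, State) -> State + N;
apply_command({set, N}, _State) -> N.

%% 0 -> 5 -> 8 -> 10, deterministically:
%% 10 = lists:foldl(fun apply_command/2, 0, [{add, 5}, {add, 3}, {set, 10}]).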

Commonly, a Ra cluster consists of a state machine replicated across multiple nodes (though not always), with each replica persisting its state both in memory and on disk. In Ra parlance, all the replicas in a Ra cluster are called members.

All client applications write data to a “special” member called the leader – think of this leader as the entry point to the cluster. When the leader receives new data, it appends it to its log on disk and then broadcasts the data to all the other cluster members, the followers.

The followers also append the entries they receive from the leader to their logs and send an acknowledgement back to the leader once successful. If the leader receives this acknowledgement from a majority of the members, it increments its commit index and applies the entry, reflecting the new state in memory as well.
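
To make the majority rule concrete, here is a toy sketch of the leader's commit decision (our simplification, not Ra's actual code):

%% An entry is committed once a majority of members, leader included,
%% have appended it to their logs.
majority(ClusterSize) ->
    ClusterSize div 2 + 1.

maybe_advance_commit_index(Acks, ClusterSize, CommitIndex, EntryIndex) ->
    case Acks >= majority(ClusterSize) of
        true  -> EntryIndex;   %% committed: safe to apply to the state machine
        false -> CommitIndex   %% not yet replicated to a quorum
    end.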

In explaining how Ra works above, we have, in effect, also explained how Raft works at the most basic level. However, keep in mind that we've only touched on one component of Raft: log replication. Raft is much more nuanced than that, but understanding replication in Raft is enough to help us grasp how Khepri works – the basics, at least.

Ra and Khepri: The connection

Earlier, we mentioned that to achieve data persistence and replication, Khepri relies on Ra – but what does that even mean?

When you initialize a Khepri store with:

{ok, StoreId} = khepri:start("path/to/data_dir", my_store)

Under the hood, this sets up a Ra cluster in which the Khepri store serves as a Ra state machine. Interestingly, the Khepri store ID doubles as the Ra cluster name. However, it's important to note that initializing the Khepri store creates just one store, and consequently just one state machine, within the Ra cluster – yes, a Ra cluster with just one state machine is still called a cluster.
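
You can see that single-member cluster for yourself. A hedged sketch, assuming khepri_cluster:members/1 and its {ok, Members} return shape (our reading of the Khepri docs), where each member is a {StoreId, Node} Ra server ID:

%% A freshly started store is a Ra cluster with exactly one member,
%% and that member runs on the local node:
{ok, [{StoreId, Node}]} = khepri_cluster:members(StoreId),
Node = node().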

You can then have the Khepri store you initialized join a remote Ra cluster with:

ok = khepri_cluster:join(RemoteNode)

Or leave a remote cluster with:

ok = khepri_cluster:reset()

Once you have multiple Khepri stores in a Ra cluster, those stores rely on the underlying Ra machinery to elect a leader, which in turn replicates data to the followers.
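
Putting the pieces together, here is a sketch of forming a two-node cluster with the default store (the node names are hypothetical, and we assume the two Erlang nodes can already see each other):

%% Run on node b@host, while the default store is already running on
%% a@host:
{ok, khepri} = khepri:start(),
ok = khepri_cluster:join('a@host'),

%% Writes are now replicated through the Ra leader: a put on any
%% member...
ok = khepri:put(khepri, [stock, wood, <<"oak">>], 150),
%% ...can be read back from any other member.
{ok, 150} = khepri:get(khepri, [stock, wood, <<"oak">>]).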

Khepri, while independently usable in any project built with Erlang or Elixir, is notably designed with RabbitMQ's needs in mind, particularly in addressing the challenges Mnesia faces when recovering from network partitions.

Why Khepri: Mnesia and the network partition problem

Recall we stated that Khepri will replace Mnesia in RabbitMQ. But why replace something that already works?

In a cluster with multiple nodes, the Mnesia database is replicated across all nodes. These replicas exist in a peer-to-peer arrangement. As a result, in the event of a network partition, clients can keep writing data to individual nodes on both sides of the partition. When the partition eventually resolves, we end up with inconsistent data across the different nodes.

Mnesia's default behavior is not to attempt automatic conflict resolution after a 'partitioned network' event: it detects and reports the condition, but leaves it up to the user to resolve the problem. However, RabbitMQ itself offers some network partition handling strategies that serve as workarounds for Mnesia's limitation:

  • autoheal
  • pause_minority
  • pause_if_all_down

While the techniques above work, they are not very robust. Let’s consider a few things that could go wrong with these recovery strategies.

  • In pause_minority, the nodes in the minority pause while the nodes in the majority make progress. When a paused node rejoins, it throws away its whole Mnesia DB and copies the entire DB over from another node (not just the newly added records). In some scenarios, this can be very slow, cause memory/CPU spikes, and even overload that node.
  • At CloudAMQP, we've even seen customer scenarios where RabbitMQ does not trigger the configured partition handling strategy at all, even when there is a netsplit. Such scenarios usually require human intervention, like a node redeploy.

Okay, we get it: Mnesia is profoundly flawed in scenarios involving network partitions. But how will Khepri help?

How Khepri simplifies the issues around Mnesia

With Khepri, network partitions are handled predictably: only the side of the partition with at least ((number of nodes in cluster / 2) + 1) nodes, i.e. a majority, can make progress. Let's take the case of a 5-node cluster split into two partitions, one with 3 nodes and the other with 2 nodes; the majority threshold here is (5 / 2, rounded down) + 1 = 3 nodes.

When the leader is on the minority side of the partition

If the leader is on the minority side of the partition (the side with 2 nodes), then it cannot confirm any new writes, as it cannot replicate those writes to a majority; think of the quorum write requirement for all writes in a Raft-based system.

The nodes on the majority side (the side with 3 nodes) can elect a new leader, and service can continue. There are two leaders at this point, but only the leader on the majority side is functional. When the partition is resolved, the original leader will send out an AppendEntries message with a term that is now lower than that of the other nodes; their responses carry the higher term, and the original leader immediately steps down to follower.
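
That step-down rule can be sketched in a few lines (a toy model of ours, not Ra's implementation):

%% Raft's term rule: any message or response carrying a higher term than
%% our own forces us back to follower state. Here State is assumed to be
%% a map like #{term => T, role => leader | follower | candidate}.
handle_term(SeenTerm, #{term := Term} = State) when SeenTerm > Term ->
    State#{term := SeenTerm, role := follower};
handle_term(_SeenTerm, State) ->
    State.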

When the leader is on the majority side of the partition

If a leader is on the majority side of a partition, it can continue to accept writes. The follower(s) on the minority side will stop receiving the leader's heartbeat (AppendEntries) and will become candidates.

But these candidates cannot win a leader election because they cannot get a majority vote; think of the quorum vote requirement for leader elections in Raft-based systems. These candidates will continue to send out RequestVote messages until the partition is resolved, at which point they will see that there is already a leader and revert to being followers again.
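
The vote arithmetic is the same majority rule as for writes; a toy sketch (ours, not Ra's code):

%% A candidate wins only with votes from a majority of the full cluster,
%% counting its own vote.
wins_election(Votes, ClusterSize) ->
    Votes >= ClusterSize div 2 + 1.

%% In our 5-node example, a candidate on the 2-node side can gather at
%% most 2 votes, and wins_election(2, 5) =:= false.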

Essentially, when there is a network partition, Khepri behaves in a very predictable way: the nodes in the minority pause while the nodes in the majority make progress. After the partition has been resolved, the nodes in the minority “catch up”. This eliminates the need for the workaround network partition handling strategies that Mnesia required in RabbitMQ.

Conclusion

To take our understanding of Khepri a step further, in part 2 of this blog series we will explore what to expect if/when you switch to Khepri: Will there be performance differences? Are all the queue types compatible with Khepri? These are some of the questions we will answer in the next part.

We’d be happy to hear from you! Please leave your suggestions, questions, or feedback in the comment section or contact us at contact@cloudamqp.com.
