RabbitMQ Summit talk recap: Scaling RabbitMQ at Goldman Sachs - Jonathan Skrzypek

RabbitMQ Summit 2018 was a one-day conference that shed light on RabbitMQ from a number of angles. Among others, Jonathan Skrzypek talked about why and how Goldman Sachs adopted RabbitMQ.

Goldman Sachs leverages hundreds of applications communicating with each other. The Data Management and Distribution group provide messaging middleware services to the firm’s ecosystem. This talk will be about why and how we adopted RabbitMQ as a first class citizen in our messaging product portfolio. A significant proportion of application teams at Goldman Sachs was used to traditional guaranteed messaging systems, and as such, moving to RabbitMQ was and still is a paradigm shift in how applications interact with a messaging layer. We will touch on the challenges of delivering RabbitMQ as a service at enterprise scale, including but not limited to deployment model, monitoring and telemetry, achieving data consistency, developer awareness.

Slides from this lecture

Scaling RabbitMQ at Goldman Sachs

I’m very happy to be here. Thanks for having me. I'm Jonathan, I work in the technology division at Goldman Sachs.

What I'm going to talk about during this presentation is our journey with RabbitMQ within Goldman Sachs as a firm: how we initially picked up RabbitMQ as a product, and why we did so. How did we build it into our offering? How do developers interact with RabbitMQ as a product within the firm? How do we deploy RabbitMQ, and how do we manage it?

Just to give a bit of context: I run the messaging engineering team at Goldman Sachs. I know that not many companies have a dedicated messaging team. At a lot of companies, it's the application teams that run their own broker or their own solution, and they fine-tune it to their use case… and they are fine, they're happy with it.

But at Goldman Sachs we approach things at a scale that is big enough to justify having a team of dedicated messaging engineers. What we do is provide messaging infrastructure services for applications across the firm. We engineer a number of different messaging products and we serve them as a platform; we make them easy to use for application developers - at least, as easy as we can.

What is Messaging Engineering?

So as I said, we run products as a sort of platform. Over the years we've built a portfolio of products to accommodate various use cases. Across the bank we provide services to our various divisions: it could be trading divisions, risk divisions, banking divisions… and they all have very, very different use cases, different kinds of payloads and different requirements.

We can get really high-volume workloads with really little value attached to each message (high volume, low value) or, at the extreme opposite, very low volume but very high value attached to each message.

What if I told you we have messages going through infrastructure that are worth a couple million dollars? You probably don't want to lose that information.

And because there's no one size fits all, no one product… we built a portfolio of those and we accommodated a myriad of use cases from log shipping to telemetry metrics - all the way to trade booking or payments; excel payments… regulatory flows as well - because we're a regulated institution, so we need to talk with our regulators.

How Goldman Sachs found RabbitMQ - A bit of history

To explain how we picked RabbitMQ, we'll go back in time a little bit and go through the history of what we used to have. Well, what we used to provide as a service, and what we still do in some cases.

Back in 2013 (five years ago) we were really leveraging two big families of products. The first was reliable multicast messaging - that's your fire-and-forget fire hose. If you've dealt with any messaging engineering in the past, you will know that we don't use “reliable” the way everyone else does. Reliable means: I trust that the network is reliable. The network should do its job, and I'm not really going to try and put any guarantees around this.

And this is really a use case where it's truly fire-and-forget: you want to push as much data as fast as possible with a high degree of fan-out - one publisher and many, many subscribers - and you don't necessarily care if you're losing a couple of messages here and there. You just want to know that you've got the latest data as fast as possible. It's great for streaming prices and, you know, live ticking data… like stocks, for example.

At the other end of the spectrum, you had the good old JMS brokers for guaranteed messaging. This is your rock-solid messaging where you, most of the time, rely on XA transactions with two-phase commits.

Sometimes, you know, you have - let's say - a database upstream and a messaging broker downstream, or the other way around… and you want to do a transaction where you know that it's all or nothing. I want either everything to happen or nothing at all. Most of the time we would use this underpinned by synchronous storage replication... So, that's your block storage arrays and your synchronous replication across data centers. That is how we know for sure, 100 percent, that the data will be available in two different data centers, for business continuity purposes.

We were thinking… we have the two big things, but it's a bit like going to the ice cream shop and you only have two flavors. Vanilla and chocolate are fine, but you know, could we have something in between?

“Why do we need something new?” - How we found the “thing” in between

Why do we need something new? Well, not everyone needs guaranteed messaging. For the longest time JMS brokers were the only show in town if you wanted to have a queue or a topic and you wanted queue semantics, and a lot of the use cases that we were seeing on JMS brokers didn't really need that. Hardware-based guaranteed messaging is costly and not flexible. It relies, most of the time, on bare-metal hosts, costly block storage arrays, costly replication… It is fairly slow. And lastly, we also wanted to dive into open protocols, to get into a situation where we rely much more on a protocol and an API rather than on a specific implementation of a broker.

The steps that led us there… Resilient messaging

As I said, two big families: reliable delivery, guaranteed delivery… But we wanted to have a bit extra. We wanted something that was sitting in the middle. We internally called it resilient messaging. That's your good-for-everyday messaging… your good-enough messaging; or, as I think RabbitMQ used as a mantra a couple of years ago: messaging that just works.

The answers that we found - Why RabbitMQ?

We came across a couple of products. We evaluated a couple of products and we settled on RabbitMQ. Why?

The most important thing initially was to have something highly available. Our JMS broker solution was just a single broker. Should the broker fail, we would do a storage failover, and it would take a while for our secondary broker to come back up. So, it's not really highly available. We wanted a clustered solution where losing a node should really be a non-event. We also wanted more messaging patterns. What if I want to have something more than a simple one-to-one queue semantic? Maybe I want to have a last-value cache. Maybe I want to do some kind of routing. There's been a lot of discussion today around how flexible RabbitMQ can be, and this is what we really wanted to benefit from.

We wanted developers to be able to use our infrastructure with any fancy, hyped programming language, so we wanted to be able to leverage multiple language bindings. And last, but not least: when you run things at scale, something RabbitMQ is really good at is being easy to observe and easy to manage, so you know what's happening. There are a lot of APIs exposed, either to collect metrics or to configure the brokers.

We knew we wanted to use RabbitMQ - but how? Building an offering

There are many flavors of using RabbitMQ, and this is how we got into the hard part in terms of modeling and usability. So, how do you build an offering? How do you build something that developers want to use? Because again, RabbitMQ is an open source product, and nothing prevents our developers from just going and downloading RabbitMQ and running their own brokers and clusters. And sometimes they still do. Well, they usually come crawling back to us afterwards. But you want developers to use something because they like it and because, you know, it's easy and it's nicely modeled.

No one-size-fits-all. But...

Sadly, most of the time, there is no one size fits all. You can do a lot of different things with RabbitMQ. I wish we had a solution that fits every single use case perfectly, but that doesn't exist.

You can still try to have a template that you can leverage. Okay, it's not going to fit every single person, but what if I can get to the 80 percent use case? What if I can get 80 percent coverage, where most of the time we can deploy a broker or a cluster and developers will just be happy and everything is going to work for them?

The result - a deployment model

This is how we progressively got to a standard deployment model. That's the way we deploy RabbitMQ by default at Goldman Sachs. I'm not saying that this is the way you should deploy it… it may work, it may not work for you. We run three-node clusters. We front them with DNS round-robin.

So, when clients resolve the DNS alias they've been given for a cluster, it resolves to one node in round-robin fashion. They can connect to any node. We usually don't co-locate applications in clusters; it's one cluster per application. If applications want to co-locate, that's usually because they are part of the same area or the same business, or because they are bound by some kind of business constraint.

It's not that RabbitMQ can't scale; most of the time, for us, it's a business constraint. And we let developers make that choice. We do not want to dictate that, say, this area should always be using the same cluster. We leverage AMQP 0-9-1. I'll explain why later.

And our bias is consistency. As we've seen today, there are a lot of ways you can use RabbitMQ. But from our standpoint, we've got a strong bias towards consistency across nodes. So, we run partition handling as pause_minority, so that the messages in the cluster don't diverge across nodes.

A node that gets isolated is going to pause itself. The clients are then going to fail over to nodes that are still serving requests. We use mirrored queues by default, with automatic synchronization by default. And here people normally ask: by default? Isn't that bad? It's slow, etc.

Well, yes, it is slow. But from our standpoint, I would much, much rather have developers come to our team and say: “Hey, what you have is great, but it's a little slow.” Then we have a discussion about relaxing the constraints, and we can fine-tune and make things better.

I would much rather have this situation than one where we relaxed the constraints by default and an application team comes to us and says: “We have actually lost data”. That's a much tougher sell.
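As a rough illustration of the "mirrored queues with automatic synchronization" default, the sketch below builds the request that the RabbitMQ management HTTP API expects for a policy. The policy name `ha-default` and the queue pattern are hypothetical, and nothing is sent over the network. Partition handling itself (pause_minority) is a broker configuration setting rather than a policy, so it is not shown here.

```python
import json
from urllib.parse import quote


def mirrored_queue_policy(vhost, name, pattern):
    """Return (path, body) for a management-API PUT that mirrors matching queues."""
    # Virtual host and policy names are URL-encoded; the default vhost "/" becomes %2F.
    path = "/api/policies/{}/{}".format(quote(vhost, safe=""), quote(name, safe=""))
    body = {
        "pattern": pattern,       # regex matched against queue names
        "apply-to": "queues",
        "priority": 0,
        "definition": {
            "ha-mode": "all",              # mirror to every node in the cluster
            "ha-sync-mode": "automatic",   # new mirrors synchronise without operator action
        },
    }
    return path, body


# Hypothetical policy name and pattern, for illustration only.
path, body = mirrored_queue_policy("/", "ha-default", "^orders\\.")
print("PUT", path)
print(json.dumps(body, indent=2))
```

Relaxing the default for a specific application would then be a matter of a second, higher-priority policy with a narrower pattern, which matches the "start strict, relax after a discussion" approach described above.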

Flexibility - Programmatically define your topology

Let's talk about usability a little bit. We deploy three-node clusters. But really, compared to the other products that we're leveraging, we wanted this offering to be something really flexible, something where developers would be in charge of what they were using. Using AMQP 0-9-1 and most of the bindings, you can programmatically define your topology. That's really powerful, because it means that the developer can, over time, adjust the topology - the queues and exchanges - to fit the business requirements.
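A minimal sketch of what "programmatically define your topology" can look like, assuming a channel object with the `exchange_declare`, `queue_declare` and `queue_bind` methods that the pika client exposes. The exchange, queue and host names are made up for illustration.

```python
# Hypothetical topology: one topic exchange, two queues, two bindings.
TOPOLOGY = {
    "exchanges": [("logs", "topic")],
    "queues": ["logs.errors", "logs.all"],
    "bindings": [
        # (queue, exchange, binding key)
        ("logs.errors", "logs", "errors.*"),
        ("logs.all", "logs", "#"),
    ],
}


def declare_topology(channel, topology):
    """Declare exchanges, then queues, then bindings. Safe to re-run:
    AMQP declarations are idempotent when the arguments are unchanged."""
    for name, kind in topology["exchanges"]:
        channel.exchange_declare(exchange=name, exchange_type=kind, durable=True)
    for queue in topology["queues"]:
        channel.queue_declare(queue=queue, durable=True)
    for queue, exchange, key in topology["bindings"]:
        channel.queue_bind(queue=queue, exchange=exchange, routing_key=key)


# Usage with pika (requires a reachable broker; the host name is an assumption):
# import pika
# connection = pika.BlockingConnection(pika.ConnectionParameters("rabbit.example.internal"))
# declare_topology(connection.channel(), TOPOLOGY)
```

Keeping the topology as data and declaring it at application startup is one way to make sure it can be rebuilt on a blank cluster.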

I can recommend the RabbitMQ visualizer (https://jmcle.github.io/rabbitmq-visualizer/). It's a really good website where you can design your topology with routing keys, binding keys, exchange bindings, etc. Then you can simulate where the payload would go, and it's really useful to see how data would flow without actually having to write any code. You can literally design your own topology beforehand.

Topology ownership

The way we deploy clusters, those are really an empty shell. We give you a cluster; we give you a DNS alias, maybe some permissions associated with it. I'll come back to that later.

And then as a developer owning an application, you're really in charge. It's up to you to maintain your topology. So, it gives you a lot of flexibility, but on the other side of the coin, it also means that as infrastructure engineers, we expect developers and applications to be able to reconstruct the topology from scratch. Say any of you in the room wanted to use one of our internal clusters, you started using it and configured it well, and then I decided to replace your cluster completely with one of the same kind of configuration in terms of permissions, number of nodes, etc. Granted, you wouldn't be happy about losing data.

But if I gave you a blank cluster, you should be able to reconstruct the topology from scratch, everything: exchanges, bindings, queues, etc. That's great because it gives you reproducibility; you know that this is how the application is designed.

This is how producers and consumers are supposed to interact. And this forces you to define your contract with the world. If you go around and go into whatever interface and create queues and exchanges manually, then this is just point and click, and you're not really spending time thinking and designing it and sort of engraving in stone, for lack of a better expression, how your applications should interact.

How do we do that? - Namespacing

We have spent a lot of time and had a lot of discussions on namespacing and on how we model a topology for applications. We introduced a concept called a message domain, and there are a couple of things to it: a message domain is tied to a virtual host, and it comes with a prefix for queues and exchanges. If I give you a prefix, [a.*], you'll be able to create any resource (any AMQP resource) in that domain. It also carries the mirroring and queue synchronization mode. Those are enabled by default, with automatic synchronization by default, but we can change it after the fact.

For each domain there's a list of entitled users, and those entitled users are modeled and backed by an inventory, so we know that application A or application B, from equities trading or risk management, actually owns that topology and is entitled to it.

The domain access is read / write / configure… Nothing new here. Those are the RabbitMQ permissions, and we leverage regular expressions. Let's say I've got a log processing application and I'm publishing messages with the routing keys [warnings.*] and [errors.*]. Maybe I've got specific consumers that are specialized just for warnings or just for errors…

But maybe I've got a consuming application that wants to read all of that. It will then have access to two domains: the warnings domain and the errors domain. It might just have read and configure rights to them. You still need configure, because you need to bind a queue to the exchange. And then the underlying RabbitMQ permission would be [warnings.*], a pipe for the logical OR operator, then [errors.*]. So, we can actually chain domains like that.
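The chaining of domains can be sketched as follows. This is an illustration only: the helper names are hypothetical, and Python's `re` module stands in for the regular expression engine that RabbitMQ itself uses to match resource names. The `configure` / `write` / `read` keys are the three RabbitMQ permission fields mentioned above.

```python
import re


def domain_regex(prefixes):
    """Combine domain prefixes into one anchored alternation joined with '|'."""
    alternatives = "|".join(re.escape(p) + r"\..*" for p in prefixes)
    return "^(" + alternatives + ")$"


def permissions_for(prefixes, can_write):
    """Build a permission triple for a user entitled to the given domains."""
    pattern = domain_regex(prefixes)
    return {
        "configure": pattern,                    # needed to bind a queue to the exchange
        "write": pattern if can_write else "^$",  # "^$" matches only the empty name
        "read": pattern,
    }


# A consumer that reads both the warnings and the errors domain.
perms = permissions_for(["warnings", "errors"], can_write=False)
print(perms["read"])
for resource in ("warnings.disk", "errors.network", "audit.login"):
    print(resource, bool(re.match(perms["read"], resource)))
```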

Developer Awareness

I went through a rough deployment model, and how we model topologies.

Adopting RabbitMQ at Goldman Sachs was quite a drastic change, because first, it was the first open source messaging broker solution that we were leveraging overall. But it was also a change of paradigm in how applications interact with a messaging layer.

Remember, a lot of applications were used to super-fast fire-and-forget, where I can send loads of data and the only thing that matters is how fast my network card can be… or to traditional guaranteed messaging, where there was not so much to know about. It was just: send a message, do a transaction, wait for the blocking call to return… And that's it. Whereas when you use RabbitMQ, there are a lot of other things that you need to be aware of.

This is why we needed to do a lot of discussing, consulting and handholding with application teams, so that they could understand how they were supposed to talk to RabbitMQ and how to use it.

Overall, I think the takeaway is that you have to write smarter applications. You need to listen to what the broker has to tell you; you need to understand the feedback you're getting from the cluster. Duplicates can happen, you know… so you need idempotent consumers, you need to have some kind of sequencing IDs, or you can have a look at some of the flags on the message. In today's world that might sound like a given: delivery is at-least-once, and you're expected to handle those duplicates.

But at the time - I still vividly remember my first few discussions with application teams - they would literally go: “What do you mean, I can get the same message two times? Like, I have to do something in my callback?” Yes, you do. And progressively applications learned that, some of them the hard way… Overall, it was no longer one blocking call and you're done. No more!
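One common way to make a consumer tolerate duplicates is to remember recently seen message identifiers and skip repeats. The sketch below assumes that publishers set a unique identifier on every message (for example in the AMQP `message_id` property); the class and function names are hypothetical, and the bounded in-memory window is a simplification.

```python
from collections import OrderedDict


class Deduplicator:
    """Remember the most recent message ids, bounded by capacity."""

    def __init__(self, capacity=10000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def first_time(self, message_id):
        """Return True the first time an id is seen, False for a duplicate."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest id
        return True


def handle(dedup, message_id, body, process):
    """Run process(body) once per message id; return whether it ran."""
    if dedup.first_time(message_id):
        process(body)
        return True
    return False


processed = []
dedup = Deduplicator(capacity=3)
for msg_id, body in [("m1", "a"), ("m2", "b"), ("m1", "a")]:
    handle(dedup, msg_id, body, processed.append)
print(processed)
```

A duplicate that arrives after its id has been evicted would be processed again, so the window size has to be chosen against how late a redelivery can realistically arrive.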

I used to call this blocking call, and when the call returned, I knew my data was in two different data centers, I knew my data was safe, etc. Now you need to embrace the asynchronous nature of the design, be it on the consumer side, where you can acknowledge asynchronously and use client acks every, I don't know, 100 messages, or 500 messages, or whatever you like.
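Acknowledging every N messages can be sketched with a small helper that relies on the `multiple` flag of `basic_ack`, which acknowledges every delivery up to and including the given tag. The helper class is hypothetical; the `basic_ack(delivery_tag, multiple)` call shape follows the pika client.

```python
class BatchAcker:
    """Acknowledge deliveries in batches instead of one by one."""

    def __init__(self, channel, batch_size=100):
        self._channel = channel
        self._batch_size = batch_size
        self._pending = 0
        self._last_tag = None

    def processed(self, delivery_tag):
        """Call after a message has been fully handled."""
        self._pending += 1
        self._last_tag = delivery_tag
        if self._pending >= self._batch_size:
            self.flush()

    def flush(self):
        """Ack everything handled so far, e.g. on a timer or before shutdown."""
        if self._pending:
            self._channel.basic_ack(delivery_tag=self._last_tag, multiple=True)
            self._pending = 0
```

The trade-off is that up to one batch of messages is redelivered if the consumer dies before flushing, which is another reason for consumers to be idempotent.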

On the publishing side, the producer's side, you need to use publisher confirms if you care about your data. Again, you need to listen to what the cluster or the broker has to tell you. Because if you are just blindly firing and publishing data, you have no idea how far behind the mirrored queues are, for example.

Publishers need to keep track

“The client does not perform any internal buffering of such outgoing messages. It is an application developer’s responsibility to keep track of such messages and re-publish them.”

Here I'm just quoting the rabbitmq.com documentation. Publishers need to keep track. You can see the shift here. We are actually pushing the complexity back onto the applications a little bit, and you can picture what it means for a developer who has been using the older systems for a number of years; suddenly you tell them: well, you're actually going to have to write a little bit more code. But, you know, it actually works out really well.
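What "keeping track" can mean in practice is sketched below: the publisher remembers every message until the broker confirms it, and gets back the ones it has to re-publish on a negative acknowledgement. The class is hypothetical and independent of any client library. It assumes the publish sequence numbers start at 1 on a channel in confirm mode and that the broker may confirm several messages at once with a `multiple` flag.

```python
class ConfirmTracker:
    """Track published-but-unconfirmed messages by sequence number."""

    def __init__(self):
        self._unconfirmed = {}
        self._next_seq = 1

    def published(self, body):
        """Record a message right before it is handed to the client library."""
        seq = self._next_seq
        self._unconfirmed[seq] = body
        self._next_seq += 1
        return seq

    def _take(self, seq, multiple):
        # With multiple=True the confirm covers every sequence number <= seq.
        keys = [k for k in self._unconfirmed if k <= seq] if multiple else [seq]
        return [self._unconfirmed.pop(k) for k in sorted(keys) if k in self._unconfirmed]

    def on_ack(self, seq, multiple=False):
        """Broker took responsibility: forget the message(s)."""
        self._take(seq, multiple)

    def on_nack(self, seq, multiple=False):
        """Broker refused: return the message(s) that must be re-published."""
        return self._take(seq, multiple)

    def outstanding(self):
        """Messages to re-publish if the connection is lost right now."""
        return [self._unconfirmed[k] for k in sorted(self._unconfirmed)]


tracker = ConfirmTracker()
for body in ("a", "b", "c"):
    tracker.published(body)
tracker.on_ack(2, multiple=True)
print(tracker.outstanding())
```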

Implement confirm, return, shutdown and recovery listeners

Another one on developer awareness, still on the theme of listening to the broker. I've stopped counting how many times we've had teams coming to us and saying: we've lost data. Well, no, you haven't lost any data. You just sent a message and the broker didn't know what to do with it. It couldn't route it to any queue, so it just dropped it on the floor. The broker didn't actually lose data. From a publisher's standpoint, yes, it might seem that you lost data, but the worst part is that if you don't implement the return listener, you have no idea. Your messages are just going to /dev/null silently. Then there are shutdown listeners and recovery listeners, especially in the case of a cluster where nodes come and go and failures happen. You need to implement those if you want to have any kind of feedback.
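A return listener can be as small as the sketch below. It assumes the callback shape used by the pika client, `callback(channel, method, properties, body)`, where `method` carries the `exchange`, `routing_key` and `reply_text` of the returned message. The collector class and logger name are hypothetical.

```python
import logging

log = logging.getLogger("publisher")


class ReturnCollector:
    """Record messages the broker returned because no queue was bound for them."""

    def __init__(self):
        self.returned = []

    def __call__(self, channel, method, properties, body):
        log.warning(
            "unroutable message: exchange=%s routing_key=%s reason=%s",
            method.exchange, method.routing_key, method.reply_text,
        )
        self.returned.append((method.exchange, method.routing_key, body))


# Usage with pika (requires a broker). Messages are only returned when they
# are published with mandatory=True:
# collector = ReturnCollector()
# channel.add_on_return_callback(collector)
# channel.basic_publish(exchange="logs", routing_key="errors.disk",
#                       body=b"...", mandatory=True)
```

Without the mandatory flag and a registered callback, an unroutable message is dropped silently, which is the "we lost data" situation described above.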

Deploy and Manage

I'm going to go through the deployment and management of the Rabbit estate at Goldman Sachs. We quickly reached a scale where we could no longer install brokers and clusters manually, and we could no longer fiddle with configurations and fine-tuning manually. We actually had two generations of provisioning. The first one was pretty crude.

We had a really simple configuration system, and then after maybe 20 or 25 clusters, this is when we decided: okay, we have to do something now, because RabbitMQ is picking up steam internally at Goldman Sachs and a lot of developers want to use it… and we were not scaled as a messaging engineering team to deploy all of those… and you don't want to be the bottleneck.

If people want to use something because they find it cool, you don't want to be in the middle and say: oh sorry, you're going to have to wait a week for a cluster.

Automated provisioning

We've automated it with a provisioning system; we call it a resource manager. This is the way we deploy pretty much every messaging product at Goldman Sachs, RabbitMQ or anything else. We rely on something called GSAppEngine, an internally developed software stack that allows us to efficiently deploy processes on hosts. It's pretty simple, actually. You just have to fill in a JSON descriptor where you define: okay, I want to run this package on those hosts. And then GSAppEngine will do everything for you.

On the left you can see a farm of hosts. We run virtual machines, we have hundreds of them, and we run them in multiple regions and multiple data centers. We spread them across different hypervisors. And then we run one or multiple brokers per box. Our default is usually 20 to 25 brokers per host.

Sometimes things get a little busy, and then we shuffle brokers around. There is a whole placement engine associated with that. So what happens is, let's say it's the first time that we are deploying a broker on a host that we just provisioned: the host is going to start up, and it runs a GSAppEngine host agent that will query GSAppEngine and say: okay, what am I supposed to run? And GSAppEngine will dictate: okay, you're supposed to run this!

“This”, in our case, is not a Rabbit broker right away. It's a bunch of wrappers that will then query our resource manager. The wrapper understands: looks like I'm supposed to run a RabbitMQ broker; looks as though this broker is supposed to join a cluster.

So, it's going to query our inventory, which is backed by a database, and the inventory is going to reply with a whole lot of configuration and a lot of different tunables. And this is how it will locally generate the rabbitmq.config file, with all the settings that we're supposed to run.
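Generating the broker configuration from an inventory record could look roughly like this. The shape of the inventory record and the node names are assumptions made for the example. The keys written to the file (`cluster_partition_handling`, `vm_memory_high_watermark`, `cluster_nodes`) are standard settings of the classic Erlang-term `rabbitmq.config` format.

```python
def render_rabbitmq_config(settings):
    """Render a classic Erlang-term rabbitmq.config from an inventory record."""
    # Erlang atoms containing '@' must be quoted.
    nodes = ", ".join("'{}'".format(n) for n in settings["cluster_nodes"])
    lines = [
        "[",
        "  {rabbit, [",
        "    {cluster_partition_handling, %s}," % settings["partition_handling"],
        "    {vm_memory_high_watermark, %s}," % settings["memory_watermark"],
        "    {cluster_nodes, {[%s], disc}}" % nodes,
        "  ]}",
        "].",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical inventory record for a three-node cluster.
inventory_record = {
    "partition_handling": "pause_minority",
    "memory_watermark": 0.4,
    "cluster_nodes": ["rabbit@host-a", "rabbit@host-b", "rabbit@host-c"],
}
print(render_rabbitmq_config(inventory_record))
```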

And then the broker is going to join the cluster. Sometimes it can get a little bit racy with clustering and node discovery… something we are looking at improving with the new features in Rabbit 3.7. In terms of placement, we were also pretty careful.

If you ask for a three-node cluster in a given data center, we make sure that the brokers for that cluster are not hosted on VMs that belong to the same hypervisor… because if you lose an underlying hypervisor and with it the majority of the cluster's nodes, you lose availability, which completely defeats the purpose of clustering overall… a bit of a shame.
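The hypervisor constraint is a simple anti-affinity rule. A minimal sketch, with made-up host data and a "least loaded first" preference that is an assumption rather than a description of the real placement engine:

```python
def place_brokers(hosts, count):
    """Pick `count` hosts for one cluster, never two on the same hypervisor.

    hosts: list of (host_name, hypervisor, brokers_already_running).
    Prefers the least loaded hosts; raises if the constraint cannot be met.
    """
    chosen, used_hypervisors = [], set()
    for host, hypervisor, _load in sorted(hosts, key=lambda h: (h[2], h[0])):
        if hypervisor in used_hypervisors:
            continue  # would put two nodes of the cluster behind one failure
        chosen.append(host)
        used_hypervisors.add(hypervisor)
        if len(chosen) == count:
            return chosen
    raise ValueError("not enough distinct hypervisors for %d nodes" % count)


hosts = [
    ("vm1", "hvA", 3),
    ("vm2", "hvA", 1),
    ("vm3", "hvB", 2),
    ("vm4", "hvC", 5),
]
print(place_brokers(hosts, 3))
```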

The automated provisioning system is exposed through a number of APIs. We also have a UI, and this is what it looks like. That's a screenshot from our internal UI… so we can point and click to create a Rabbit cluster. We can create what we call the connectivity: that's creating a domain (the message domain I explained previously) and the users that go with it. We can re-apply RabbitMQ configuration from the inventory to the cluster. We can change any kind of RabbitMQ or Erlang runtime parameter.

We can upgrade clusters; we can delete the clusters and so on.

Configuration management

Configuration management… this is where you get into the nitty-gritty details of everything. What does it mean to operate multiple clusters in an automated fashion? Well, arguably you have a central inventory, and you're supposed to know exactly what it is that you're running, where and how. Now, creating an inventory document the first time is fairly easy. The first time you deploy a cluster, you know what's there.

But then, people going in and fiddling manually with the configuration never happens, right? Well, it does… and so what you should really aim for when you're operating things at scale is: how do you minimize deviations? You need to make sure that what you have in the inventory is actually what's deployed, because otherwise people are going to make assumptions. And this is where things get really bad.

What we do is have continuous checks on inventory versus actual consistency. Before we make any change - let's say you want to add a user, or add a node, or we want to upgrade the cluster - before any order is fulfilled, before we even start touching the cluster, we use the RabbitMQ API to pull down the entire configuration and we compare it to the inventory.

If there's anything different, then the request will fail. It doesn't fix it for you, but it nicely tells you that something is wrong. And in the later versions we even have a list of what's wrong. It will tell you: there are users “a”, “b” and “c” in the inventory, but there are only users “a” and “b” in the cluster. You should probably fix that.

Then you go and either manually use the RabbitMQ API to fix it and add stuff on the cluster, or maybe it's stuff that is missing from the inventory, and then we go and patch the inventory.
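The "compare before you touch anything" check can be sketched as a plain set difference. The data shapes here are hypothetical: in practice the "actual" side would be built from what the management API reports (for example its definitions export), and the inventory side from the database.

```python
def diff_inventory(inventory, actual):
    """Return, per resource kind, what differs between inventory and cluster."""
    report = {}
    for kind in sorted(set(inventory) | set(actual)):
        expected = set(inventory.get(kind, ()))
        found = set(actual.get(kind, ()))
        if expected != found:
            report[kind] = {
                "missing_on_cluster": sorted(expected - found),
                "missing_in_inventory": sorted(found - expected),
            }
    return report


def check_before_change(inventory, actual):
    """Refuse to proceed when the cluster has drifted; do not try to fix it."""
    report = diff_inventory(inventory, actual)
    if report:
        raise RuntimeError("inventory drift, refusing change: %r" % report)


inventory = {"users": ["a", "b", "c"], "vhosts": ["/"]}
actual = {"users": ["a", "b"], "vhosts": ["/"]}
print(diff_inventory(inventory, actual))
```

Failing the request instead of auto-correcting keeps a human in the loop to decide which side, cluster or inventory, is the one that is wrong.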

Telemetry is your friend.

I cannot insist enough on that. Don't fly blind. You want to know what's happening. Because, as I said, we can't babysit every single broker, every single cluster. So, we collect metrics from the management API. Statistics are pretty neat in RabbitMQ… we collect them every 30 seconds, I think.

The management interface is really great out of the box. I think it's something a lot of developers use, and it's one of the reasons we really liked RabbitMQ in the first place… but the management interface can only store data for so long. We therefore push this into an external time series offering, and we have automated dashboard generation, which I can show you now.
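As an illustration of the collection step, the function below flattens the JSON returned by the management API's `/api/queues` endpoint into rows that could be pushed to a time series store. The field names used (`messages_ready`, `messages_unacknowledged`, `message_stats.publish_details.rate`) are part of that API's queue objects; the sample payload and the row layout are made up, and the HTTP call and the 30-second scheduling are left out.

```python
import json


def queue_metrics(api_queues_json):
    """Turn an /api/queues response body into flat metric rows."""
    rows = []
    for q in json.loads(api_queues_json):
        # message_stats is absent for queues that have seen no traffic yet.
        stats = q.get("message_stats", {})
        rows.append({
            "vhost": q["vhost"],
            "queue": q["name"],
            "ready": q.get("messages_ready", 0),
            "unacked": q.get("messages_unacknowledged", 0),
            "publish_rate": stats.get("publish_details", {}).get("rate", 0.0),
        })
    return rows


# Made-up sample response with one busy queue and one idle queue.
sample = json.dumps([
    {"vhost": "/", "name": "orders", "messages_ready": 15,
     "messages_unacknowledged": 2,
     "message_stats": {"publish_details": {"rate": 12.5}}},
    {"vhost": "/", "name": "idle"},
])
for row in queue_metrics(sample):
    print(row)
```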

[Screenshot?]

This is what the dashboard for a big RabbitMQ cluster looks like at Goldman Sachs.

You have system metrics for all the underlying hosts, cluster metrics, broker metrics, queues, and cluster links as well, to see where the data is flowing… this is all backed by the inventory. So, if we were to add a node to a cluster or do any other kind of change, this would be reflected automatically… you just have to refresh the dashboard.

Queue metrics are really useful for developers, because then they can dynamically have a look at the trend over the last couple of days. And most importantly, a lot of those metrics, we as a messaging engineering team don't necessarily look at too much, because, you know, what does it mean that consumer utilization on application B, consumer one, is 80 percent instead of 20 percent, or the other way around? Well, we're not the best placed to answer that question. And so this is really meant as a self-service.

Application developers and the application operations teams know where to go, and when they have an issue they can look at the dashboard and see what's happening. For any metric that is collected, they can also, in a self-service manner, set up any kind of alerting. Take pending messages in a queue: maybe 15 messages in this queue is bad, maybe it's not so bad. I wouldn't know.

So, I don't want to be in the loop of putting this together. They just have to set up their own rules on consumer utilization, queues, pending messages, rates, etc. This is all on their side. They can do whatever they want with it. That's the good part of it.

The bad part of it is that when they come to us, it usually means that something is wrong. Actually, it does a lot of filtering for us, so that's convenient, but when they do reach out to us, this is how we know that we're in trouble, because we know that we really have to troubleshoot something that might be tricky.

When things go wrong

When things go wrong, I would like to know. I wish I could say that everything is all rainbows and lollipops and everyone is happy, etc. But it isn't. First one: single-queue workloads. That's the typical use case where an application asks for a cluster. They start using it for a year, even two years, and it works… it works… and they've got a single queue, and then they ring you and say: can we have a bigger cluster? Can we have a bigger host? And then you respond: what do you mean, a bigger cluster? Well, this thing is slow. Okay, but a single queue is bound to a single core. So if we give you a 32-core box, that's not going to help you. Even if I were to add nodes, that's not going to help you.
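To make the single-core point concrete: the usual way around it is to spread one logical stream over several queues, keyed so that related messages stay together. The sketch below does this on the publisher side with a plain hash-modulo scheme. This is deliberately simpler than, and not the same algorithm as, RabbitMQ's consistent hash exchange or sharding plugin; the queue naming is hypothetical.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index in a stable, process-independent way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def queue_for(key, base_name, shard_count):
    """Name of the queue that should receive messages for this key."""
    return "{}.{}".format(base_name, shard_for(key, shard_count))


# Messages for the same account always land on the same queue, so ordering
# is preserved per account while different accounts use different queues.
for account in ("acct-1", "acct-2", "acct-3", "acct-1"):
    print(account, "->", queue_for(account, "trades", 4))
```

With hash-modulo, changing the number of shards remaps most keys, which is the problem consistent hashing is designed to reduce.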

So shard your data: the consistent hash exchange, the sharding plugin. A lot of good stuff here has been talked about previously. I think it's relatively easy now to shard your data. Of course, there are a lot of questions on ordering, etc., but those can be engineered.

Second one: large amounts of pending messages. The classic: “my consumer is slow” or “my consumer is actually down”, and there's a fast publisher, so memory usage grows. The problem is not so much the memory usage itself. If you could engineer for it, you could say: okay, my broker needs 20 gigs of memory. Fair enough, we can plan ahead for that. The problem is a swing effect and unpredictable performance. If you are latency sensitive, that's going to be an issue, because you are publishing really fast and you have slow consumers.

Then you're going to have this nasty swing where you fill up the memory and hit the memory watermark. The broker will say: whoa, okay, I can't keep up, I need to swap out to disk, whether the messages are persistent or non-persistent. And then it's going to start applying back-pressure to the publisher, or even blocking the publisher. So the publisher suddenly goes: “Ok, I can't do anything”. Then RabbitMQ pages everything it needs to disk, memory frees up, and the publisher resumes. And in terms of message rates, this gives you a nasty swing like that.

In the early days, we had a lot of problems with that. It's mostly good now. We're using lazy queues. They're almost as fast as normal queues, sometimes actually even faster, depending on the payload and the workload. And most importantly, it's predictable performance… mostly predictable. When something is predictable, it means that you can plan ahead and decide: okay, I need one queue, five queues, 10 queues; maybe I need multiple clusters, and maybe I can shard across clusters. You can design your application around all this.

Shutdown and startup order. I feel bad about this one, but I'm being transparent… in a RabbitMQ cluster, the last node to go down should be the first node to come up. So when you deal with orchestration, if you're doing upgrades, if you're not doing a rolling restart (like if you bring down the whole cluster), you have to be a little careful. And with the way that we deploy and start brokers, we sometimes had issues with that. We didn't lose data, because a Rabbit broker will actually refuse to start if it realizes that you're starting it first and it wasn't the last one to go down. It's benign, because the broker will then crash and our automated process monitoring will restart it over and over and over.

It's much better now and in 3.7. I am happy to stand corrected. But I think now the broker is going to come up and then wait for the other ones, which sounds like a much, much nicer way of handling this.
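The ordering rule above (last node down, first node up) is simple enough to encode in orchestration tooling. The following is a minimal sketch with hypothetical helper and node names; it only captures the rule itself and says nothing about how the brokers are actually stopped or started.

```python
def startup_order(shutdown_order):
    """Return a safe start order: the reverse of the order nodes were stopped in."""
    return list(reversed(shutdown_order))


def first_start_is_safe(shutdown_order, first_started):
    """True if the node started first is the one that was stopped last."""
    return bool(shutdown_order) and first_started == shutdown_order[-1]


stopped = ["rabbit@node-a", "rabbit@node-b", "rabbit@node-c"]  # node-c went down last
print(startup_order(stopped))
print(first_start_is_safe(stopped, "rabbit@node-a"))
```

An orchestrator that records the shutdown sequence can use a check like this before a full-cluster restart, instead of relying on a process monitor to keep retrying a node that refuses to boot.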

Where are we now?

It's been quite a journey. We're at 225 clusters; nothing in the thousands, I'm afraid. We serve about 170 applications. Some applications use more than one cluster. We run a mix of 3.6.6 and 3.6.16. We obviously don't go for every single version: we started out with RabbitMQ 3.1.3, then went to 3.3.0, then 3.5.0 and up to 3.5.7.

We stayed quite a long time on version 3.5.7 and actually still have a couple of skeletons here and there, but it's probably like half a percent. And then came 3.6.6 and 3.6.16. There's a lot of goodness in 3.7 and we're actually starting to engineer it now. This is not going to be a simple upgrade for us, because it's not just RabbitMQ per se: the way we've designed our provisioning system means that we have to rework a lot of it and make it better, so that we can benefit from the new features in RabbitMQ 3.7.

It's very important to certify versions beforehand and to do a lot of testing, because once a version is out there, we're on the hook for running it at scale. We also usually don't like to have too much of a tail in upgrading. So once we've certified a version, we go all in on it and upgrade as fast as possible. We don't really like running multiple versions.

What's next?

I mentioned provisioning. Today it's mostly our messaging support teams or the messaging engineering team that are pushing the buttons, hitting the APIs and hitting the UI. We would like this to be even more of a service, where application developers can have our APIs.

But to do this, there are a lot of capacity management constraints, a lot of permissioning, and a lot of accountability and tracking that need to be put in place. And you want to make sure that this thing is rock solid. Because again, once you've put this out there... developers love APIs. If you give them an API that can order stuff, they will order everything: “I'm going to order a 200-node cluster. I'm going to order 50 clusters, in all the regions.” So you have to understand what makes sense. You don't want to be dictating, you don't want to constrain developers too much, but you also have to be careful about your surface area and what you expose yourself to as a team.

The next point: replication. Right now we only run clusters within a single data center. We don't stretch clusters across data centers, let alone across regions. We don't use any kind of replication. We looked at federation, both for exchanges and queues, but it didn't really do what we wanted, because then the consumers have to be smart. Ideally we would want something that replicates data so that consumers know exactly where they are, with state that is actually synchronized and acknowledgements that are fed back from both sides of the replication.

Ideally we would like to stretch clusters, but we can’t, because we use pause_minority and in most regions we only have two data centers. So how do you achieve quorum with two data centers? I'll let you do the math. So, maybe something else; we will see. It may be… you could help us here, I don't know, sounds pretty promising.
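The math in question can be spelled out. With pause_minority, nodes that find themselves in a partition holding half or fewer of the cluster's nodes pause themselves, so a surviving side needs a strict majority. The sketch below checks, for a given layout, which single data center losses the cluster would survive; the function name and the data center labels are illustrative assumptions.

```python
def survives_dc_loss(nodes_per_dc):
    """For each data center, report whether the rest of the cluster keeps a
    strict majority (and so stays up under pause_minority) if that DC is lost."""
    total = sum(nodes_per_dc.values())
    outcome = {}
    for lost_dc, lost_nodes in nodes_per_dc.items():
        remaining = total - lost_nodes
        outcome[lost_dc] = remaining > total / 2
    return outcome


print(survives_dc_loss({"dc1": 2, "dc2": 1}))            # two data centers
print(survives_dc_loss({"dc1": 1, "dc2": 1, "dc3": 1}))  # three data centers
```

However the nodes are split across two data centers, one site always holds at least half of them, and losing that site pauses everything that remains. Only a third location lets the cluster survive the loss of any single site.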

Last point: peer discovery. Right now, our configuration is pretty static. As I showed in the previous slide, we generate the RabbitMQ config from templates using the provisioning system, but that means that the list of nodes in the cluster is static. You have the cluster section, using the Erlang syntax in the config file, that explicitly lists all the nodes. We would like to use something else, and that comes with 3.7. We'd like to use another discovery backend; it could be etcd, could be Consul, you name it. With that, it's much easier for us to register nodes.
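To illustrate what a static, template-generated node list looks like, here is a small sketch that renders the `cluster_nodes` entry of a classic Erlang-syntax `rabbitmq.config`. The function, host names, and layout are hypothetical and are not the templates used by the provisioning system described in the talk.

```python
def render_cluster_config(hostnames, node_prefix="rabbit", node_type="disc"):
    """Render a classic-format rabbitmq.config listing every cluster node explicitly."""
    nodes = ", ".join("'{}@{}'".format(node_prefix, host) for host in hostnames)
    return "[{rabbit, [\n  {cluster_nodes, {[%s], %s}}\n]}]." % (nodes, node_type)


# Host names below are placeholders.
print(render_cluster_config(["mq-host1", "mq-host2", "mq-host3"]))
```

Every change in cluster membership means regenerating and redistributing this file, which is the rigidity that a discovery backend such as etcd or Consul is meant to remove.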

I think that's it for me. By the way: we're hiring. Not necessarily my team directly, but I think we do a lot of cool stuff at scale, with a lot of different products. So, check it out: Goldman Sachs Careers. We've got a number of job openings, so have a look.

Thanks for having me.

[Applause]

