Imagine that you have a web application that lets customers upload a text document that the application converts into a PDF and emails back to the customer. This site receives a lot of PDF creation requests every second and every request takes a second to process - some sort of queue is needed to be able to handle all requests.
- Customer uploads a text document
- Your application converts the text document into a PDF
- Your application emails the PDF back to the customer
Database for a queue based system
You might, in this kind of situation, consider using a database for your PDF job queue. Typically you would create a database table which has a queue with records that represent PDF requests and then a flag in the table representing which state the task is in and whether the task is completed or not.
INSERT INTO pdf_job_queue (name, status, email) VALUES ("White paper", "NEW", "firstname.lastname@example.org");
You need to write code to insert the new requests into the database. Code that takes an input from the database, perhaps change a status column, with values such as "NEW" and "PROCESSING", code that handles the request, code that again updates the database status field to "FINISHED" or code to remove the request from the queue.
A message queue is built to handle this kind of scenario, and it is easy to queue or dequeue messages from the queue.
Another thing that you should keep in mind when you build a solution where you let your database act as a queue, is that your application has to ask the server over and over again for the latest queued requests. Messages that have already been consumed need to be filtered out. The number of results received might need to be limited.
SELECT * FROM pdf_job_queue WHERE queue = 'resize_queue' AND handled = false ORDER BY date_recived limit 1;
To be most effective you might need to poll the database very fast and do frequent queries to the database table. Of course this way of frequently asking the database adds a significant load to the database.
The database table is both read and write (insert requests to the queue and read requests from the queue). Usually it is a quick operation to add data, update data or querying data from a table, but now you will have to deal with all of them in the same table. Queuing/dequeuing will impact each other under heavy loadcause lock conditions, transaction deadlocks, IO timeouts, etc.
Note: As long as you don't have very high performance throughput requirements and if you do not have to communicate with any other systems then a database will probably work just fine for you.
Now imagine that your web application is getting used by more and more customers. You have to queue many PDF creation requests in your database and you want to be able to process more requests per second concurrently. You can connect more workers (processes that handles the requests) or you can scale your solution by adding more machines into your pool of resources or by adding more power (CPU, RAM) to your existing machine to be able to handle the high load. You decide to connect more workers to your database.
Lock on messages and manual handling of the complexity
When a second worker is added, there is a possibility that the two workers will try to handle the same request at the same time. You need to implement some kind of row lock on the queue so that the same request never gets picked up by concurrent receivers.
Each worker needs to acquire some sort of a lock on the request. Once the exclusive lock has been acquired for the row the system needs to handle the update (e.g. update a status to "PROCESSING"), and after that the lock will need to be released (e.g. by committing the transaction) as soon as possible so that other workers can access the queue too.
You can also imagine what happens when one of your workers gets a network issue and can not process the request. The request has to be added back to the queue, the status has to be changed back somehow or you have to use transactions and lock it until you have processed the document - meaning that you will lock the queue for other workers.
All this complexity will have to be managed by the programmer - code has to be written. A decent message queue will take care of a lot of this automatically and you don't have to worry about any deadlocks or race conditions (when your system attempts to perform two or more operations at the same time).
If you don’t clean up your database table regulary or delete requests as soon as they have been handled, your database will keep growing until it starts to hit the performances of your database. Eventually it will become a serious operation problem for you.
The locking challanges and the scan process to find more work will make you run into scalability hang-ups and it will typically also slow down your database.
We have now talked a lot of the limitations that you might notice if you use a database for a queue based system (if the system is heavily used or when the system most likely needs to scale up when it is growing). We have talked about the constant polling of messages from the queue, the manual cleanups of messages, deadlocks, race conditions and failed processing that has to be handled manually. All these makes it fairly clear that a database table is not the right tool for a queue based system. It's time to understand when and why the message queue shines!
A message queue only needs to ensure that a message is stored and successfully processed, it doesn't need to bother with locking and writing transactions. You can receive your PDF processing request and let the worker do all the processing, the worker can then pop next message of the queue. Message queue scaling is much easier than scaling databases - just add another worker when the queue starts to fill up. If you get to many connections or something else that need a lot CPU - add more CPU power to your message queue server. In a message queue we don’t need to worry about any race conditions or lockes keeping you out of the system.
Messages from a message queue are pushed in real-time instead of periodically polled from a database. We can efficiently support a significantly higher volume of concurrent messages with a message queue. Messages in a message queue are automatically cleaned up after being received.
You may wonder what happens if one of the workers starts to handle a request and dies when it is partly finished. If a worker dies, we would like the task to be delivered to another worker or held in the queue until the worker is ready to receive the request again. In order to make sure a message is never lost, a good message queue supports message acknowledgments. An acknowledgement is sent back from the worker to tell the message queue that a particular message had been received, processed and that the message queue is free to delete it. If a worker dies without sending an acknowledgement, the message queue will understand that a message wasn't processed fully and will redeliver it to the queue and after that to another worker. That way you can be sure that no message is lost.
Databases are great for storing information, like product information about a product that you sell on a web shop. It is great when you need to ask questions about the data or sort it in different categories. Using a message queue are highly recommended when you need to process a high-volume of asynchronous messages.
A message queue is of perfect use when you want a high performance, highly concurrent and scalable system that can process thousands of messages per second concurrently across many servers/processes!
If you feel that it's about time to get started with your first service using a message queue, check out our article: RabbitMQ for beginners, what is RabbitMQ?
Please email us at if you have any suggestions or feedback.