Suddenly, someone's product goes viral, and a lot of people rush in to buy it at the same time. Our cache and CDN don't even blink at the traffic, and our well-designed Lambdas scale amazingly fast, but our DynamoDB table is suddenly bombarded with writes and the auto-scaling can't keep up. Our order processing Lambda receives a ProvisionedThroughputExceededException, and when it retries it just makes everything worse. Things crash. Sales are lost. We eventually recover, but those customers are gone. How do we make sure it doesn't happen again?
AWS Services involved:
DynamoDB: Our database. All you need to know for this post is that it's a managed NoSQL database with provisioned write capacity, and that when writes exceed that capacity it throws a ProvisionedThroughputExceededException.
SQS: A fully managed message queuing service that enables you to decouple components. Producers like our order processing microservice post to the queue, the queue stores these messages until they're read, and consumers read from the queue in their own time.
SES: An email platform, more similar to services like MailChimp than to an AWS service. If you're already on AWS and you just need to send emails programmatically, it's easy to set up. If you're not on AWS, need more control, or need to send so many emails that price is a factor, you'll need to do some research. For this post, SES is good enough.
The queue stores the messages for a certain amount of time, and when consumers are ready, they poll the queue for messages and receive the oldest message.
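To make that poll model concrete, here's a minimal sketch of a consumer using the Node.js aws-sdk (v2, the same one used in the snippets below). The queue URL is the placeholder used throughout this post:

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

const queueUrl = 'https://sqs.YOUR_REGION.amazonaws.com/YOUR_ACCOUNT_ID/OrdersQueue.fifo';

async function pollOnce() {
  // Long polling: wait up to 20 seconds for a message instead of returning immediately.
  const { Messages } = await sqs.receiveMessage({
    QueueUrl: queueUrl,
    MaxNumberOfMessages: 1,
    WaitTimeSeconds: 20
  }).promise();

  for (const message of Messages || []) {
    console.log('Received:', message.Body);
    // Deleting the message marks it as successfully processed;
    // otherwise it becomes visible again after the visibility timeout.
    await sqs.deleteMessage({
      QueueUrl: queueUrl,
      ReceiptHandle: message.ReceiptHandle
    }).promise();
  }
}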
There are two types of queues in SQS:
Standard queues are the default type of queue. They're cheaper than FIFO queues and nearly infinitely scalable. The tradeoff is that they only guarantee at-least-once delivery (meaning you might get duplicates), and the order of the messages is mostly respected but not guaranteed.
FIFO queues are more expensive than Standard queues, and they don't scale infinitely, but they guarantee ordered, exactly-once delivery. You need to set the MessageGroupId property on each message, since FIFO queues only deliver the next message in a message group after the previous message has been successfully processed. For example, if you set the value of MessageGroupId to the customer ID and a customer makes two orders at the same time, the second order won't be processed until the first one has finished processing. It's also important to set MessageDeduplicationId, to ensure that if the message gets duplicated upstream, it will be deduplicated at the queue: a FIFO queue will only keep one message per unique value of MessageDeduplicationId. The sketch below shows what creating such a queue could look like.
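As a rough sketch, creating the FIFO queue used in the rest of this post could look like this with the aws-sdk (FIFO queue names must end in .fifo):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

async function createOrdersQueue() {
  // FIFO behavior is opted into with the FifoQueue attribute,
  // and the queue name must end in ".fifo".
  const { QueueUrl } = await sqs.createQueue({
    QueueName: 'OrdersQueue.fifo',
    Attributes: {
      FifoQueue: 'true'
    }
  }).promise();
  console.log('Queue created:', QueueUrl);
}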
Follow these step-by-step instructions to implement an SQS queue to throttle writes to a DynamoDB table. Replace YOUR_ACCOUNT_ID and YOUR_REGION with the appropriate values for your account and region.
This is what the code that sends orders to the queue looks like:
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// FIFO queue names must end in ".fifo".
const queueUrl = 'https://sqs.YOUR_REGION.amazonaws.com/YOUR_ACCOUNT_ID/OrdersQueue.fifo';

async function processOrder(order) {
  const params = {
    MessageBody: JSON.stringify(order),
    QueueUrl: queueUrl,
    // Orders from the same customer are processed in order, one at a time.
    MessageGroupId: order.customerId,
    // Duplicate submissions of the same order are dropped by the queue.
    MessageDeduplicationId: order.id
  };
  try {
    const result = await sqs.sendMessage(params).promise();
    console.log('Order sent to SQS:', result.MessageId);
  } catch (error) {
    console.error('Error sending order to SQS:', error);
    // Rethrow so the caller knows the order wasn't queued.
    throw error;
  }
}
This is the IAM policy the service sending orders needs in order to post messages to the queue:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersQueue.fifo"
    }
  ]
}
Add the following code to the Lambda function that consumes orders from the queue, saves them to DynamoDB, and sends a confirmation email:
const AWS = require('aws-sdk');
const dynamoDB = new AWS.DynamoDB.DocumentClient();
const ses = new AWS.SES();

exports.handler = async (event) => {
  for (const record of event.Records) {
    const order = JSON.parse(record.body);
    await saveOrderToDynamoDB(order);
    await sendEmailNotification(order);
  }
};

async function saveOrderToDynamoDB(order) {
  const params = {
    TableName: 'Orders',
    Item: order
  };
  try {
    await dynamoDB.put(params).promise();
    console.log(`Order saved: ${order.orderId}`);
  } catch (error) {
    console.error(`Error saving order: ${order.orderId}`, error);
    // Rethrow so the invocation fails and SQS makes the message visible again for a retry.
    throw error;
  }
}

async function sendEmailNotification(order) {
  const emailParams = {
    Source: '[email protected]',
    Destination: {
      ToAddresses: [order.customerEmail]
    },
    Message: {
      Subject: {
        Data: 'Your order is ready'
      },
      Body: {
        Text: {
          Data: `Thank you for your order, ${order.customerName}! Your order #${order.orderId} is now ready.`
        }
      }
    }
  };
  try {
    await ses.sendEmail(emailParams).promise();
    console.log(`Email sent: ${order.orderId}`);
  } catch (error) {
    console.error(`Error sending email for order: ${order.orderId}`, error);
    // Rethrow so the failure isn't silently swallowed.
    throw error;
  }
}
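For the Lambda function to receive these messages, it needs an SQS trigger (an event source mapping). You'd typically set this up in the console or in your infrastructure-as-code tool, but as a rough sketch with the aws-sdk it could look like this (the function name ProcessOrders is just a placeholder):

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

async function connectQueueToLambda() {
  // Tells the Lambda service to poll the queue and invoke the function with batches of messages.
  const mapping = await lambda.createEventSourceMapping({
    EventSourceArn: 'arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersQueue.fifo',
    FunctionName: 'ProcessOrders', // hypothetical function name
    BatchSize: 10
  }).promise();
  console.log('Event source mapping created:', mapping.UUID);
}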
This is the IAM policy the Lambda function needs to read from the queue, write to DynamoDB, and send emails with SES:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersQueue.fifo"
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:YOUR_REGION:YOUR_ACCOUNT_ID:table/Orders"
    },
    {
      "Effect": "Allow",
      "Action": "ses:SendEmail",
      "Resource": "*"
    }
  ]
}
Architecture-wise, there's one big change in our solution: We've made our workflow async! Let me bring the diagram here.
Before, our Orders service would return the result of the order. From the user's perspective, they wait until the order is processed and see the result on the website. From the system's perspective, we're constrained to either succeed or fail at processing the order within API Gateway's timeout limit (29 seconds). In more practical terms, we're limited by what the user is expecting: we can't just show a "loading" icon for 29 seconds!
After the change, the website just shows something like "We're processing your order, we'll email you when it's ready". That sets a different expectation for the user, and that's important for the system: now our Lambda function could actually take 15 minutes, without hitting API Gateway's 29-second limit and without the user getting angry.

It's not just that, though. If the Order Processing Lambda crashes mid-execution, the SQS queue will make the order available again as a message after the visibility timeout expires, and the Lambda service will invoke our function again with the same order. When the maxReceiveCount limit is reached, the order can be sent to another queue called a dead-letter queue (DLQ), where we can store failed orders for future reference. We didn't set up a DLQ here, but it's easy enough (see the sketch below), and for small and medium-sized systems you can easily set up SNS to send you an email and resolve the issue manually, since the volume shouldn't be particularly large.
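As a rough sketch, attaching a DLQ with the aws-sdk could look like this (the queue name OrdersDLQ.fifo and the maxReceiveCount of 3 are just example values; a FIFO queue's DLQ must also be a FIFO queue):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

async function attachDeadLetterQueue() {
  // Create the DLQ itself; it must also be FIFO because the source queue is FIFO.
  const dlq = await sqs.createQueue({
    QueueName: 'OrdersDLQ.fifo',
    Attributes: { FifoQueue: 'true' }
  }).promise();

  // Look up the DLQ's ARN, which the redrive policy needs.
  const { Attributes } = await sqs.getQueueAttributes({
    QueueUrl: dlq.QueueUrl,
    AttributeNames: ['QueueArn']
  }).promise();

  // After 3 failed receives, SQS moves the message to the DLQ instead of redelivering it.
  await sqs.setQueueAttributes({
    QueueUrl: 'https://sqs.YOUR_REGION.amazonaws.com/YOUR_ACCOUNT_ID/OrdersQueue.fifo',
    Attributes: {
      RedrivePolicy: JSON.stringify({
        deadLetterTargetArn: Attributes.QueueArn,
        maxReceiveCount: '3'
      })
    }
  }).promise();
}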