This is the last of a 3-part mini series on managing your AWS Lambda logs. If you haven’t read them yet, please give part 1 and part 2 a read now. We’ll be building on top of the basic infrastructure for shipping logs from CloudWatch Logs detailed in part 1.
part 1:
part 2:
Microservice death stars, circa 2015.
When everything works, it’s like watching an orchestra: lots of small pieces all acting independently whilst at the same time collaborating to form a whole that’s greater than the sum of its parts. However, when things don’t work, it’s a pain in the ass to debug. Finding that one clue is like finding a needle in a haystack, as there are so many moving parts, and they’re all constantly moving. Imagine you’re an engineer at Twitter trying to debug why a user’s tweet was not delivered to one of their followers’ timelines.
“Let me cross reference the logs from hundreds of services and find the logs that mention the author’s user ID, the tweet ID, or the recipient’s user ID, and put together a story of how the tweet flowed through our system and why it wasn’t delivered to the recipient’s timeline.”
“What about logs that don’t explicitly mention those fields?”
“mm… let me get back to you on that…”
Needle in the haystack.
This is the problem that correlation IDs solve in the microservice world: tag every log message with the relevant context so that it’s easy to find them later on. Aside from common IDs such as user ID, order ID, tweet ID, etc. you might also want to include the X-Ray trace ID in every log message. That way, when you find a relevant log message, you can use it to quickly load up the corresponding trace in the X-Ray console.
By default, Lambda automatically populates the _X_AMZN_TRACE_ID environment variable for each invocation.
Also, if you’re going to add a bunch of correlation IDs to every log message, then you should consider switching to JSON. You’ll then need to update the ship-logs function we introduced in part 1 to handle log messages that are formatted as JSON.
“Darn it, now we have to enable debug logging and redeploy all these services! What a pain!”
“Don’t forget to disable debug logging and redeploy them, after you’ve found the problem ;-)”
Fortunately, it doesn’t have to be a catch-22 situation. You can enable DEBUG logging on the entire call chain by making the decision at the edge and propagating it: the edge service decides to turn DEBUG logging on for, say, 5% of requests, and that decision is captured and passed along the entire call chain, through HTTP requests, SNS messages and Kinesis events.
The demo project consists of an edge API, api-a, which initialises a bunch of correlation IDs as well as the decision on whether or not to turn on debug logging. It’ll pass these along through HTTP requests to api-b, Kinesis events and SNS messages. Each of these downstream functions would in turn capture and pass them along to api-c.
To abstract away the implementation details, let’s create a requestContext module that makes it easy to fetch and update this context data:
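Here is a minimal sketch of what such a module could look like. The function names (replaceAllWith, set, get, clearAll) are assumptions for illustration; the module in the demo project may be organised differently.

```js
// requestContext.js - a minimal sketch of the module described above.
// It holds the correlation IDs for the current invocation in a module-level
// variable, which is safe because a Lambda container only handles one
// invocation at a time.
let context = {};

// replace the whole context (called at the start of an invocation)
function replaceAllWith (ctx) {
  context = ctx;
}

// add or update a single correlation ID; keys are normalised to lower case
// and given the x-correlation- prefix if they don't have one already
function set (key, value) {
  if (!key.startsWith('x-correlation-')) {
    key = `x-correlation-${key}`;
  }
  context[key.toLowerCase()] = value;
}

// return a copy of the current context
function get () {
  return Object.assign({}, context);
}

// clear everything (e.g. between invocations)
function clearAll () {
  context = {};
}

module.exports = { replaceAllWith, set, get, clearAll };
```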
And then add a log module which writes log messages as JSON, includes the captured correlation IDs in every message, and respects the debug-logging decision that was captured for the current request:
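A sketch of such a log module might look like this, assuming the requestContext module above, JSON log messages written to stdout (which CloudWatch Logs captures), and a Debug-Log-Enabled flag stored alongside the correlation IDs:

```js
// log.js - a sketch of a logger that writes JSON log messages, includes the
// captured correlation IDs in every message, and only emits DEBUG messages
// when debug logging was enabled for the current request.
const correlationIds = require('./requestContext'); // module name assumed

const LogLevels = { DEBUG: 0, INFO: 1, WARN: 2, ERROR: 3 };

// the effective log level: DEBUG if the captured decision says so,
// otherwise whatever LOG_LEVEL is configured (defaulting to INFO)
function currentLevel () {
  const debugEnabled = correlationIds.get()['Debug-Log-Enabled'] === 'true';
  return debugEnabled ? 'DEBUG' : (process.env.LOG_LEVEL || 'INFO');
}

function log (level, message, params) {
  if (LogLevels[level] < LogLevels[currentLevel()]) {
    return; // below the current log level, don't emit anything
  }

  // merge correlation IDs and any extra params into the log message
  const logMsg = Object.assign({}, correlationIds.get(), params, { level, message });
  console.log(JSON.stringify(logMsg));
}

module.exports = {
  debug: (msg, params) => log('DEBUG', msg, params),
  info:  (msg, params) => log('INFO', msg, params),
  warn:  (msg, params) => log('WARN', msg, params),
  error: (msg, params) => log('ERROR', msg, params)
};
```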
The Debug-Log-Enabled field was captured in the incoming request alongside the other correlation IDs.
Notice that I have also captured the User-Agent from the incoming request, as well as the decision to not enable DEBUG logging.
When you need to implement another HTTP endpoint, pass your handler code to a factory function that captures these correlation IDs for you (sketched below). Now, with minimal change, all your logs will have the captured correlation IDs (as well as the User-Agent, whether to enable debug logging, etc.).
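The factory function could look something like this sketch. The name apiHandler, the header names and the 5% sampling rate are assumptions for illustration:

```js
// apiHandler.js - a sketch of a factory function that wraps an API Gateway
// handler and captures the incoming correlation IDs before running it.
const correlationIds = require('./requestContext'); // module name assumed

function captureCorrelationIds (headers, awsRequestId) {
  const context = {};

  // copy any incoming x-correlation-* headers from the caller
  for (const header of Object.keys(headers)) {
    if (header.toLowerCase().startsWith('x-correlation-')) {
      context[header.toLowerCase()] = headers[header];
    }
  }

  // if we're the edge service, initialise x-correlation-id from the request ID
  if (!context['x-correlation-id']) {
    context['x-correlation-id'] = awsRequestId;
  }

  // also capture the User-Agent and the debug-logging decision
  if (headers['User-Agent']) {
    context['User-Agent'] = headers['User-Agent'];
  }
  if (headers['Debug-Log-Enabled']) {
    context['Debug-Log-Enabled'] = headers['Debug-Log-Enabled'];
  } else {
    // at the edge, sample debug logging for a small percentage of requests
    context['Debug-Log-Enabled'] = Math.random() < 0.05 ? 'true' : 'false';
  }

  correlationIds.replaceAllWith(context);
}

// pass your handler code to this factory to get an API Gateway handler
// that captures correlation IDs before your code runs
function apiHandler (handlerFn) {
  return async (event, lambdaContext) => {
    captureCorrelationIds(event.headers || {}, lambdaContext.awsRequestId);
    return handlerFn(event, lambdaContext);
  };
}

module.exports = apiHandler;
```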
The api-a function in our earlier architecture looks something like this:
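This is a sketch of what the api-a handler might look like when built on the modules above. The require paths, the API_B_URL environment variable and the http module (sketched a little further below) are all assumptions; the apiHandler wrapper takes care of capturing the correlation IDs before the handler code runs.

```js
// api-a/handler.js - a sketch of the api-a function. The apiHandler wrapper
// captures the correlation IDs (and initialises x-correlation-id from the
// AWS request ID at the edge) before this code runs.
const apiHandler = require('../lib/apiHandler');
const log = require('../lib/log');
const http = require('../lib/http'); // custom HTTP client, sketched below

module.exports.handler = apiHandler(async () => {
  log.info('handling request in api-a');

  // call api-b; the custom http module forwards the captured correlation IDs
  // (and User-Agent, Debug-Log-Enabled) as HTTP headers automatically
  const response = await http.get(process.env.API_B_URL);
  log.debug('api-b responded', { status: response.statusCode });

  return {
    statusCode: 200,
    body: JSON.stringify({ message: 'hello from api-a' })
  };
});
```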
Since this is the API on the edge, it initialises the x-correlation-id using the AWS Request ID for its invocation. This, along with several other pieces of contextual information, is recorded with every log message.
By adding a custom HTTP module, you can also make it easy to include this contextual information in outgoing HTTP requests. Encapsulating these conventions in an easy-to-use library also helps you standardise the approach across your team.
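For example, a thin wrapper around Node’s built-in https module could merge the captured correlation IDs into the headers of every outgoing request. This is only a sketch; the demo project may use a different HTTP client:

```js
// http.js - a sketch of an HTTP client that injects the captured correlation
// IDs (plus User-Agent and Debug-Log-Enabled) as headers on outgoing requests.
const https = require('https');
const correlationIds = require('./requestContext'); // module name assumed

function get (url, headers = {}) {
  // the captured context wins over any explicitly supplied headers
  const mergedHeaders = Object.assign({}, headers, correlationIds.get());

  return new Promise((resolve, reject) => {
    const req = https.get(url, { headers: mergedHeaders }, (res) => {
      let body = '';
      res.on('data', chunk => { body += chunk; });
      res.on('end', () => resolve({ statusCode: res.statusCode, body }));
    });
    req.on('error', reject);
  });
}

module.exports = { get };
```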
In the api-a function above, we made an HTTP request to the api-b endpoint. Looking at the logs, you can see the aforementioned contextual information has been passed along.
In this case, we also have the User-Agent from the original user-initiated request to api-a. This is useful because when I look at the logs for intermediate services, I often miss the context of what platform the user is on, which makes it harder to correlate the information I gather from the logs with the symptoms the user describes in their bug reports.
When the api-b function makes its own outbound HTTP request to api-c, it’ll pass along all of this contextual information plus anything we add in the api-b function itself.
Log message for when api-b calls api-c with the custom HTTP module. Notice it includes the “x-correlation-character-b” header which is set by the api-b function.
When you see the corresponding log message in api-c’s logs, you’ll see all the context from both api-a and api-b.
In the api-a function above, we also published a message to SNS (omitted from the code snippet above) with a custom sns module which includes the captured correlation IDs as message attributes, see below.
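A sketch of such an sns module, using the AWS SDK for Node.js; the function name and parameters are assumptions:

```js
// sns.js - a sketch of a wrapper around SNS.publish that attaches the
// captured correlation IDs as message attributes.
const AWS = require('aws-sdk');
const SNS = new AWS.SNS();
const correlationIds = require('./requestContext'); // module name assumed

// turn the captured context into SNS MessageAttributes
function messageAttributes () {
  const attributes = {};
  const context = correlationIds.get();

  for (const key of Object.keys(context)) {
    attributes[key] = {
      DataType: 'String',
      StringValue: `${context[key]}`
    };
  }

  return attributes;
}

async function publish (topicArn, message) {
  return SNS.publish({
    TopicArn: topicArn,
    Message: JSON.stringify(message),
    MessageAttributes: messageAttributes()
  }).promise();
}

module.exports = { publish };
```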
When this SNS message is delivered to a Lambda function, you can see the correlation IDs in the MessageAttributes field of the SNS event.
Let’s create a snsHandler factory function to standardise the process of capturing incoming correlation IDs via SNS message attributes.
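Here is a sketch of the snsHandler factory, assuming the requestContext module above. Since SNS invokes Lambda with one message per invocation, restoring the context from the first record is sufficient:

```js
// snsHandler.js - a sketch of a factory function that restores correlation IDs
// from SNS message attributes before running the wrapped handler.
const correlationIds = require('./requestContext'); // module name assumed

function snsHandler (handlerFn) {
  return async (event, lambdaContext) => {
    // SNS delivers one message per Lambda invocation
    const attributes = event.Records[0].Sns.MessageAttributes || {};

    const context = {};
    for (const key of Object.keys(attributes)) {
      context[key] = attributes[key].Value; // string attributes carry a Value field
    }

    correlationIds.replaceAllWith(context);
    return handlerFn(event, lambdaContext);
  };
}

module.exports = snsHandler;
```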
We can use this factory function to quickly create SNS handler functions. The log messages from these handler functions will have access to the captured correlation IDs, and if you use the aforementioned custom http module to make outgoing HTTP requests then they’ll be included as HTTP headers automatically.
For instance, the following SNS handler function would capture incoming correlation IDs, include them in log messages, and pass them on when making an HTTP request to api-c (see architecture diagram).
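A sketch of such a handler; the require paths, the API_C_URL environment variable and the x-correlation-source-type value are assumptions based on the captions below:

```js
// sns-handler/handler.js - a sketch of an SNS-triggered function built with
// the snsHandler factory above.
const snsHandler = require('../lib/snsHandler');
const correlationIds = require('../lib/requestContext');
const log = require('../lib/log');
const http = require('../lib/http');

module.exports.handler = snsHandler(async (event) => {
  // add our own correlation ID on top of the ones we captured
  correlationIds.set('x-correlation-source-type', 'sns');

  log.info('received SNS message', { message: event.Records[0].Sns.Message });

  // the custom http module forwards all captured correlation IDs as headers
  await http.get(process.env.API_C_URL);
});
```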
Log message for when the SNS handler function calls api-c with the custom HTTP module. Notice it includes the “x-correlation-source-type” header which is set by the SNS handler function.
Those correlation IDs (including the one added by the SNS handler function) are included as HTTP headers.
Let’s create a kinesis module for sending events to a Kinesis stream, so that we can insert a __context field into the payload to carry the correlation IDs.
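A sketch of the kinesis module, again using the AWS SDK for Node.js; the function name and parameters are assumptions:

```js
// kinesis.js - a sketch of a wrapper around Kinesis.putRecord that smuggles
// the captured correlation IDs into the payload under a __context field.
const AWS = require('aws-sdk');
const Kinesis = new AWS.Kinesis();
const correlationIds = require('./requestContext'); // module name assumed

async function putRecord (streamName, partitionKey, event) {
  // attach the current correlation IDs to a copy of the event
  const payload = Object.assign({}, event, { __context: correlationIds.get() });

  return Kinesis.putRecord({
    StreamName: streamName,
    PartitionKey: partitionKey,
    Data: JSON.stringify(payload)
  }).promise();
}

module.exports = { putRecord };
```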
On the receiving end, we can take it out, use it to set the current requestContext, and delete the __context field before passing the record on to the Kinesis handler function for processing. The sender and receiver functions won’t even notice we modified the payload.
There is a complication, though: a Kinesis event can contain a batch of records, each potentially carrying its own set of correlation IDs. The simplest way to deal with this is to force the handler function to process records one at a time. That’s what we’ve done in the kinesisHandler factory function here.
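A sketch of the kinesisHandler factory, assuming the requestContext module above:

```js
// kinesisHandler.js - a sketch of a factory function that processes Kinesis
// records one at a time, restoring the correlation IDs from each record's
// __context field and removing the field before the handler sees the payload.
const correlationIds = require('./requestContext'); // module name assumed

function kinesisHandler (handlerFn) {
  return async (event, lambdaContext) => {
    for (const record of event.Records) {
      // Kinesis record data is base64-encoded JSON
      const json = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
      const payload = JSON.parse(json);

      // restore this record's correlation IDs, then hide the __context field
      correlationIds.replaceAllWith(payload.__context || {});
      delete payload.__context;

      // one record at a time, so the request context is never ambiguous
      await handlerFn(payload, lambdaContext);
    }
  };
}

module.exports = kinesisHandler;
```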
The handler function (created with the kinesisHandler factory function) would process one record at a time, and won’t have to worry about managing the request context. All of its log messages would have the right correlation IDs, and outgoing HTTP requests, SNS messages and Kinesis events would also pass those correlation IDs along.
Log message for when the kinesis handler function calls api-c.
When api-c receives the invocation event, you can see the correlation IDs have been passed along via HTTP headers. This approach is simple: developers working on Kinesis handler functions won’t have to worry about the implementation details of how correlation IDs are captured and passed along, and things “just work”. However, it also removes the opportunity to optimise by processing all the records in a batch. Perhaps your handler function has to persist the events to a store that’s better suited to receiving large payloads rather than lots of small ones.
This simple approach is not the right fit for every situation. An alternative would be to leave the __context field on the Kinesis records and let the handler function deal with them as it sees fit. In that case, you would also need to update the shared libraries (the log, http, sns and kinesis modules we have talked about so far) to give the caller the option to pass in a requestContext as an override, for example:
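For instance, the log module’s functions could accept an optional context override (a sketch; the parameter names are assumptions):

```js
// a sketch of giving callers the option to pass in a requestContext override,
// extending the log module sketched earlier
const correlationIds = require('./requestContext'); // module name assumed

function info (message, params, contextOverride) {
  // prefer the per-record context supplied by a batch handler, if any
  const context = contextOverride || correlationIds.get();
  const logMsg = Object.assign({}, context, params, { level: 'INFO', message });
  console.log(JSON.stringify(logMsg));
}

module.exports = { info };

// usage inside a batch Kinesis handler, where each record keeps its own __context:
//   log.info('processed record', { eventId: record.eventID }, payload.__context);
```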
You can find a deployable version of the code you have seen in this post in this demo repo. It’s intended for demo sessions in my O’Reilly course detailed below, so documentation is seriously lacking at the moment, but hopefully this post gives you a decent idea of how the project is held together.
Hi, my name is Yan Cui. I have run production workloads at scale in AWS for nearly 10 years and I have been an architect or principal engineer in a variety of industries ranging from banking, e-commerce and sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.
Check out my new course on AWS Step Functions. In this course, we’ll cover everything you need to know to use the AWS Step Functions service effectively, including basic concepts, HTTP and event triggers, activities, design patterns and best practices.
Come learn about operational BEST PRACTICES for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more.
You can also get 40% off the face price with the code ytcui.
Get your copy here.