How many of us have run into a CrashLoopBackoff or some other random error message, started down the Stack Overflow rabbit hole, and hit a wall?
In our case this happened with the AssumeRoleWithWebIdentity call, which started throwing the InvalidIdentityToken error when running pipelines with an OIDC provider for AWS. We went through a whole process of researching and ultimately fixing the issue for good, and decided to write up a quick run-through, in a single post, of how you can do this too.
For us, the service at hand was GitHub. OIDC authentication was configured there to give our GitHub repo the trust relationship it needs to access our AWS account, with the specified permissions and through temporary credentials, so that our CI/CD process can create and deploy AWS resources. We basically had one primary environment variable configured (AWS_WEB_IDENTITY_TOKEN_FILE), which tells tools such as boto3, the AWS CLI or Terraform (depending on the pipeline) to perform AssumeRoleWithWebIdentity and obtain temporary credentials for the role, which are then used for AWS operations.
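Since everything hinges on that configuration, a quick sanity check before any tool runs can save some debugging time. The following is only a minimal sketch: the function name check_oidc_env is our own invention for illustration, and it assumes AWS_ROLE_ARN is set alongside AWS_WEB_IDENTITY_TOKEN_FILE. It verifies only that the variables exist and the token file is non-empty, not that the token itself is valid.

```shell
#!/usr/bin/env bash

# check_oidc_env (hypothetical helper, not part of any AWS tooling):
# returns 0 when the web-identity settings look usable, 1 otherwise.
check_oidc_env() {
  if [ -z "${AWS_ROLE_ARN:-}" ]; then
    echo "AWS_ROLE_ARN is not set" >&2
    return 1
  fi
  if [ -z "${AWS_WEB_IDENTITY_TOKEN_FILE:-}" ]; then
    echo "AWS_WEB_IDENTITY_TOKEN_FILE is not set" >&2
    return 1
  fi
  # -s is true only if the file exists and has a size greater than zero.
  if [ ! -s "$AWS_WEB_IDENTITY_TOKEN_FILE" ]; then
    echo "token file $AWS_WEB_IDENTITY_TOKEN_FILE is missing or empty" >&2
    return 1
  fi
  return 0
}
```

Calling check_oidc_env at the top of a pipeline step and stopping on a non-zero return makes a missing token show up as a clear message rather than as a later authentication error.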
The AssumeRoleWithWebIdentity error shows up mostly around parallel access attempts, and affects how the various AWS interfaces authenticate, as well as run and deploy services. We started encountering this issue when running our deployment pipelines and attempting to authenticate our GitHub account to AWS via the OIDC plugin. This is a well-known limitation of authentication to AWS for web application providers. In our case it was GitHub, but this is true for pretty much any web application integration.
At first, builds would fail at random with the InvalidIdentityToken error, and would only sometimes succeed on reruns. We ignored it initially, assuming it was the regular old run-of-the-mill technology failure at scale. But then it started to happen more frequently, and added enough friction to our engineering velocity and delivery that we had to uncover what was happening.
We realized that, in order to enable parallel access, we were missing a critical step in the process. The recommended order is to retry AssumeRoleWithWebIdentity until it succeeds, and only then set the environment variables from the returned credentials (these are AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN). Another equally critical piece was to provide a longer validity window for our token; in our case we set the expiration to 1 hour. The last and most important part was performing the retry ourselves.
export AWS_ROLE_ARN=arn:aws:iam::1234567890:role/RoleToAssume
export AWS_WEB_IDENTITY_TOKEN_FILE=/tmp/awscreds
export AWS_DEFAULT_REGION="<region>"  # replace with your AWS region
export DEFAULT_PARALLEL_JOBS=4

# Request the OIDC token from GitHub and write it to the token file.
OUTPUT_TOKEN_REQUEST=$(curl -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" "$ACTIONS_ID_TOKEN_REQUEST_URL")
echo "$OUTPUT_TOKEN_REQUEST" | jq -r '.value' > "$AWS_WEB_IDENTITY_TOKEN_FILE"

RET=1
MAX_RETRIES=5
COUNTER=1
WAIT_FACTOR=2
RUN_ID=$(uuidgen)

until [ "$RET" -eq 0 ]; do
  # 5 retries are enough, then fail.
  if [ "$COUNTER" -gt "$MAX_RETRIES" ]; then
    echo "$RUN_ID - Maximum retries of $MAX_RETRIES reached. Returning error."
    exit 1
  fi

  # Try to perform the assume role with web identity.
  OUTPUT_ASSUME_ROLE=$(aws sts assume-role-with-web-identity --duration-seconds 3600 --role-session-name my_role_name --role-arn "$AWS_ROLE_ARN" --web-identity-token "$(cat "$AWS_WEB_IDENTITY_TOKEN_FILE")" --region "$AWS_DEFAULT_REGION")
  RET=$?
  echo "$RUN_ID - attempt: $COUNTER, assume role returned code: $RET"

  if [ "$RET" -ne 0 ]; then
    echo "$RUN_ID - attempt: $COUNTER - Error happened in assume role, error code - $RET, error msg: $OUTPUT_ASSUME_ROLE. retrying..."
    # Wait longer after every failed attempt.
    WAIT_FACTOR=$((WAIT_FACTOR*COUNTER))
    sleep "$WAIT_FACTOR"
  else
    # Set the AWS environment variables to be used.
    access_key_id="$(echo "$OUTPUT_ASSUME_ROLE" | jq -r '.Credentials.AccessKeyId')"
    export AWS_ACCESS_KEY_ID="$access_key_id"
    secret_access_key="$(echo "$OUTPUT_ASSUME_ROLE" | jq -r '.Credentials.SecretAccessKey')"
    export AWS_SECRET_ACCESS_KEY="$secret_access_key"
    session_token="$(echo "$OUTPUT_ASSUME_ROLE" | jq -r '.Credentials.SessionToken')"
    export AWS_SESSION_TOKEN="$session_token"
  fi
  COUNTER=$((COUNTER+1))
done

# Perform any calls to AWS now - the 3 environment variables will take precedence over AWS_WEB_IDENTITY_TOKEN_FILE
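If several pipelines need the same behavior, the retry loop can be pulled out into a small reusable function. This is only a sketch under our own assumptions, not the script we run in production. The name retry_with_backoff and its argument order are made up for illustration, and it doubles the delay after each failure rather than using the WAIT_FACTOR*COUNTER growth from the script above.

```shell
#!/usr/bin/env bash

# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# (hypothetical helper) Runs COMMAND until it exits 0 or MAX_ATTEMPTS
# attempts have been made. The delay doubles after every failed attempt.
retry_with_backoff() {
  local max_attempts=$1
  local delay=$2
  shift 2
  local attempt=1

  while true; do
    # Run the command in the current shell so its side effects persist.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

With this in place, the assume-role step could be wrapped in a function of your own that calls aws sts assume-role-with-web-identity and exports the three variables, then invoked as retry_with_backoff 5 2 followed by that function's name. Progress messages go to stderr, so the wrapped command's stdout stays clean if you capture it.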