If you just want to see the code, see GitHub.
In certain applications, you need to run an LLM on your own infrastructure. You may want to process sensitive data (medical records or legal documents), or need high-quality output in a language other than English. Sometimes you simply have a specialized task that doesn’t require expensive, big models from OpenAI.
We will download the model from Hugging Face and run it via the llama-cpp-python package (bindings to the popular llama.cpp, a heavily optimized CPU runtime for models). We will use a smaller, quantized version, but even the full one should fit in Lambda memory. Here is the end result we are aiming for: a single curl call to the deployed Lambda Function URL.
PROMPT="Create five questions for a job interview for a senior python software engineer position."
curl $LAMBDA_URL -d "{ \"prompt\": \"$PROMPT\" }" \
| jq -r '.choices[0].text, .usage'
Instruct: Create five questions for a job interview for a senior python software engineer position.
Output: Questions:
1. What experience do you have in developing web applications?
2. What is your familiarity with different Python programming languages?
3. How would you approach debugging a complex Python program?
4. Can you explain how object-oriented programming principles can be applied to software development?
5. In a recent project, you were responsible for managing the codebase of a team of developers. Can you discuss your experience with this process?
{
  "prompt_tokens": 21,
  "completion_tokens": 95,
  "total_tokens": 116
}
First, we will need a basic Python Lambda function handler. In your project folder, create a file called lambda_function.py.
import sys

def handler(event, context):
    return "Hello from AWS Lambda using Python " + sys.version + "!"
We will also create a requirements.txt file, in which we specify our dependencies. Let’s start with the AWS library for interacting with their services, just as an example.
boto3
Then, we need to define our Docker image in a Dockerfile. Comments document what each line does.
FROM public.ecr.aws/lambda/python:3.12
# Copy requirements.txt
COPY requirements.txt ${LAMBDA_TASK_ROOT}
# Install the specified packages
RUN pip install -r requirements.txt
# Copy function code
COPY lambda_function.py ${LAMBDA_TASK_ROOT}
# Set the CMD to your handler
CMD [ "lambda_function.handler" ]
Finally, we will create a docker-compose.yml file to make our life easier when building and running the container. The Lambda base image ships with the Runtime Interface Emulator listening on port 8080, so we map it to port 9000 locally.
version: '3'
services:
  llm-lambda:
    image: llm-lambda
    build: .
    ports:
      - 9000:8080
Now we can build and run the container, and invoke the function locally:
docker-compose up
curl "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
To run an LLM, we need to add llama-cpp-python to our requirements.txt.
boto3
llama-cpp-python
To build it, we need to introduce a multi-stage Docker build. This is because the default Amazon Docker image doesn’t include the build tools required for llama-cpp. We run pip install with the CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" flags in order to use multi-threaded optimizations.
The code below also includes downloading the model. Community hero TheBloke shares compressed (quantized: less computationally intensive, but slightly lower quality) versions of models, ready to download. To make use of them, we can install the Hugging Face CLI and run the appropriate commands. You can switch the repository (TheBloke/phi-2-GGUF) and the model file (phi-2.Q4_K_M.gguf) to whatever you like if you want to deploy a different model.
RUN pip install huggingface-hub && \
mkdir model && \
huggingface-cli download TheBloke/phi-2-GGUF phi-2.Q4_K_M.gguf --local-dir ./model --local-dir-use-symlinks False
# Stage 1: Build environment using a Python base image
FROM python:3.12 as builder
# Install build tools
RUN apt-get update && apt-get install -y gcc g++ cmake zip
# Copy requirements.txt and install packages with appropriate CMAKE_ARGS
COPY requirements.txt .
RUN CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install --upgrade pip && pip install -r requirements.txt
# Stage 2: Final image using AWS Lambda Python image
FROM public.ecr.aws/lambda/python:3.12
# Install huggingface-cli and download the model
RUN pip install huggingface-hub && \
mkdir model && \
huggingface-cli download TheBloke/phi-2-GGUF phi-2.Q4_K_M.gguf --local-dir ./model --local-dir-use-symlinks False
# Copy installed packages from builder stage
COPY --from=builder /usr/local/lib/python3.12/site-packages/ /var/lang/lib/python3.12/site-packages/
# Copy lambda function code
COPY lambda_function.py ${LAMBDA_TASK_ROOT}
CMD [ "lambda_function.handler" ]
import base64
import json

from llama_cpp import Llama

# Load the LLM, outside the handler so it persists between runs
llm = Llama(
    model_path="./model/phi-2.Q4_K_M.gguf",  # change if different model
    n_ctx=2048,  # context length
    n_threads=6,  # maximum in AWS Lambda
)

def handler(event, context):
    print("Event is:", event)
    print("Context is:", context)

    # Locally the body is not encoded, via Lambda URL it is
    try:
        if event.get('isBase64Encoded', False):
            body = base64.b64decode(event['body']).decode('utf-8')
        else:
            body = event['body']

        body_json = json.loads(body)
        prompt = body_json["prompt"]
    except (KeyError, json.JSONDecodeError) as e:
        return {"statusCode": 400, "body": f"Error processing request: {str(e)}"}

    output = llm(
        f"Instruct: {prompt}\nOutput:",
        max_tokens=512,
        echo=True,
    )

    return {
        "statusCode": 200,
        "body": json.dumps(output)
    }
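A Function URL delivers the request with a base64-encoded body, while the local emulator passes it through as-is, which is why the handler checks isBase64Encoded. As a quick sanity check (assuming llama-cpp-python and the downloaded model are available on your machine), you can exercise that code path directly with a simplified, hand-built event; real events contain many more fields:
import base64
import json

from lambda_function import handler

# Simplified version of the event a Lambda Function URL delivers
# (real events also carry headers, requestContext, and more)
payload = json.dumps({"prompt": "Generate a good name for a bakery."})
event = {
    "isBase64Encoded": True,
    "body": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
}

response = handler(event, None)  # our handler does not use the context
print(json.loads(response["body"])["choices"][0]["text"])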
Rebuild the image and test the new function locally:
docker-compose up --build
curl "http://localhost:9000/2015-03-31/functions/function/invocations" \
-d '{ "body": "{ \"prompt\": \"Generate a good name for a bakery.\" }" }'
Instruct: Generate a good name for a bakery.
Output: Sugar Rush Bakery.
{
  "prompt_tokens": 15,
  "completion_tokens": 6,
  "total_tokens": 21
}
The deploy script (deploy.sh, shown below) performs the following steps:
Initial Setup: Determine necessary information such as the AWS region, ECR repository name, Docker platform, and IAM policy file, and disable the AWS CLI pager.
Verify AWS Configuration: Optionally confirm that the AWS CLI is correctly configured.
ECR (Elastic Container Registry) Repository Management: Check whether the repository exists; create it if it does not.
IAM Role Handling: Check whether the Lambda execution role exists. If not found, establish the IAM role and attach the AWSLambdaBasicExecutionRole policy.
Docker-ECR Authentication: Securely log Docker into the ECR registry using the retrieved login credentials.
Docker Image Construction: Utilize Docker Compose to build the Docker image, specifying the desired platform.
ECR Image Tagging: Label the Docker image appropriately for ECR upload.
ECR Image Upload: Transfer the tagged Docker image to the ECR.
Acquire IAM Role ARN: Fetch the ARN linked to the specified IAM role.
Lambda Function Verification: Assess whether the Lambda function exists.
Lambda Function Configuration: Set parameters like timeout, memory allocation, and image URI for Lambda.
Lambda Function Deployment/Update: Create the Lambda function (together with a public Function URL) if it does not exist; otherwise, update its container image.
Function URL Retrieval: Obtain and display the Function URL of the Lambda function.
#!/bin/bash
# Variables
AWS_REGION="eu-central-1"
ECR_REPO_NAME="llm-lambda"
IMAGE_TAG="latest"
LAMBDA_FUNCTION_NAME="llm-lambda"
LAMBDA_ROLE_NAME="llm-lambda-role" # Role name to create, not ARN
DOCKER_PLATFORM="linux/arm64" # Change as needed, e.g., linux/amd64
IAM_POLICY_FILE="trust-policy.json"
export AWS_PAGER="" # Disable pager for AWS CLI
# Authenticate with AWS
aws configure list # Optional, just to verify AWS CLI is configured
# Check if the ECR repository exists
REPO_EXISTS=$(aws ecr describe-repositories --repository-names $ECR_REPO_NAME --region $AWS_REGION 2>&1)
if [ $? -ne 0 ]; then
    echo "Repository does not exist. Creating repository: $ECR_REPO_NAME"
    # Create ECR repository
    aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION
else
    echo "Repository $ECR_REPO_NAME already exists. Skipping creation."
fi
# Check if the Lambda IAM role exists
ROLE_EXISTS=$(aws iam get-role --role-name $LAMBDA_ROLE_NAME 2>&1)
if [ $? -ne 0 ]; then
    echo "IAM role does not exist. Creating role: $LAMBDA_ROLE_NAME"
    # Create IAM role for Lambda
    aws iam create-role --role-name $LAMBDA_ROLE_NAME --assume-role-policy-document file://$IAM_POLICY_FILE
    aws iam attach-role-policy --role-name $LAMBDA_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
else
    echo "IAM role $LAMBDA_ROLE_NAME already exists. Skipping creation."
fi
# Get login command from ECR and execute it to authenticate Docker to the registry
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $(aws sts get-caller-identity --query Account --output text).dkr.ecr.$AWS_REGION.amazonaws.com
# Build the Docker image using Docker Compose with specific platform
DOCKER_BUILDKIT=1 docker-compose build --build-arg BUILDPLATFORM=$DOCKER_PLATFORM
# Tag the Docker image for ECR
docker tag llm-lambda:latest $(aws sts get-caller-identity --query Account --output text).dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO_NAME:$IMAGE_TAG
# Push the Docker image to ECR
docker push $(aws sts get-caller-identity --query Account --output text).dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO_NAME:$IMAGE_TAG
# Get the IAM role ARN
LAMBDA_ROLE_ARN=$(aws iam get-role --role-name $LAMBDA_ROLE_NAME --query 'Role.Arn' --output text)
# Check if Lambda function exists
FUNCTION_EXISTS=$(aws lambda get-function --function-name $LAMBDA_FUNCTION_NAME --region $AWS_REGION 2>&1)
# Parameters for Lambda function
LAMBDA_TIMEOUT=300 # 5 minutes in seconds
LAMBDA_MEMORY_SIZE=10240 # Maximum memory size in MB
LAMBDA_IMAGE_URI=$(aws sts get-caller-identity --query Account --output text).dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO_NAME:$IMAGE_TAG
# Deploy or update the Lambda function
if echo $FUNCTION_EXISTS | grep -q "ResourceNotFoundException"; then
    echo "Creating new Lambda function: $LAMBDA_FUNCTION_NAME"
    aws lambda create-function --function-name $LAMBDA_FUNCTION_NAME \
        --region $AWS_REGION \
        --role $LAMBDA_ROLE_ARN \
        --timeout $LAMBDA_TIMEOUT \
        --memory-size $LAMBDA_MEMORY_SIZE \
        --package-type Image \
        --architectures arm64 \
        --code ImageUri=$LAMBDA_IMAGE_URI
    aws lambda create-function-url-config --function-name $LAMBDA_FUNCTION_NAME \
        --auth-type "NONE" --region $AWS_REGION
    # Add permission to allow public access to the Function URL
    aws lambda add-permission --function-name $LAMBDA_FUNCTION_NAME \
        --region $AWS_REGION \
        --statement-id "FunctionURLAllowPublicAccess" \
        --action "lambda:InvokeFunctionUrl" \
        --principal "*" \
        --function-url-auth-type "NONE"
else
    echo "Updating existing Lambda function: $LAMBDA_FUNCTION_NAME"
    aws lambda update-function-code --function-name $LAMBDA_FUNCTION_NAME \
        --region $AWS_REGION \
        --image-uri $LAMBDA_IMAGE_URI
fi
# Retrieve and print the Function URL
FUNCTION_URL=$(aws lambda get-function-url-config --region $AWS_REGION --function-name $LAMBDA_FUNCTION_NAME --query 'FunctionUrl' --output text)
echo "Lambda Function URL: $FUNCTION_URL"
Make the script executable and run it:
chmod +x deploy.sh
./deploy.sh
# AWS_PROFILE=your_profile ./deploy.sh
# if you want to use a profile other than the default
Finally, we can test the deployed function. Save the request below as test_remote.sh:
PROMPT="Create five questions for a job interview for a senior python software engineer position."
curl $LAMBDA_URL -d "{ \"prompt\": \"$PROMPT\" }" \
| jq -r '.choices[0].text, .usage'
chmod +x test_remote.sh
LAMBDA_URL=your_url ./test_remote.sh
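If you prefer Python over curl, a rough equivalent using the requests library could look like this (the URL is a placeholder; use the one printed by deploy.sh):
import requests

# Placeholder: paste the Function URL printed by deploy.sh
LAMBDA_URL = "https://your-function-url.lambda-url.eu-central-1.on.aws/"

response = requests.post(
    LAMBDA_URL,
    json={"prompt": "Generate a good name for a bakery."},
    timeout=300,  # a cold start can take several minutes
)
result = response.json()

print(result["choices"][0]["text"])
print(result["usage"])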
After the cold start, it consistently took about 9 seconds to generate an output, tested in 4 subsequent runs. The average output length was 83 tokens, so we achieved roughly 9.2 tokens per second when running hot.
You could run this prompt about 4,444 times within the free tier. Above that, it would cost roughly $1.20 per 1,000 runs. The same 1,000 runs with GPT-3.5 Turbo would cost about $0.20. This makes the endeavor not really cost-effective, so low cost shouldn’t be your goal if you’re going to implement this in production.
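For reference, here is the back-of-the-envelope arithmetic behind those numbers, assuming the arm64 Lambda compute price of roughly $0.0000133334 per GB-second and the 400,000 GB-second monthly free tier (check current AWS pricing; the small per-request charge is ignored):
# Rough cost estimate for the hot runs described above (assumed pricing)
memory_gb = 10240 / 1024              # reserved Lambda memory, in GB
seconds_per_run = 9                   # observed hot execution time
price_per_gb_second = 0.0000133334    # assumed arm64 price per GB-second
free_tier_gb_seconds = 400_000        # assumed monthly free tier

gb_seconds_per_run = memory_gb * seconds_per_run         # 90 GB-seconds
free_runs = free_tier_gb_seconds / gb_seconds_per_run    # ~4444 runs
cost_per_1000_runs = 1000 * gb_seconds_per_run * price_per_gb_second  # ~$1.20

print(f"{free_runs:.0f} free runs, then ${cost_per_1000_runs:.2f} per 1000 runs")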
After changing the number of threads to the proper 3, the execution almost timed out on a cold start, returning results after 4 minutes and 33 seconds; the subsequent 3 runs took 26-40 seconds, averaging 3.15 tokens/second, and 1,000 paid runs would cost $1.45. So we got slower and paid more; not a good optimisation! With this in mind, it makes sense to pay for more memory, because AWS assigns CPU resources proportionally to the reserved RAM amount.