Amazon Textract is a service that automatically extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. In this post, I show how to use Amazon Textract to extract text from scanned PDF files.
Before getting started, install the AWS SAM CLI and create an application with sample code using:
sam init -r nodejs12.x
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Globals:
  Function:
    Timeout: 60
Parameters:
  Stage:
    Type: String
    Default: dev
  BucketName:
    Type: String
    Default: aiyi.demo.textract
Resources:
  TextractSNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: !Sub "textract-sns-topic"
      TopicName: !Sub "textract-sns-topic"
      Subscription:
        - Protocol: lambda
          Endpoint: !GetAtt TextractEndFunction.Arn
  TextractSNSTopicPolicy:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref TextractEndFunction
      Principal: sns.amazonaws.com
      Action: lambda:InvokeFunction
      SourceArn: !Ref TextractSNSTopic
  TextractEndFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: handler.textractEndHandler
      Runtime: nodejs12.x
      Role: !GetAtt TextractRole.Arn
      Policies:
        - AWSLambdaExecute
        - Statement:
            - Effect: Allow
              Action:
                - "s3:PutObject"
              Resource: !Join [":", ["arn:aws:s3::", !Ref BucketName]]
  TextractStartFunction:
    Type: AWS::Serverless::Function
    Properties:
      Environment:
        Variables:
          TEXT_EXTRACT_ROLE: !GetAtt TextractRole.Arn
          SNS_TOPIC: !Ref TextractSNSTopic
      Role: !GetAtt TextractRole.Arn
      CodeUri: src/
      Handler: handler.textractStartHandler
      Runtime: nodejs12.x
      Events:
        PDFUploadEvent:
          Type: S3
          Properties:
            Bucket: !Ref S3Bucket
            Events: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: suffix
                    Value: ".pdf"
  TextractRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: "TextractRole"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "textract.amazonaws.com"
                - "lambda.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/AWSLambdaExecute"
      Policies:
        - PolicyName: "TextractRoleAccess"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "sns:*"
                Resource: "*"
              - Effect: Allow
                Action:
                  - "textract:*"
                Resource: "*"
  GetTextractResult:
    Type: AWS::Serverless::Function
    Properties:
      Role: !GetAtt TextractRole.Arn
      CodeUri: src/
      Handler: handler.getTextractResult
      Runtime: nodejs12.x
      Events:
        TextExactStart:
          Type: HttpApi
          Properties:
            Path: /textract
            Method: post
  MyHttpApi:
    Type: AWS::Serverless::HttpApi
    Properties:
      StageName: !Ref Stage
      Cors:
        AllowMethods: "'OPTIONS,POST,GET'"
        AllowHeaders: "'Content-Type'"
        AllowOrigin: "'*'"
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref BucketName
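const AWS = require("aws-sdk");
const path = require("path");
const { EOL } = require("os");

const textract = new AWS.Textract();
const s3 = new AWS.S3();

// Triggered by the S3 upload event; starts an asynchronous Textract analysis job.
exports.textractStartHandler = async (event) => {
  try {
    const bucket = event.Records[0].s3.bucket.name;
    // S3 event keys are URL-encoded, with spaces sent as "+".
    const key = decodeURIComponent(
      event.Records[0].s3.object.key.replace(/\+/g, " ")
    );
    const params = {
      DocumentLocation: {
        S3Object: {
          Bucket: bucket,
          Name: key
        }
      },
      FeatureTypes: ["TABLES", "FORMS"],
      // Textract publishes the job completion status to this SNS topic.
      NotificationChannel: {
        RoleArn: process.env.TEXT_EXTRACT_ROLE,
        SNSTopicArn: process.env.SNS_TOPIC
      }
    };
    const response = await textract.startDocumentAnalysis(params).promise();
    // The logged response contains the JobId needed to fetch results later.
    console.log(response);
  } catch (err) {
    console.log(err);
  }
};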
// Triggered by the SNS notification that Textract publishes when a job finishes.
exports.textractEndHandler = async (event) => {
  try {
    const {
      Sns: { Message }
    } = event.Records[0];
    const {
      JobId: jobId,
      Status: status,
      DocumentLocation: { S3ObjectName, S3Bucket }
    } = JSON.parse(Message);
    if (status === "SUCCEEDED") {
      const textResult = await getDocumentText(jobId, null);
      // Write the text next to the source bucket, e.g. ocrscan.pdf -> ocrscan.txt.
      const params = {
        Bucket: S3Bucket,
        Key: `${path.parse(S3ObjectName).name}.txt`,
        Body: textResult
      };
      await s3.putObject(params).promise();
    }
  } catch (error) {
    console.log(error);
    // Rethrow so Lambda records the invocation as failed.
    throw error;
  }
};
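The completion message that Textract publishes to SNS is a JSON string carrying the job ID, the status, and the location of the source document. The following is a minimal sketch of the parsing step in isolation, so it can be tried without deploying anything. The helper name `parseTextractNotification` and the sample values are made up for illustration and are not part of the handler above.

```javascript
const path = require("path");

// Hypothetical helper: extracts what the end handler needs from one SNS record.
function parseTextractNotification(snsRecord) {
  const { JobId, Status, DocumentLocation } = JSON.parse(snsRecord.Sns.Message);
  return {
    jobId: JobId,
    status: Status,
    bucket: DocumentLocation.S3Bucket,
    // Same naming rule as the handler: ocrscan.pdf -> ocrscan.txt
    outputKey: `${path.parse(DocumentLocation.S3ObjectName).name}.txt`
  };
}

// Made-up sample shaped like a Textract completion notification.
const sampleSnsRecord = {
  Sns: {
    Message: JSON.stringify({
      JobId: "example-job-id",
      Status: "SUCCEEDED",
      API: "StartDocumentAnalysis",
      DocumentLocation: {
        S3ObjectName: "ocrscan.pdf",
        S3Bucket: "aiyi.demo.textract"
      }
    })
  }
};

console.log(parseTextractNotification(sampleSnsRecord));
```

Note that `path.parse(...).name` drops any directory part of the key, so with this naming rule a source key such as `reports/ocrscan.pdf` would be written as `ocrscan.txt` at the bucket root.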
// Recursively pages through the job result and returns the text of all LINE blocks.
const getDocumentText = async (jobId, nextToken) => {
  console.log("nextToken", nextToken);
  const params = {
    JobId: jobId,
    MaxResults: 100
  };
  // Only send NextToken when fetching a follow-up page.
  if (nextToken) params.NextToken = nextToken;
  const {
    NextToken: _nextToken,
    Blocks: _blocks
  } = await textract.getDocumentAnalysis(params).promise();
  // Keep LINE blocks only and join with "" so no stray commas end up in the text.
  let textractResult = _blocks
    .filter(({ BlockType }) => BlockType === "LINE")
    .map(({ Text }) => `${Text}${EOL}`)
    .join("");
  if (_nextToken) {
    textractResult += await getDocumentText(jobId, _nextToken);
  }
  return textractResult;
};
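The pagination logic can be exercised without calling Textract at all. Here is a rough sketch, assuming a callback `fetchPage(nextToken)` that resolves to an object with `Blocks` and an optional `NextToken`, for example a thin wrapper around `textract.getDocumentAnalysis`. The names `linesFromBlocks`, `collectText`, and the fake pages are hypothetical, and the loop is iterative rather than recursive, which is a design choice of this sketch and not of the handler above.

```javascript
const { EOL } = require("os");

// Hypothetical helper: keeps only LINE blocks and returns their text, one per line.
function linesFromBlocks(blocks) {
  return blocks
    .filter(({ BlockType }) => BlockType === "LINE")
    .map(({ Text }) => `${Text}${EOL}`)
    .join("");
}

// Walks every page until no NextToken is returned.
// `fetchPage(nextToken)` is an assumed callback returning { Blocks, NextToken }.
async function collectText(fetchPage) {
  let text = "";
  let nextToken;
  do {
    const page = await fetchPage(nextToken);
    text += linesFromBlocks(page.Blocks);
    nextToken = page.NextToken;
  } while (nextToken);
  return text;
}

// Two fake pages standing in for Textract responses.
const fakePages = {
  first: {
    Blocks: [{ BlockType: "PAGE" }, { BlockType: "LINE", Text: "Invoice" }],
    NextToken: "p2"
  },
  p2: {
    Blocks: [
      { BlockType: "WORD", Text: "Total" },
      { BlockType: "LINE", Text: "Total 42" }
    ]
  }
};
const fakeFetch = async (token) => fakePages[token || "first"];

collectText(fakeFetch).then((text) => console.log(JSON.stringify(text)));
```

Filtering before mapping matters here: mapping non-LINE blocks to `undefined` and calling `join()` with no argument would insert a comma for every skipped block.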
// HTTP endpoint: returns one page of extracted text for a given Textract job.
exports.getTextractResult = async (event) => {
  try {
    const body = event.body ? JSON.parse(event.body) : {};
    if (!body.jobId) {
      return {
        statusCode: 400,
        body: JSON.stringify({ message: "jobId is required" })
      };
    }
    const params = {
      JobId: body.jobId,
      MaxResults: 100
    };
    // The API expects the key NextToken (capital N); send it only when provided.
    if (body.nextToken) params.NextToken = body.nextToken;
    const {
      JobStatus: jobStatus,
      NextToken: nextToken,
      Blocks: blocks
    } = await textract.getDocumentAnalysis(params).promise();
    let textractResult;
    if (jobStatus === "SUCCEEDED") {
      textractResult = blocks
        .filter(({ BlockType }) => BlockType === "LINE")
        .map(({ Text }) => `${Text}${EOL}`)
        .join("");
    }
    return {
      statusCode: 200,
      body: JSON.stringify({
        text: textractResult,
        jobStatus,
        nextToken
      })
    };
  } catch ({ statusCode, message }) {
    return {
      // Fall back to 500 when the error carries no HTTP status code.
      statusCode: statusCode || 500,
      body: JSON.stringify({ message })
    };
  }
};
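Because the endpoint returns `jobStatus` alongside the text, a client can poll it until the job leaves the `IN_PROGRESS` state. Below is a minimal sketch of such a loop, assuming a function `postJson(body)` that POSTs JSON to the `/textract` endpoint and resolves with the parsed response body. `waitForText`, `postJson`, `makeStubPostJson`, and the interval and attempt defaults are all hypothetical choices for illustration.

```javascript
// Hypothetical client helper: polls until the job is no longer IN_PROGRESS.
// `postJson(body)` is an assumed transport function, injected so it can be stubbed.
async function waitForText(
  postJson,
  jobId,
  { intervalMs = 2000, maxAttempts = 30 } = {}
) {
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    const result = await postJson({ jobId });
    if (result.jobStatus !== "IN_PROGRESS") return result;
    // Wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} still in progress after ${maxAttempts} attempts`);
}

// Stubbed transport for a dry run: in progress twice, then finished.
function makeStubPostJson() {
  const replies = [
    { jobStatus: "IN_PROGRESS" },
    { jobStatus: "IN_PROGRESS" },
    { jobStatus: "SUCCEEDED", text: "hello\n" }
  ];
  return async () => replies.shift();
}

waitForText(makeStubPostJson(), "example-job-id", { intervalMs: 10 }).then(
  (result) => console.log(result.jobStatus, JSON.stringify(result.text))
);
```

In a real client, `postJson` would wrap whatever HTTP library is in use, and a response that includes `nextToken` would still need follow-up requests to collect the remaining pages.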
$ sam deploy --guided
$ aws s3 cp ~/downloads/ocrscan.pdf s3://aiyi.demo.textract
You will find the Textract job ID in the CloudWatch log group of the Lambda function TextractStartFunction. To monitor the CloudWatch logs in real time, run the following command:
$ sam logs --name TextractStartFunction -t --region YOUR_REGION --stack-name sam-app-appv2
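Once the job has finished, pass the job ID to the HTTP endpoint to fetch the extracted text:
$ curl -d '{"jobId":"xxxxx2bd5ad43875edxxxx5aee29b65f273fxxxxx"}' -H "Content-Type: application/json" https://xxxx.execute-api.ap-southeast-2.amazonaws.com/textract | jq '.'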