Amazon Textract: Extract Text from PDF and Image Files [A How To Guide]

by Yi Ai, December 22nd, 2019

Too Long; Didn't Read

Amazon Textract is a service that automatically extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify the contents of fields in forms and tables. In this post, I show how we can use Amazon Textract to extract text from scanned PDF files. A PDF uploaded to an S3 bucket triggers a Lambda function that starts an asynchronous Textract analysis job. Once the job completes, another Lambda function is triggered and calls getDocumentAnalysis to get the JSON response. We then iterate over the blocks in the JSON and save the detected text to S3.

Amazon recently released Textract in the Asia Pacific (Sydney) region, so I decided to write a JavaScript OCR demo using Amazon Textract.

Amazon Textract is a service that automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.

In this post, I show how we can use Amazon Textract to extract text from scanned PDF files.

Overview of the process

Upload files to an S3 bucket.
An S3 event trigger invokes an AWS Lambda function, which calls an Amazon Textract asynchronous operation to analyse the uploaded document. When the analysis job completes, Textract publishes the job status to an SNS topic.
The SNS topic invokes another Lambda function, which reads the job status and, if it is SUCCEEDED, writes the extracted text to a .txt object in the S3 bucket.
An HTTP API endpoint can also return the job status and the extracted result for a given job ID.

The following diagram shows the architecture of the process.
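The data handed from one step to the next can be sketched with plain objects. This is a minimal illustration rather than the deployed code: the sample payloads are hand-written and trimmed to the fields used later in this post, the ARNs are placeholders, and the helper names startParamsFromS3Event and outputKeyFromNotification are assumptions introduced only for this example.

```javascript
const path = require("path");

// Build the StartDocumentAnalysis request from an S3 upload event.
// S3 URL-encodes object keys in event notifications (spaces arrive as "+").
function startParamsFromS3Event(event, roleArn, topicArn) {
  const { bucket, object } = event.Records[0].s3;
  return {
    DocumentLocation: {
      S3Object: {
        Bucket: bucket.name,
        Name: decodeURIComponent(object.key.replace(/\+/g, " "))
      }
    },
    FeatureTypes: ["TABLES", "FORMS"],
    NotificationChannel: { RoleArn: roleArn, SNSTopicArn: topicArn }
  };
}

// Derive the .txt object key from the completion message delivered through SNS.
// Returns null unless the job succeeded.
function outputKeyFromNotification(snsEvent) {
  const { Status, DocumentLocation } = JSON.parse(snsEvent.Records[0].Sns.Message);
  if (Status !== "SUCCEEDED") return null;
  return `${path.parse(DocumentLocation.S3ObjectName).name}.txt`;
}

// Hypothetical sample payloads (placeholder names and IDs)
const s3Event = {
  Records: [
    { s3: { bucket: { name: "aiyi.demo.textract" }, object: { key: "scan+2019.pdf" } } }
  ]
};
const snsEvent = {
  Records: [
    {
      Sns: {
        Message: JSON.stringify({
          JobId: "example-job-id",
          Status: "SUCCEEDED",
          DocumentLocation: { S3ObjectName: "scan 2019.pdf", S3Bucket: "aiyi.demo.textract" }
        })
      }
    }
  ]
};

const params = startParamsFromS3Event(
  s3Event,
  "arn:aws:iam::123456789012:role/TextractRole",
  "arn:aws:sns:ap-southeast-2:123456789012:textract-sns-topic"
);
console.log(params.DocumentLocation.S3Object.Name);
console.log(outputKeyFromNotification(snsEvent));
```

Keeping these two transformations as pure functions makes them easy to check locally before any AWS resource exists.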

Prerequisites

The following must be done before following this guide:

Set up an AWS account.
Configure the AWS CLI with user credentials.
Install .
jq (optional).

Before getting started, install the and create an application with sample code using:

sam init -r nodejs12.x

Let's get started

The project directory now contains a SAM template file (template.yaml). Let's start by defining the following resources in the template file:

Lambda functions and inline policies
S3 bucket
IAM role
SNS topic
HTTP API

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31

Globals:
  Function:
    Timeout: 60

Parameters:
  Stage:
    Type: String
    Default: dev
  BucketName:
    Type: String
    Default: aiyi.demo.textract

Resources:
  TextractSNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: !Sub "textract-sns-topic"
      TopicName: !Sub "textract-sns-topic"
      Subscription:
        - Protocol: lambda
          Endpoint: !GetAtt TextractEndFunction.Arn

  TextractSNSTopicPolicy:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref TextractEndFunction
      Principal: sns.amazonaws.com
      Action: lambda:InvokeFunction
      SourceArn: !Ref TextractSNSTopic

  TextractEndFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: handler.textractEndHandler
      Runtime: nodejs12.x
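      Role: !GetAtt TextractRole.Arn
      # Note: SAM ignores Policies when Role is set, so the permissions in
      # effect are the ones attached to TextractRole.
      Policies:
        - AWSLambdaExecute
        - Statement:
            - Effect: Allow
              Action:
                - "s3:PutObject"
              # Object-level actions need the /* suffix on the bucket ARN.
              Resource: !Sub "arn:aws:s3:::${BucketName}/*"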

  TextractStartFunction:
    Type: AWS::Serverless::Function
    Properties:
      Environment:
        Variables:
          TEXT_EXTRACT_ROLE: !GetAtt TextractRole.Arn
          SNS_TOPIC: !Ref TextractSNSTopic
      Role: !GetAtt TextractRole.Arn
      CodeUri: src/
      Handler: handler.textractStartHandler
      Runtime: nodejs12.x
      Events:
        PDFUploadEvent:
          Type: S3
          Properties:
            Bucket: !Ref S3Bucket
            Events: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: suffix
                    Value: ".pdf"

  TextractRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: "TextractRole"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "textract.amazonaws.com"
                - "lambda.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/AWSLambdaExecute"
      Policies:
        - PolicyName: "TextractRoleAccess"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "sns:*"
                Resource: "*"
              - Effect: Allow
                Action:
                  - "textract:*"
                Resource: "*"

  GetTextractResult:
    Type: AWS::Serverless::Function
    Properties:
      Role: !GetAtt TextractRole.Arn
      CodeUri: src/
      Handler: handler.getTextractResult
      Runtime: nodejs12.x
      Events:
        TextExactStart:
          Type: HttpApi
          Properties:
            Path: /textract
            Method: post

  MyHttpApi:
    Type: AWS::Serverless::HttpApi
    Properties:
      StageName: !Ref Stage
      Cors:
        AllowMethods: "'OPTIONS,POST,GET'"
        AllowHeaders: "'Content-Type'"
        AllowOrigin: "'*'"

  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref BucketName

Note that the API Gateway HTTP API resource type, AWS::Serverless::HttpApi, is still in beta and is subject to change, so please don’t use it in production.

The following code example shows how to send a PDF to an Amazon Textract asynchronous operation from a Lambda function with a few lines of code. Once the Textract analysis job completes, another Lambda function is triggered and gets the JSON response back by calling getDocumentAnalysis. We then iterate over the blocks in the JSON and save the detected text to S3.

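const path = require("path");
const { EOL } = require("os");
// AWS SDK for JavaScript v2, which is bundled with the nodejs12.x runtime
const AWS = require("aws-sdk");

const textract = new AWS.Textract();
const s3 = new AWS.S3();

// Triggered by the S3 upload: starts an asynchronous Textract analysis job.
exports.textractStartHandler = async (event) => {
  try {
    const bucket = event.Records[0].s3.bucket.name;
    // Object keys arrive URL-encoded in S3 events (spaces as "+").
    const key = decodeURIComponent(
      event.Records[0].s3.object.key.replace(/\+/g, " ")
    );
    const params = {
      DocumentLocation: {
        S3Object: {
          Bucket: bucket,
          Name: key
        }
      },
      FeatureTypes: ["TABLES", "FORMS"],
      NotificationChannel: {
        RoleArn: process.env.TEXT_EXTRACT_ROLE,
        SNSTopicArn: process.env.SNS_TOPIC
      }
    };
    // The response contains the JobId used to fetch the results later.
    const response = await textract.startDocumentAnalysis(params).promise();
    console.log(response);
  } catch (err) {
    console.log(err);
  }
};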
// Triggered by the SNS notification that Textract publishes when the job finishes.
// Errors propagate as a rejected promise, so Lambda records the invocation as failed.
exports.textractEndHandler = async (event) => {
  const {
    Sns: { Message }
  } = event.Records[0];
  const {
    JobId: jobId,
    Status: status,
    DocumentLocation: { S3ObjectName, S3Bucket }
  } = JSON.parse(Message);
  if (status === "SUCCEEDED") {
    const textResult = await getDocumentText(jobId, null);
    // Save the text next to the source document, e.g. ocrscan.pdf -> ocrscan.txt
    const params = {
      Bucket: S3Bucket,
      Key: `${path.parse(S3ObjectName).name}.txt`,
      Body: textResult
    };
    await s3.putObject(params).promise();
  }
};
// Pages through the analysis result recursively and returns the text of all LINE blocks.
const getDocumentText = async (jobId, nextToken) => {
  console.log("nextToken", nextToken);
  const params = {
    JobId: jobId,
    MaxResults: 100
  };
  // Send NextToken only when there is a further page to request.
  if (nextToken) params.NextToken = nextToken;
  const {
    NextToken: _nextToken,
    Blocks: _blocks
  } = await textract.getDocumentAnalysis(params).promise();
  // Keep only LINE blocks; the empty separator avoids the commas a bare join() adds.
  let textractResult = _blocks
    .filter(({ BlockType }) => BlockType === "LINE")
    .map(({ Text }) => `${Text}${EOL}`)
    .join("");
  if (_nextToken) {
    textractResult += await getDocumentText(jobId, _nextToken);
  }
  return textractResult;
};
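The handlers request the TABLES and FORMS feature types but only save LINE text. As a rough sketch of what else the blocks carry, the snippet below pulls form fields out of KEY_VALUE_SET blocks by following their VALUE and CHILD relationships. The helper names linesToText, childText and formFields are assumptions made for this example, and the sample blocks are hand-written in the shape of a Textract response rather than taken from a real job.

```javascript
const { EOL } = require("os");

// Join the text of LINE blocks, one output line per detected line.
function linesToText(blocks) {
  return blocks
    .filter(({ BlockType }) => BlockType === "LINE")
    .map(({ Text }) => Text)
    .join(EOL);
}

// Collect the text of the WORD children of a block.
function childText(block, byId) {
  return (block.Relationships || [])
    .filter(({ Type }) => Type === "CHILD")
    .flatMap(({ Ids }) => Ids)
    .map((id) => byId.get(id))
    .filter((child) => child && child.BlockType === "WORD")
    .map((child) => child.Text)
    .join(" ");
}

// Pair each KEY block with the VALUE block it points to.
function formFields(blocks) {
  const byId = new Map(blocks.map((block) => [block.Id, block]));
  const fields = {};
  for (const block of blocks) {
    const isKey =
      block.BlockType === "KEY_VALUE_SET" &&
      (block.EntityTypes || []).includes("KEY");
    if (!isKey) continue;
    const valueText = (block.Relationships || [])
      .filter(({ Type }) => Type === "VALUE")
      .flatMap(({ Ids }) => Ids)
      .map((id) => byId.get(id))
      .filter(Boolean)
      .map((valueBlock) => childText(valueBlock, byId))
      .join(" ");
    fields[childText(block, byId)] = valueText;
  }
  return fields;
}

// Hand-written sample shaped like a Textract response
const sampleBlocks = [
  { Id: "l1", BlockType: "LINE", Text: "Name: Jane Doe" },
  { Id: "w1", BlockType: "WORD", Text: "Name:" },
  { Id: "w2", BlockType: "WORD", Text: "Jane" },
  { Id: "w3", BlockType: "WORD", Text: "Doe" },
  {
    Id: "k1",
    BlockType: "KEY_VALUE_SET",
    EntityTypes: ["KEY"],
    Relationships: [
      { Type: "VALUE", Ids: ["v1"] },
      { Type: "CHILD", Ids: ["w1"] }
    ]
  },
  {
    Id: "v1",
    BlockType: "KEY_VALUE_SET",
    EntityTypes: ["VALUE"],
    Relationships: [{ Type: "CHILD", Ids: ["w2", "w3"] }]
  }
];

console.log(linesToText(sampleBlocks));
console.log(formFields(sampleBlocks));
```

Because a key and its value may arrive on different result pages, a lookup like this would normally run over the blocks of all pages combined.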

Now let’s add another Lambda function as a REST endpoint, using the HTTP API defined in template.yaml. With this endpoint, we can retrieve the text analysis result and job status by Textract job ID.

// HTTP endpoint: returns the job status and one page of extracted text for a job ID.
exports.getTextractResult = async (event) => {
  try {
    const body = event.body ? JSON.parse(event.body) : {};
    if (!body.jobId) {
      return {
        statusCode: 400,
        body: JSON.stringify({ message: "jobId is required" })
      };
    }
    const params = {
      JobId: body.jobId,
      MaxResults: 100
    };
    // The SDK parameter is NextToken (capitalised); send it only when present.
    if (body.nextToken) params.NextToken = body.nextToken;
    const {
      JobStatus: jobStatus,
      NextToken: nextToken,
      Blocks: blocks
    } = await textract.getDocumentAnalysis(params).promise();

    let textractResult = "";
    if (jobStatus === "SUCCEEDED") {
      // Keep only LINE blocks; the empty separator avoids stray commas.
      textractResult = blocks
        .filter(({ BlockType }) => BlockType === "LINE")
        .map(({ Text }) => `${Text}${EOL}`)
        .join("");
    }
    return {
      statusCode: 200,
      body: JSON.stringify({
        text: textractResult,
        jobStatus,
        nextToken
      })
    };
  } catch ({ statusCode, message }) {
    return {
      statusCode: statusCode || 500,
      body: JSON.stringify({ message })
    };
  }
};
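Since the endpoint returns at most one page of results along with a nextToken, a caller has to keep requesting pages until the token is absent. A minimal sketch of that loop follows. The function name collectText and the injected fetchPage callback are assumptions for this example: fetchPage stands for whatever code posts { jobId, nextToken } to the endpoint and resolves to the parsed response body, and here it is replaced by canned data so the sketch runs without a deployment.

```javascript
// Gather the text of every result page for one job.
// `fetchPage` is any async function that resolves to { text, jobStatus, nextToken }.
async function collectText(fetchPage, jobId) {
  let text = "";
  let nextToken;
  do {
    const page = await fetchPage({ jobId, nextToken });
    if (page.jobStatus !== "SUCCEEDED") {
      // Nothing to collect yet (or the job failed); report the status as-is.
      return { jobStatus: page.jobStatus, text: "" };
    }
    text += page.text || "";
    nextToken = page.nextToken;
  } while (nextToken);
  return { jobStatus: "SUCCEEDED", text };
}

// Stand-in for the real HTTP call: two canned pages keyed by token.
const cannedPages = {
  first: { text: "page one\n", jobStatus: "SUCCEEDED", nextToken: "t2" },
  t2: { text: "page two\n", jobStatus: "SUCCEEDED" }
};
const fakeFetchPage = async ({ nextToken }) => cannedPages[nextToken || "first"];

const collected = collectText(fakeFetchPage, "example-job-id");
collected.then((result) => console.log(result));
```

A loop is used here instead of recursion so that long documents with many pages do not grow the call stack.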

Note that Amazon Textract retains the results of asynchronous operations for 7 days.

Now let’s deploy the service and test it out!

$ sam deploy --guided

After the deployment finishes, copy a PDF file to the S3 bucket.

$ aws s3 cp ~/downloads/ocrscan.pdf s3://aiyi.demo.textract

You will find the Textract job ID in the CloudWatch log group of the Lambda function TextractStartFunction. To monitor the CloudWatch logs in real time, run the following command:

$ sam logs --name TextractStartFunction -t --region YOUR_REGION --stack-name sam-app-appv2

Let’s check the job status by calling the API endpoint we just deployed.

$ curl -d '{"jobId":"xxxxx2bd5ad43875edxxxx5aee29b65f273fxxxxx"}' -H "Content-Type: application/json" https://xxxx.execute-api.ap-southeast-2.amazonaws.com/textract | jq '.'
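Because the analysis runs asynchronously, the first call may still report IN_PROGRESS. One way to handle that from a script is to poll until the status changes. The sketch below assumes a caller-supplied getStatus function that wraps the request above and resolves to an object with a jobStatus field. The names waitForJob and getStatus, the retry defaults, and the TIMED_OUT label are all inventions of this example, not part of Textract.

```javascript
// Poll a status function until the job leaves IN_PROGRESS or the attempts run out.
async function waitForJob(getStatus, { attempts = 10, delayMs = 2000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    const { jobStatus } = await getStatus();
    if (jobStatus !== "IN_PROGRESS") return jobStatus;
    // Wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return "TIMED_OUT"; // label local to this sketch, not a Textract status
}

// Stand-in that reports IN_PROGRESS twice, then SUCCEEDED.
const statuses = ["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"];
let calls = 0;
const fakeGetStatus = async () => ({
  jobStatus: statuses[Math.min(calls++, statuses.length - 1)]
});

const finalStatus = waitForJob(fakeGetStatus, { delayMs: 10 });
finalStatus.then((status) => console.log(status, "after", calls, "calls"));
```

In the deployed service the SNS notification already signals completion, so polling like this is only a convenience for ad hoc checks from the command line.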