In an earlier post, I described how to use Amazon Textract to extract lines of text from an image file. In today’s post, I describe how Textract can be used to extract lines of text from a PDF file. Some of the information below is repeated from that earlier post. If you’ve already completed that setup, you can skip to the section on S3 buckets.

This post has instructions for using the Textract API with the AWS SDK for PHP. I’m using PHP version 7.0 on an Ubuntu 21 operating system. This demo works as of October 2021.

Step 1: Create the project

Create a folder for your project, for example:

mkdir ~/TextractDemo ; cd ~/TextractDemo

Instructions for getting started with the SDK for PHP are here. First, download the .zip file as described on that page. Then, extract the zip file to the root of your project. That adds a lot of files and folders to the project root. For example, the “Aws” folder is added. This is what you should see when listing the contents of this directory:

~/TextractDemo$ ls -lairt
total 676
  396747 -rw-r--r--   1 fullstackdev fullstackdev  10129 Sep 12 14:11 README.md
  531373 drwxr-xr-x   3 fullstackdev fullstackdev   4096 Sep 12 14:11 Psr
  396739 -rw-r--r--   1 fullstackdev fullstackdev   2881 Sep 12 14:11 NOTICE.md
  399132 -rw-r--r--   1 fullstackdev fullstackdev   9202 Sep 12 14:11 LICENSE.md
  926072 drwxr-xr-x   2 fullstackdev fullstackdev   4096 Sep 12 14:11 JmesPath
  396755 drwxr-xr-x   7 fullstackdev fullstackdev   4096 Sep 12 14:11 GuzzleHttp
  399129 -rw-r--r--   1 fullstackdev fullstackdev 478403 Sep 12 14:11 CHANGELOG.md
  396748 -rw-r--r--   1 fullstackdev fullstackdev 132879 Sep 12 14:11 aws-autoloader.php
  531270 drwxr-xr-x 203 fullstackdev fullstackdev  12288 Sep 12 14:11 Aws
  396729 drwxr-xr-x   6 fullstackdev fullstackdev   4096 Sep 15 09:48 .
13500418 drwxr-xr-x  46 fullstackdev fullstackdev  20480 Sep 15 09:49 ..

Step 2: Create an IAM User

In order to use the Textract API, you need an Amazon AWS account. So if you don’t have that already, go follow the instructions to do that now.

Assuming you’ve got an AWS account, next you need to create an IAM (Identity and Access Management) user. If you are signed in to your AWS console, just search for “Identity and Access Management”, and it takes you to the right place to create an IAM user. Then:

  1. Find the area called “Create individual IAM users”, and click the “Manage Users” button.
  2. Click the “Add User” button, choose a name like TextractUser, and give this user programmatic access only.
  3. Go to the next step, where you can add the user to a specific group. Create a group which has the AmazonTextractFullAccess policy, name it something like TextractFullAccessGroup, and save that.
  4. Add the user you just created to this group.
  5. The next step lets you add tags to the user, but you can leave that blank.

In the Review (last) step, you are given the user’s access key ID and secret key (which is hidden – you will have to reveal it to copy it). Save these in a secure place! As the documentation says, “This is the last time these credentials will be available to download. However, you can create new credentials at any time.” (So if you lose them somehow, you can always generate a new set.)

The credentials that you just created may be saved in the file ~/.aws/credentials on Linux systems. Here’s a quick rundown about that file.

If this file already exists, you can add to it. Here’s the documentation for adding lines to an AWS credentials file. On that page, it gives you an example credentials file with this content:

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
[user1]
aws_access_key_id=AKIAI44QH8DHBEXAMPLE
aws_secret_access_key=je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY

Instead of user1, add the line [TextractUser] (or whatever user name you used in the ‘creating user’ step above). Copy and paste your access key id and secret key as shown.
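Before moving on, it can be handy to confirm that the new profile really is in the file. Here is a minimal sketch that does that with PHP's built-in parse_ini_file. The function name credentialsProfileExists is made up for this post, and the path and profile name are assumptions that you should adjust to your own setup.

```php
<?php
// Sketch only: checks that a named profile with both keys exists in an
// AWS-style credentials file. The function name is hypothetical.
function credentialsProfileExists(string $path, string $profile): bool
{
    if (!is_readable($path)) {
        return false;
    }
    // INI_SCANNER_RAW keeps values such as secret keys exactly as written.
    $sections = parse_ini_file($path, true, INI_SCANNER_RAW);
    if ($sections === false || !isset($sections[$profile])) {
        return false;
    }
    $entry = $sections[$profile];
    return !empty($entry['aws_access_key_id'])
        && !empty($entry['aws_secret_access_key']);
}

// Example usage: the path and profile name below are assumptions.
$credentialsPath = getenv('HOME') . '/.aws/credentials';
echo credentialsProfileExists($credentialsPath, 'TextractUser')
    ? "Profile found\n"
    : "Profile missing or incomplete\n";
```

This only checks that the section and its two keys are present. It does not check that the keys are valid.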

The credentials file is normally created when you configure the AWS CLI (for example, by running aws configure). So if you do not already have a credentials file, install and configure the CLI first. Then you can add users to the file.

Now we’re ready to use Textract.

Step 3: Upload your PDF to an S3 bucket

We are going to try to detect text in a sample PDF file. An example PDF file is included with this project on GitHub. You will need to upload your PDF file to your own AWS S3 bucket. Creating an S3 bucket is easy, but beyond the scope of this blog post. Please follow Amazon’s documentation for creating a bucket and uploading a file to it. Once you do that, you will know the bucket name and the PDF file name to use in the code below.

Important! The asynchronous Textract operations used in this post only read documents from an S3 bucket. You can’t hand them a local PDF file, so don’t even try :)
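The bucket name and file name end up in a nested 'DocumentLocation' array that is passed to StartDocumentTextDetection. Here is a minimal sketch of a helper that builds that array. The function name buildStartOptions is made up for illustration. The 'Version' key is only included when you pass one, since it is optional and only applies to buckets that have versioning enabled.

```php
<?php
// Sketch only: builds the argument array for StartDocumentTextDetection.
// The helper name is hypothetical; the array keys follow the Textract API.
function buildStartOptions(string $bucket, string $key, string $version = ''): array
{
    $s3Object = ['Bucket' => $bucket, 'Name' => $key];
    if ($version !== '') {
        // Only send a version when the caller supplied one.
        $s3Object['Version'] = $version;
    }
    return ['DocumentLocation' => ['S3Object' => $s3Object]];
}

// Example usage with placeholder names:
print_r(buildStartOptions('my-textract-s3-bucket-us-west-2', 'my-special-file.pdf'));
```

The returned array can be passed straight to the client call, so the bucket details live in one place.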

Call Textract on a PDF using the SDK

We need source code to do two separate things.

First, we write one little program that creates a Textract client and uses that client to call StartDocumentTextDetection. The second little program uses the output of the first to call GetDocumentTextDetection. You can’t run the second program until you’ve got the result of running the first.

Let me explain a little further. The function StartDocumentTextDetection is asynchronous: it spins off a little worker that takes some time to process your document, and it won’t give you the extracted text immediately. So you need a way to contact AWS later to get the output of the worker that processed your PDF file. The ‘JobId’ in the output of StartDocumentTextDetection gives you that.

JobId is then used in your other little program to retrieve the text data you are trying to extract.

You can save the output of StartDocumentTextDetection in many different ways. In this example, we’re just going to print the ‘JobId’ value to the console, and then copy and paste it into our second program. It’s inefficient, but works well for the purposes of a demo. If you had a lot of documents to process, you would need to devise a way to automatically upload them to S3 buckets, process the documents, wait for the output, retrieve it, and so on. But again, for a demonstration, this hacky procedure is simple and quick.
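If copying and pasting gets tedious, one small step up is to write the ‘JobId’ to a file in the first program and read it back in the second. This is only a sketch of that idea, not what the demo below does. The function names saveJobId and loadJobId, the file path, and the placeholder value are all made up for illustration.

```php
<?php
// Sketch only: persist a JobId between two separate script runs.
// Both function names are hypothetical helpers, not part of the AWS SDK.
function saveJobId(string $path, string $jobId): bool
{
    // LOCK_EX avoids a half-written file if two runs overlap.
    return file_put_contents($path, $jobId . "\n", LOCK_EX) !== false;
}

function loadJobId(string $path): string
{
    if (!is_readable($path)) {
        return '';
    }
    // trim() drops the trailing newline written by saveJobId().
    return trim((string) file_get_contents($path));
}

// Example usage with a placeholder value (not a real JobId):
$jobIdPath = sys_get_temp_dir() . '/textract_jobid.txt';
saveJobId($jobIdPath, 'example-job-id');
echo "Saved JobId: " . loadJobId($jobIdPath) . "\n";
```

For more than a handful of documents, a database table keyed by file name would be a better home for these values than a single file.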

After we have our ‘JobId’ pasted into our next program, we can run that program to get the extracted text.

Now we know the general plan. Let’s do step 1: start the text extraction process. Just below, you see the first little program called textract_demo_StartDocumentTextDetection.php. To run it, do: php textract_demo_StartDocumentTextDetection.php. Before doing that, make sure you’ve:

  1. Set up your S3 bucket
  2. Uploaded your PDF file to the S3 bucket.
  3. Edited the code below to use your own profile and region, and made sure you have AWS credentials.
  4. Edited the code to refer to your own S3 bucket and file name and version.

textract_demo_StartDocumentTextDetection.php

<?php
/*
Copyright 2021 Marya Doery

MIT License https://opensource.org/licenses/MIT

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

/*
 * To run this project, make sure that the AWS PHP SDK has been unzipped in the current directory.
 * 
 * Caution: this is not production quality code. There are no tests, and there is no error handling.
 */
require './aws-autoloader.php';

use Aws\Credentials\CredentialProvider;
use Aws\Textract\TextractClient;

// CredentialProvider::ini() reads the named profile from your ~/.aws/credentials file.
// (CredentialProvider::env() would read environment variables instead.)
$provider = CredentialProvider::ini('TextractUser');
$client = new TextractClient([
	'region' => 'us-west-2',
	'version' => 'latest',
	'credentials' => $provider
]);

$bucket = 'my-textract-s3-bucket-us-west-2';
$keyname = 'my-special-file.pdf';
$version = 'qaEXAMPLEOH1REm3Dy.Ca9W4Gpqdj6Ro';

// 'Version' only applies if versioning is enabled on your bucket;
// otherwise, remove that line.
$startOptions = [
	'DocumentLocation' => [
		'S3Object' => [
			'Bucket' => $bucket,
			'Name' => $keyname,
			'Version' => $version,
		],
	],
];

$object = $client->StartDocumentTextDetection($startOptions);

echo "output:\n" . print_r($object, true) . "\n";

$jobId = $object->get('JobId');

echo "JobId:\n" . print_r($jobId, true) . "\n";

?>

After running this code, you should see output with the ‘JobId’.

Now edit the next little program so that this ‘JobId’ is used to call GetDocumentTextDetection, and run php textract_demo_GetDocumentTextDetection.php.

textract_demo_GetDocumentTextDetection.php

<?php
/*
Copyright 2021 Marya Doery

MIT License https://opensource.org/licenses/MIT

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

/*
 * To run this project, make sure that the AWS PHP SDK has been unzipped in the current directory.
 * 
 * Caution: this is not production quality code. There are no tests, and there is no error handling.
 */


// See https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentTextDetection.html
// https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentTextDetection.html
require './aws-autoloader.php';

use Aws\Credentials\CredentialProvider;
use Aws\Textract\TextractClient;

// CredentialProvider::ini() reads the named profile from your ~/.aws/credentials file.
// (CredentialProvider::env() would read environment variables instead.)
$provider = CredentialProvider::ini('TextractUser');
$client = new TextractClient([
	'region' => 'us-west-2',
	'version' => 'latest',
	'credentials' => $provider
]);

$bucket = 'my-textract-s3-bucket-us-west-2';
$keyname = 'my-special-file.pdf';
$version = 'qaEXAMPLEOH1REm3Dy.Ca9W4Gpqdj6Ro';

// Output jobId should contain 64 hex digits, something like this:
// $jobId = 'ad6f...5346';

// Just hard-code the jobId using the output from textract_demo_StartDocumentTextDetection.php.
// In a real application, you would store it somewhere: in a database, for example.
$jobId = 'ad6f...5346';

$getOptions = [
	'JobId' => $jobId
];
$getObject = $client->GetDocumentTextDetection($getOptions);
// For debugging:
// echo "getObject:\n" . print_r($getObject, true) . "\n";

$blocks = $getObject->get('Blocks');

$JobStatus = $getObject->get('JobStatus');

if ($JobStatus == 'SUCCEEDED') {
    processResult($blocks);
} else {
    // Any other status lands here, including IN_PROGRESS (the job has not finished yet).
    echo "Job failed with status " . $JobStatus . "\n";
}


// If debugging:
// echo print_r($blocks, true);
function processResult($blocks) {
	// Loop through all the blocks:
	foreach ($blocks as $key => $value) {
		if (isset($value['BlockType']) && $value['BlockType']) {
			// BlockType is WORD, LINE, or PAGE
			$blockType = $value['BlockType'];
			if (isset($value['Text']) && $value['Text']) {
				$text = $value['Text'];
				if ($blockType == 'WORD') {
					echo "Word: ". print_r($text, true) . "\n";
				} else if ($blockType == 'LINE') {
					echo "Line: ". print_r($text, true) . "\n";
				}
			}
		}
	}
}
?>

Run this code with php textract_demo_GetDocumentTextDetection.php. You should see output with words and lines extracted from the PDF file, like this:

php textract_demo_GetDocumentTextDetection.php
Line: Alice's Adventures in Wonderland
Line: by Lewis Carroll
Line: CHAPTER I.
Line: Down the Rabbit-Hole
Line: Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to
Line: do: once or twice she had peeped into the book her sister was reading, but it had no pictures or
Line: conversations in it, "and what is the use of a book," thought Alice "without pictures or
Line: conversations?"
...
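One thing to keep in mind for longer PDFs: GetDocumentTextDetection can return its results in pages, with a ‘NextToken’ in the response when more blocks remain. Here is a minimal sketch of how all the pages could be collected. The function name collectAllBlocks is made up, and it takes a callable so that it can be tried without AWS. The stand-in data at the bottom is invented for the demo.

```php
<?php
// Sketch only: gather 'Blocks' from every page of a paginated response.
// $fetchPage is any callable that takes a request array and returns a
// response (array or Aws\Result) with 'Blocks' and an optional 'NextToken'.
function collectAllBlocks(callable $fetchPage, string $jobId): array
{
    $blocks = [];
    $nextToken = null;
    do {
        $request = ['JobId' => $jobId];
        if ($nextToken !== null) {
            $request['NextToken'] = $nextToken;
        }
        $page = $fetchPage($request);
        foreach ($page['Blocks'] ?? [] as $block) {
            $blocks[] = $block;
        }
        // A missing or empty token means this was the last page.
        $nextToken = !empty($page['NextToken']) ? $page['NextToken'] : null;
    } while ($nextToken !== null);
    return $blocks;
}

// Stand-in for the real client, with invented data, so this runs offline.
$demoPages = [
    ''       => ['Blocks' => [['BlockType' => 'LINE', 'Text' => 'first page']], 'NextToken' => 'token-1'],
    'token-1' => ['Blocks' => [['BlockType' => 'LINE', 'Text' => 'second page']]],
];
$demoFetch = function (array $request) use ($demoPages) {
    return $demoPages[$request['NextToken'] ?? ''];
};
echo count(collectAllBlocks($demoFetch, 'example-job-id')) . " blocks collected\n";

// With the real client, the callable would look something like this:
// $blocks = collectAllBlocks(function (array $request) use ($client) {
//     return $client->getDocumentTextDetection($request);
// }, $jobId);
```

The resulting array can be handed to a function like processResult in the same way as a single page of blocks.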

If you don’t give the worker enough time, you may see output like this. Despite the wording, it means the job is still running, not that it failed:

Job failed with status IN_PROGRESS

If that happens, just wait a while and try running the program again.
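Instead of re-running the program by hand, the waiting can be automated with a simple polling loop. This is a minimal sketch under a few assumptions: the function name waitForJob is made up, the attempt count and delay are arbitrary defaults, and the status check is passed in as a callable so the loop can be tried without AWS.

```php
<?php
// Sketch only: call $fetchStatus until the job leaves IN_PROGRESS or we
// run out of attempts. Returns the last status seen.
function waitForJob(callable $fetchStatus, int $maxAttempts = 20, int $delaySeconds = 5): string
{
    $status = 'IN_PROGRESS';
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $status = (string) $fetchStatus();
        if ($status !== 'IN_PROGRESS') {
            return $status; // e.g. SUCCEEDED, FAILED, or PARTIAL_SUCCESS
        }
        if ($attempt < $maxAttempts && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }
    return $status;
}

// Stand-in for the real status check: reports IN_PROGRESS twice, then SUCCEEDED.
$demoCalls = 0;
$demoStatus = function () use (&$demoCalls) {
    $demoCalls++;
    return $demoCalls < 3 ? 'IN_PROGRESS' : 'SUCCEEDED';
};
echo "Final status: " . waitForJob($demoStatus, 5, 0) . "\n";

// With the real client, the callable would look something like this:
// $status = waitForJob(function () use ($client, $jobId) {
//     return $client->getDocumentTextDetection(['JobId' => $jobId])->get('JobStatus');
// });
```

If the loop returns IN_PROGRESS, the job simply took longer than the attempts allowed, and you can raise the limits or try again later.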

That’s it! Feel free to email me with any questions: fullstackdev@fullstackoasis.com.