How AWS Textract Helped Extract Text From Handwritten Images for COVID-19 Resources

In this blog post, I explain the challenges we faced while collecting information from various sources for the Mission Humane project, and how we implemented a solution with AWS Textract.


What is Mission Humane?


During the COVID-19 crisis, people suffering from the life-threatening virus faced great difficulty in getting life-saving resources such as oxygen concentrators, hospital beds, and ambulance services. At the same time, demand for these resources kept increasing, and people contributed to saving lives by sharing information about available resources.


Information about these life-saving resources mainly arrived as images of printed and handwritten text.


For more information please visit: https://app.missionhumane.org



Challenges Faced During Retrieval of Information:


Most of the information came from images shared on social networks. Some people forwarded existing images, whereas others tried to help by writing down phone numbers and available resources on paper and sending pictures of their notes.


The major challenge we faced was reading text from blurry images, and sometimes handwritten text, which varies from person to person.


During this crucial time, when authenticated information needed to be updated quickly, we used computer vision technology to read and extract text from printed and handwritten images, capture it in a structured way, and make it readily available to end users.


Some of the images we received are shown below.





What is AWS Textract?


AWS Textract is a service provided by Amazon that automatically extracts text from handwritten and scanned documents or images. In today's digitalized world, many companies face the challenge of extracting data from scanned documents, which may come in various formats such as PDFs, tables, and forms. The service automates manual data extraction, which can save hours of human effort, because it goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.


To use the service, follow the process below.


Prerequisites:

Before you can run the examples in this section, you have to configure your environment.

To configure your environment, create or update an IAM user with AmazonTextractFullAccess permissions:

Step 1: Set up an AWS account and create an IAM user.

Step 2: Assign the AmazonTextractFullAccess permission to this user.

Step 3: Set up the environment.

a) Install boto3 in your environment:

https://boto3.amazonaws.com/v1/documentation/api/latest/index.html

b) Configure your credentials. This example uses environment variables: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html

c) Make sure the image path is accessible to the program with read permissions.
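Before calling the service, it can help to confirm that steps 3.b and 3.c are in place. The following is a minimal sketch, assuming the environment-variable approach: the helper name `missing_aws_settings` is hypothetical, and the check only covers the three standard variables boto3 reads (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`). If you configure credentials another way, for example through a shared credentials file, this check does not apply.

```python
import os

# Standard environment variable names that boto3 reads for credentials and region.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_settings(environ=None, image_path=None):
    """Return a list of setup problems; an empty list means nothing was found.

    Hypothetical helper for illustration: it does not contact AWS, it only
    looks at environment variables and (optionally) read access to the image.
    """
    environ = os.environ if environ is None else environ
    problems = [
        f"environment variable {name} is not set"
        for name in REQUIRED_VARS
        if not environ.get(name)
    ]
    if image_path is not None and not os.access(image_path, os.R_OK):
        problems.append(f"cannot read image file: {image_path}")
    return problems


if __name__ == "__main__":
    for problem in missing_aws_settings():
        print("Setup problem:", problem)
```

Passing a plain dictionary as `environ` makes the check easy to try without touching your real environment.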


Implementing Textract Code:


# Install boto3 first; in a notebook: !pip install boto3

import boto3
import boto3.session

# Create a session to access AWS services.
# Set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION
# environment variables so the connection to AWS succeeds.
my_session = boto3.session.Session()
client = my_session.client('textract')

# Or pass the values directly when creating the "textract" client:
# client = boto3.client('textract', aws_access_key_id=" ", aws_secret_access_key=" ", region_name=" ")

# Input document
documentName = r"C:/Users//Desktop/example.jpg"

# Read the document content as bytes
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call Amazon Textract detect_document_text on the image bytes
response = client.detect_document_text(Document={'Bytes': imageBytes})

# Print the text of every block that Textract classified as a LINE
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print(item["Text"])
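Because blurry and handwritten images were our main difficulty, it may be useful to look at the `Confidence` score (0 to 100) that Textract attaches to each block and set aside low-confidence lines for manual review. The sketch below is not part of our original code: the function name `extract_lines`, the threshold of 80, and the sample response are assumptions made for illustration, and the sample is hand-written rather than real service output.

```python
def extract_lines(response, min_confidence=0.0):
    """Return (text, confidence) pairs for LINE blocks at or above min_confidence."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence >= min_confidence:
            lines.append((block.get("Text", ""), confidence))
    return lines


# Hand-written sample shaped like a detect_document_text response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Oxygen concentrator available", "Confidence": 97.2},
        {"BlockType": "WORD", "Text": "Oxygen", "Confidence": 98.0},
        {"BlockType": "LINE", "Text": "Ph 98xxxxxx10", "Confidence": 61.5},
    ]
}

# Keep only the lines Textract was reasonably sure about (threshold is an assumption).
for text, confidence in extract_lines(sample_response, min_confidence=80):
    print(f"{confidence:5.1f}  {text}")
```

With a real response, you would pass the `response` returned by `detect_document_text` instead of `sample_response`; the right threshold depends on your images.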




Analyzing Text in Structured Formats Such as Forms and Tables

Amazon Textract analysis operations return three categories of text extraction: text, forms, and tables. Amazon Textract can extract tables, table cells, and the items within table cells, and can be programmed to return the results in a JSON, .csv, or .txt file.

To specify which type of analysis to perform, you can use the FeatureTypes list input parameter. Add TABLES to the list to return information about the tables detected in the input document, for example table cells, cell text, and selection elements in cells. Add FORMS to return word relationships, such as key-value pairs and selection elements. To perform both types of analysis, add both TABLES and FORMS to FeatureTypes.


The example below shows how to extract table information:
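To make the table and .csv idea concrete, here is a rough sketch of turning table cells into CSV text. It assumes the blocks come from a table analysis response and belong to a single table, with each CELL block carrying `RowIndex`, `ColumnIndex`, and CHILD relationships to WORD blocks. The helper names `table_rows` and `rows_to_csv` and the sample blocks are hypothetical and hand-written for illustration.

```python
import csv
import io


def table_rows(blocks):
    """Rebuild one table as a list of rows from CELL and WORD blocks."""
    by_id = {block["Id"]: block for block in blocks}
    cells = {}
    for block in blocks:
        if block["BlockType"] != "CELL":
            continue
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words += [by_id[i]["Text"] for i in rel["Ids"]
                          if by_id[i]["BlockType"] == "WORD"]
        # RowIndex and ColumnIndex are 1-based positions of the cell in its table.
        cells[(block["RowIndex"], block["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


def rows_to_csv(rows):
    """Serialize rows to CSV text in memory."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hand-written sample blocks (not real service output).
sample_blocks = [
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},  # empty cell
    {"Id": "w1", "BlockType": "WORD", "Text": "Resource"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Phone"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Oxygen"},
    {"Id": "w4", "BlockType": "WORD", "Text": "concentrator"},
]

print(rows_to_csv(table_rows(sample_blocks)))
```

A document with several tables would need the cells grouped by their parent TABLE block first; this sketch skips that step.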
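As a small illustration of the FeatureTypes parameter, the sketch below builds the keyword arguments for an `analyze_document` call from two flags. The helper `build_analyze_request` is hypothetical, and it only covers the TABLES and FORMS values discussed here; it does not call AWS itself.

```python
# The two feature types described in this post.
FEATURES = ("TABLES", "FORMS")


def build_analyze_request(image_bytes, tables=False, forms=False):
    """Return keyword arguments suitable for client.analyze_document(**kwargs)."""
    selected = [name for name, wanted in zip(FEATURES, (tables, forms)) if wanted]
    if not selected:
        # analyze_document requires FeatureTypes, so fail early with a clear message.
        raise ValueError("select at least one of TABLES or FORMS")
    return {"Document": {"Bytes": image_bytes}, "FeatureTypes": selected}


# Placeholder bytes stand in for the content of a real image file.
request = build_analyze_request(b"<image bytes>", tables=True, forms=True)
print(request["FeatureTypes"])

# With a configured Textract client, the call would then be:
# response = client.analyze_document(**request)
```

Keeping the request construction separate from the network call makes it easy to check which analysis you are asking for before spending an API request.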



Implementing Textract Code:


# Install boto3 first; in a notebook: !pip install boto3

import boto3
import boto3.session

# Create a session to access AWS services.
# Set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION
# environment variables so the connection to AWS succeeds.
my_session = boto3.session.Session()
client = my_session.client('textract')

# Or pass the values directly when creating the "textract" client:
# client = boto3.client('textract', aws_access_key_id=" ", aws_secret_access_key=" ", region_name=" ")

# Input document
documentName = r"C:/Users//Desktop/example.jpg"

# Read the document content as bytes
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call the Amazon Textract analyze_document function and specify the feature types