Make Life Simple
Data digitalization is the process of converting data from analog to digital formats, and then making it more accessible and useful.
Data digitalization can have many benefits, including:
Improved decision making: Real-time data can be used to make informed decisions and improve operations.
Increased efficiency: Automating repetitive tasks can reduce costs and increase productivity.
Improved safety: Automated systems can be safer and more reliable than manual systems.
Ease of access: Digital information can be easily stored, accessed, and shared.
Easy data analysis: Digital information can be manipulated more easily than analog information.
Some problems do not require complex solutions, and simpler techniques can be employed to solve them. This blog talks about one such solution.
Even in the age of digitization, many industries still rely on physical documents. The banking and insurance industries still use hard copies of Know Your Customer (KYC) documents for verification. Although digital KYC is an option, in developing countries like India physical copies remain a convenient choice for customers. As a result, companies spend a lot of resources just to extract and validate the information in these KYC documents.
Many solutions are available for entity extraction from KYC documents. We developed one that needs minimal resources yet is highly effective.
Keywords
KYC, Regex, Pattern Matching, Aadhaar, PAN
Introduction
KYC documents usually contain fields such as Name, Address, Date of Birth, and ID number. We used regular expressions to extract these entities. Regular expressions work only when the document format is static, which makes KYC documents a good use case.
Document types considered for evaluation:
PAN, Aadhaar, Death Certificate
Approach
We used the following three ways to come up with rules:
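Approach 1: The text matched by a regular expression pattern is the required entity.
Approach 2: The text around a regular expression match (for example, the text that follows a label) is the required entity.
Approach 3: The text located relative to another, already extracted entity is the required entity.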
Grid Search
For each document type, we implemented a subset of the three approaches above.
Using the grid search technique, we then found the optimal sequence in which to apply them.
PAN
Let's take an example of entity extraction from PAN.
Entities to be extracted: PAN Number, Birth Date, First Name, Father's Name
PAN Number: After going through all 100 PAN documents in the dataset, we found the following approaches/patterns.
Approach 1:
Pattern 1A: ([a-z]{5}\d{4}[a-z]{1}) --Perfect OCR
Pattern 1B: ([a-z]{5}\d{1,4}[a-z]{1,3}) --Bad OCR
Approach 2:
Pattern 2A: (?:Number|Number Card)\s*([a-z]{5}\d{4}[a-z]{1}) --Perfect OCR
Pattern 2B: (?:Number|Number Card)\s*([a-zA-Z0-9 ]{8,13}) --Bad OCR
We have 4 patterns and run them sequentially: as soon as one pattern matches, we return the match as the extracted entity. The patterns should not be arranged in a random sequence, because that could place a pattern that is less likely to appear at the top of the sequence, which increases the chance of false positives. To find out which patterns are most likely to appear in PAN documents, we use the grid search technique (commonly used in ML to find optimal hyperparameters).
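The first-match-wins lookup described above can be sketched in Python roughly as follows. This is a minimal illustration, not the exact implementation: the function name extract_first_match and the sample OCR text are made up for the example, and re.IGNORECASE is an assumption (the patterns are written with [a-z], so case-insensitive matching or lower-cased input is presumed).

```python
import re

# PAN number patterns copied from the list above.
PAN_PATTERNS = {
    "1A": r"([a-z]{5}\d{4}[a-z]{1})",
    "1B": r"([a-z]{5}\d{1,4}[a-z]{1,3})",
    "2A": r"(?:Number|Number Card)\s*([a-z]{5}\d{4}[a-z]{1})",
    "2B": r"(?:Number|Number Card)\s*([a-zA-Z0-9 ]{8,13})",
}

def extract_first_match(text, sequence, patterns=PAN_PATTERNS):
    """Try the patterns in the given order and stop at the first match.

    Returns (pattern_name, extracted_value), or None if nothing matched.
    """
    for name in sequence:
        # Assumption: matching is case-insensitive.
        match = re.search(patterns[name], text, flags=re.IGNORECASE)
        if match:
            return name, match.group(1).strip()
    return None

# Hypothetical OCR output, for illustration only.
sample_text = "Permanent Account Number Card\nABCDE1234F\nName RAHUL KUMAR"
result = extract_first_match(sample_text, ["2A", "2B", "1A", "1B"])
print(result)
```

Because the function returns on the first hit, a loose pattern placed early can hide a stricter pattern that would have matched later, which is why the order of the sequence matters.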
As there are 4 patterns, the total number of possible sequences is 4! = 24. The search works as follows:
Calculate the accuracy of each of the 24 sequences
Return the sequence with the highest accuracy
Thus, we find the optimal sequence of patterns: Pattern 2A, Pattern 2B, Pattern 1A, Pattern 1B
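The search over all 24 orderings can be sketched as below. This is a hedged illustration under stated assumptions: the helper names (extract, accuracy, grid_search) are hypothetical, the two labelled documents are invented toy data rather than samples from our dataset, and case-insensitive matching is assumed.

```python
import re
from itertools import permutations

# PAN number patterns from the list above.
PATTERNS = {
    "1A": r"([a-z]{5}\d{4}[a-z]{1})",
    "1B": r"([a-z]{5}\d{1,4}[a-z]{1,3})",
    "2A": r"(?:Number|Number Card)\s*([a-z]{5}\d{4}[a-z]{1})",
    "2B": r"(?:Number|Number Card)\s*([a-zA-Z0-9 ]{8,13})",
}

def extract(text, sequence):
    """Return the first match found when trying patterns in `sequence` order."""
    for name in sequence:
        match = re.search(PATTERNS[name], text, flags=re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return None

def accuracy(sequence, labelled_docs):
    """Fraction of documents whose extracted value equals the expected one."""
    correct = sum(extract(text, sequence) == expected
                  for text, expected in labelled_docs)
    return correct / len(labelled_docs)

def grid_search(labelled_docs):
    """Score every ordering of the patterns; return (best_sequence, best_accuracy)."""
    scored = [(seq, accuracy(seq, labelled_docs))
              for seq in permutations(PATTERNS)]
    return max(scored, key=lambda item: item[1])

# Invented toy data: (OCR text, expected PAN number).
toy_docs = [
    ("Permanent Account Number\nABCDE1234F", "ABCDE1234F"),
    ("Ref abcde12xy\nPermanent Account Number\nABCDE1234F", "ABCDE1234F"),
]
best_sequence, best_accuracy = grid_search(toy_docs)
print(best_sequence, best_accuracy)
```

On such a tiny toy set many orderings tie, so the winning sequence here is not meaningful; the point is only that orderings which try the loose Pattern 1B first pick up the stray reference string and score lower. A real search needs a labelled dataset like the 100 PAN images used here.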
Birth Date
Approach 1:
Pattern 1A: ([0-9]{2}(\/|-)[0-9]{2}(\/|-)[0-9]{4})
Pattern 1B: ([0-9-\/]+)
By using grid search, the optimal sequence of patterns is: Pattern 1A, Pattern 1B
Father's Name
Approach 2:
Pattern 2A: (?:Father\'s|Father)\s(?:Name)\s([a-z]*\s*[a-z]*\s*[a-z]*)
Approach 3: The father's name is located above the birth date.
Find the birth date using the method above.
The text above the birth date is the father's name.
By using grid search, the optimal sequence is: Pattern 2A, Approach 3
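A rough sketch of this two-step lookup (Pattern 2A first, then Approach 3 as a fallback) is shown below. It rests on several assumptions: the OCR text arrives as newline-separated lines in reading order, "the text above" is taken to mean the previous non-empty line, matching is case-insensitive, and the function name and sample texts are hypothetical.

```python
import re

# Pattern 2A for the father's name and Pattern 1A for the birth date,
# both taken from the lists above.
FATHER_PATTERN = r"(?:Father\'s|Father)\s(?:Name)\s([a-z]*\s*[a-z]*\s*[a-z]*)"
DATE_PATTERN = r"([0-9]{2}(\/|-)[0-9]{2}(\/|-)[0-9]{4})"

def extract_fathers_name(ocr_text):
    """Try the labelled pattern first, then fall back to the line above the date."""
    # Approach 2: text that follows the "Father's Name" label.
    match = re.search(FATHER_PATTERN, ocr_text, flags=re.IGNORECASE)
    if match and match.group(1).strip():
        return match.group(1).strip()

    # Approach 3: locate the birth date, then take the previous non-empty line.
    lines = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        if index > 0 and re.search(DATE_PATTERN, line):
            return lines[index - 1]
    return None

# Hypothetical OCR outputs, for illustration only.
labelled_card = "Father's Name SURESH KUMAR\n15/08/1990"
unlabelled_card = (
    "INCOME TAX DEPARTMENT\nRAHUL KUMAR\nSURESH KUMAR\n"
    "15/08/1990\nPermanent Account Number\nABCDE1234F"
)
print(extract_fathers_name(labelled_card))
print(extract_fathers_name(unlabelled_card))
```

Note that \s* in Pattern 2A also matches line breaks, so on some layouts the capture can spill into the first word of the next line; the samples above avoid that case.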
First Name
Approach 2:
Pattern 2A: (?:Name)\s*([a-z]*\s*[a-z]*\s*[a-z]*)
Approach 3: The name is located above the father's name.
Find the father's name using the method above.
The text above the father's name is the name.
By using grid search, the optimal sequence is: Pattern 2A, Approach 3
Dataset
100 images of each document type
Experiments
We tried several OCR providers and compared their results. Interestingly, different OCR engines can produce completely different output for the same image.
We found that Google Vision gave better OCR results than the others in two scenarios that are common in Indian KYC documents: identification of regional languages and vertical paragraph separation.
We tried different image cleaning techniques, but the image cleaning pipeline could not be generalized across all image qualities. So we only used an image enhancement process (increasing the DP ratio).
Tools/Packages
're': the Python library for regular expressions
Cloud OCR services such as AWS Textract, Azure Vision, and Google Vision
Performance measures
Coverage = (Correct + Incorrect)/Total Expected
Accuracy = Correct/Total Expected
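The two measures can be computed as in the sketch below. It assumes extracted and expected values are held in dictionaries keyed by field name, that a field which was not extracted is missing or None, and that results are reported as percentages; the function name and sample values are hypothetical.

```python
def coverage_and_accuracy(predicted, expected):
    """Return (coverage %, accuracy %) for one set of expected fields.

    Coverage counts every field for which something was extracted
    (correct or incorrect); accuracy counts only exact matches.
    """
    total_expected = len(expected)
    extracted = sum(1 for field in expected
                    if predicted.get(field) is not None)
    correct = sum(1 for field, value in expected.items()
                  if predicted.get(field) == value)
    coverage = 100 * extracted / total_expected
    accuracy = 100 * correct / total_expected
    return coverage, accuracy

# Made-up example: 4 expected fields, 3 extracted, 2 of them correct.
expected_fields = {
    "pan_number": "ABCDE1234F",
    "birth_date": "15/08/1990",
    "first_name": "RAHUL KUMAR",
    "fathers_name": "SURESH KUMAR",
}
predicted_fields = {
    "pan_number": "ABCDE1234F",
    "birth_date": "15/08/199",      # extracted but incorrect
    "first_name": "RAHUL KUMAR",
    "fathers_name": None,           # not extracted
}
print(coverage_and_accuracy(predicted_fields, expected_fields))
```

Accuracy can never exceed coverage, since every correct value is also an extracted value.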
PAN
OCR provider     Coverage (%)   Accuracy (%)
AWS Textract     96             81
Azure Vision     96             93
Google Vision    98             94
Aadhaar
OCR provider     Coverage (%)   Accuracy (%)
AWS Textract     75             72
Azure Vision     93             90
Google Vision    94             91
Death Certificate
OCR provider     Coverage (%)   Accuracy (%)
AWS Textract     70             65
Azure Vision     82             80
Google Vision    85             75
Use cases
Banking, Insurance, or similar industries where KYC documents are processed.
A similar approach can be applied wherever the document structure is static and does not contain complicated tables or forms, e.g. passports.
Challenges
Regular expressions depend on the OCR text, and the OCR output for the same file is not consistent across OCR engines. Hence, the same regular expressions cannot be reused across different OCR engines.
If a document type has many template variations, regular expressions fail on unseen documents. For example, the death certificate template changes with every hospital, so regular expressions showed noticeably lower accuracy on death certificates in our results.
Further work
Applying generic image cleaning processes should improve OCR quality and thus help improve accuracy.
A deep-learning-based solution using a Graph Convolutional Neural Network could be used, given a large dataset of KYC documents.
I hope you liked this post. Stay tuned for the next one, keep reading, and stay healthy :)