
Data Digitization

Neha Jain

Make life Simple,


Data digitization is the process of converting data from analog to digital formats and then making it more accessible and useful.


Data digitization can have many benefits, including:

  • Improved decision making: Real-time data can be used to make informed decisions and improve operations. 

  • Increased efficiency: Automating repetitive tasks can reduce costs and increase productivity. 

  • Improved safety: Automated systems can be safer and more reliable than manual systems. 

  • Ease of access: Digital information can be easily stored, accessed, and shared. 

  • Easy data analysis: Digital information can be manipulated more easily than analog information. 


Not every problem requires a complex solution; some can be solved with simpler techniques. This blog describes one such solution.


Even in the age of digitization, many industries still rely on physical documents. The banking and insurance industries still use hard copies of KYC (Know Your Customer) documents for verification. Although digital KYC is an option, in developing countries like India physical copies remain more convenient for many customers. Companies have to spend a lot of resources just to extract and validate information from these KYC documents.

Many solutions are available for entity extraction from KYC documents. We developed one that needs minimal resources yet is highly effective.


Keywords  

KYC, Regex, Pattern Matching, Aadhaar, PAN 


Introduction 

KYC documents usually contain fields like Name, Address, Date of Birth, ID number, etc. We used regular expressions to extract these entities. Regular expressions work only when the document format is static, which makes KYC documents a good use case.

Types of documents considered for evaluation:

PAN, Aadhaar, Death Certificate


Approach 

We came up with rules in the following three ways:

Approach 1: The regular expression pattern itself matches the required entity.

Approach 2: The required entity is matched using the text around it (for example, a label such as "Number" that precedes it).

Approach 3: The required entity is located relative to another entity.

Grid Search

For each document type, we implemented a subset of these three approaches.

We then used the grid search technique to find the optimal sequence in which to apply them.

PAN 

Let's take an example of entity extraction from PAN. 

Entities to be extracted: PAN Number, Birth Date, First Name, Father's Name

PAN Number - After going through all 100 PAN documents in the dataset, we found the following approaches/patterns.

Approach 1:  

Pattern 1A: ([a-z]{5}\d{4}[a-z]{1})        --Perfect OCR

Pattern 1B: ([a-z]{5}\d{1,4}[a-z]{1,3})  --Bad OCR

Approach 2: 

Pattern 2A: (?:Number|Number Card)\s*([a-z]{5}\d{4}[a-z]{1}) --Perfect OCR

Pattern 2B: (?:Number|Number Card)\s*([a-zA-Z0-9 ]{8,13}) --Bad OCR

We have 4 patterns and run them sequentially; as soon as one pattern matches, we return the match as the extracted entity. The patterns should not be placed in a random sequence, because a pattern that is less likely to appear could end up at the top of the sequence, which increases the chance of false positives. To find the sequence that works best for PAN documents, we use the grid search technique (commonly used in ML to find optimal hyperparameters).
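The sequential matching described above can be sketched as follows. This is only an illustration, not the original implementation: the function name, the pattern labels used as dictionary keys, and the sample OCR text (including the made-up PAN value) are hypothetical. It also assumes the patterns are applied with re.IGNORECASE, since they are written with [a-z] while PAN numbers are printed in upper case.

```python
import re

# The four PAN patterns listed above, keyed by their labels.
PAN_PATTERNS = {
    "2A": r"(?:Number|Number Card)\s*([a-z]{5}\d{4}[a-z]{1})",
    "2B": r"(?:Number|Number Card)\s*([a-zA-Z0-9 ]{8,13})",
    "1A": r"([a-z]{5}\d{4}[a-z]{1})",
    "1B": r"([a-z]{5}\d{1,4}[a-z]{1,3})",
}


def extract_first_match(text, order, patterns=PAN_PATTERNS):
    """Try the patterns in the given order and stop at the first one that matches.

    Returns (label, value), or (None, None) when no pattern matches.
    """
    for label in order:
        # Assumption: case-insensitive matching, because OCR text is upper case.
        match = re.search(patterns[label], text, flags=re.IGNORECASE)
        if match:
            return label, match.group(1).strip()
    return None, None


# Made-up OCR output for illustration only.
ocr_text = "Permanent Account Number Card\nABCDE1234F\nName JOHN DOE"
print(extract_first_match(ocr_text, ["2A", "2B", "1A", "1B"]))
```

Because the function returns on the first hit, a loose pattern placed early can shadow a stricter one, which is why the order matters.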

As there are 4 patterns, the total number of possible sequences is 4! = 24. For each of the 24 sequences:

Calculate accuracy

Then return the sequence with the highest accuracy.

Thus, we find the optimal sequence of patterns - Pattern 2A, Pattern 2B, Pattern 1A, Pattern 1B
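A search over all 24 sequences could look like the sketch below. It is a minimal illustration under stated assumptions rather than the code used for the reported results: the helper names are hypothetical, the labelled samples are made up, and re.IGNORECASE is assumed. On such a tiny sample many sequences tie, so the winner here says nothing about the real dataset.

```python
import re
from itertools import permutations

# The four PAN patterns listed above, keyed by their labels.
PATTERNS = {
    "2A": r"(?:Number|Number Card)\s*([a-z]{5}\d{4}[a-z]{1})",
    "2B": r"(?:Number|Number Card)\s*([a-zA-Z0-9 ]{8,13})",
    "1A": r"([a-z]{5}\d{4}[a-z]{1})",
    "1B": r"([a-z]{5}\d{1,4}[a-z]{1,3})",
}


def extract(text, order):
    """Return the first match found when trying patterns in the given order."""
    for label in order:
        match = re.search(PATTERNS[label], text, flags=re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return None


def accuracy(order, samples):
    """Accuracy = Correct / Total Expected for one sequence of patterns."""
    correct = sum(1 for text, expected in samples if extract(text, order) == expected)
    return correct / len(samples)


def grid_search(samples):
    """Evaluate every permutation of the patterns and keep the most accurate one."""
    return max(permutations(PATTERNS), key=lambda order: accuracy(order, samples))


# Hypothetical labelled data: (OCR text, expected PAN number).
samples = [
    ("Permanent Account Number Card\nABCDE1234F", "ABCDE1234F"),
    ("Permanent Account Number\nPQRST5678K\nSignature", "PQRST5678K"),
    ("INCOME TAX DEPARTMENT\nLMNOP4321Z", "LMNOP4321Z"),
]

best = grid_search(samples)
print(best, accuracy(best, samples))
```

Exhaustive search is cheap here because the number of patterns per entity is small; the number of sequences grows factorially, so this would not scale to long pattern lists.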

Birth Date 

Approach 1:  

Pattern 1A: ([0-9]{2}(\/|-)[0-9]{2}(\/|-)[0-9]{4})          

Pattern 1B: ([0-9-\/]+)    

By using grid search, we found the optimal sequence of patterns - Pattern 1A, Pattern 1B
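As a rough sketch, the two date patterns can be applied in that order like this. The function name and the sample strings are hypothetical. Note that Pattern 1B matches any run of digits, hyphens, and slashes, so it only makes sense as a last-resort fallback.

```python
import re

DATE_STRICT = r"([0-9]{2}(\/|-)[0-9]{2}(\/|-)[0-9]{4})"  # Pattern 1A
DATE_LOOSE = r"([0-9-\/]+)"                               # Pattern 1B


def extract_birth_date(text):
    """Try the strict date pattern first, then fall back to the loose one."""
    for pattern in (DATE_STRICT, DATE_LOOSE):
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None


# Made-up OCR snippets for illustration only.
print(extract_birth_date("Date of Birth\n15/08/1990"))  # well-formed date, Pattern 1A
print(extract_birth_date("DOB 15-08-90"))               # damaged year, Pattern 1B
```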


Father's Name

Approach 2:  

Pattern 2A: (?:Father\'s|Father)\s(?:Name)\s([a-z]*\s*[a-z]*\s*[a-z]*) 

Approach 3: The father's name is located above the birth date.

 Find the birth date using the method above.

 The text above the birth date is the father's name.

By using grid search, we found the optimal sequence - Pattern 2A, Approach 3
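The fallback from Pattern 2A to Approach 3 might be sketched as below. Several things here are assumptions for illustration: "above" is approximated as the previous non-empty line of the OCR text (a real OCR response also carries bounding boxes that could be used instead), matching is case-insensitive, and the function name and sample texts are made up.

```python
import re

# Pattern 2A for the father's name, as listed above.
FATHER_PATTERN = r"(?:Father\'s|Father)\s(?:Name)\s([a-z]*\s*[a-z]*\s*[a-z]*)"
# Strict birth date pattern (Pattern 1A), without capture groups.
DATE_PATTERN = r"[0-9]{2}(?:\/|-)[0-9]{2}(?:\/|-)[0-9]{4}"


def extract_fathers_name(text):
    """Pattern 2A first; otherwise take the line just above the birth date."""
    match = re.search(FATHER_PATTERN, text, flags=re.IGNORECASE)
    if match and match.group(1).strip():
        return match.group(1).strip()

    # Approach 3 (assumption: "above" means the previous non-empty OCR line).
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        if index > 0 and re.search(DATE_PATTERN, line):
            return lines[index - 1]
    return None


# Made-up OCR snippets: one with a label, one without.
print(extract_fathers_name("Father's Name SURESH KUMAR\n01/01/1990"))
print(extract_fathers_name("INCOME TAX DEPARTMENT\nRAVI KUMAR\nSURESH KUMAR\n01/01/1990"))
```

The same relative lookup, anchored on the father's name instead of the birth date, would cover the first name.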


First Name 

 Approach 2:

Pattern 2A: (?:Name)\s*([a-z]*\s*[a-z]*\s*[a-z]*) 

Approach 3: The name is located above the father's name.

 Find the father's name using the method above.

 The text above the father's name is the name.

 By using grid search, we found the optimal sequence - Pattern 2A, Approach 3


Dataset 

100 images of each document type 


Experiments  

  1. We tried different OCR providers and got the results reported under Performance measures. Interestingly, for the same image, different OCR engines can produce completely different results.

  2. We found that Google Vision gives the best OCR results compared to the others in two scenarios that are common in Indian KYC documents: identifying regional languages and separating vertical paragraphs.

  3. We tried different image cleaning techniques, but an image cleaning pipeline can't be generalized across all image qualities. So we only used an image enhancement process (increasing the DP ratio).


Tools/Packages

re - Python's standard library module for regular expressions

Cloud OCR services such as AWS Textract, Azure Vision, and Google Vision


Performance measures

Coverage = (Correct + Incorrect)/Total Expected 

Accuracy = Correct/Total Expected 
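The two measures can be computed as in the sketch below. It assumes that an entity the rules failed to extract is recorded as None and that "Total Expected" is the number of expected entity values; the function name and sample values are hypothetical.

```python
def coverage_and_accuracy(predictions, expected):
    """Compute (coverage, accuracy) for one entity across a set of documents.

    predictions: extracted values, with None where nothing was extracted.
    expected: ground-truth values, one per document.
    """
    total = len(expected)
    # Correct + Incorrect: every document where something was extracted.
    attempted = sum(1 for value in predictions if value is not None)
    correct = sum(
        1 for value, truth in zip(predictions, expected)
        if value is not None and value == truth
    )
    return attempted / total, correct / total


# Made-up example: 4 documents, 3 extractions, 2 of them correct.
coverage, accuracy = coverage_and_accuracy(["A", "X", None, "D"], ["A", "B", "C", "D"])
print(coverage, accuracy)  # 0.75 0.5
```

Coverage minus accuracy is the share of wrong extractions, so a large gap between the two points to false positives rather than misses.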

PAN

OCR engine       Coverage (%)   Accuracy (%)
AWS Textract     96             81
Azure Vision     96             93
Google Vision    98             94


Aadhaar

OCR engine       Coverage (%)   Accuracy (%)
AWS Textract     75             72
Azure Vision     93             90
Google Vision    94             91


Death Certificate

OCR engine       Coverage (%)   Accuracy (%)
AWS Textract     70             65
Azure Vision     82             80
Google Vision    85             75


Use cases 

Banking, Insurance, or similar industries where KYC documents are processed. 

A similar approach can be applied wherever the document structure is static and does not contain complicated tables or forms, e.g. passports.


Challenges  

  1. Regular expressions depend on the OCR text, and the OCR output for the same file is not consistent across OCR engines; hence the same regular expressions can't be reused across different OCR engines.

  2. If a document type has many template variations, regular expressions fail on unseen documents. For example, the death certificate template changes with every hospital, which is why regular expressions showed lower accuracy on it in our results.


Further work

Developing a generic image cleaning process would improve OCR quality and thus help improve accuracy.

A deep learning-based solution using a Graph Convolutional Neural Network could be used, given a large dataset of KYC documents.


I hope you liked my post. Stay tuned for the next one ... keep reading and stay healthy :)

