
Create Ocrized PDFs In 2 Steps


In this article, we demonstrate how to create ocrized (searchable) PDFs from images and scanned PDFs so that you can run word searches on them. Below is the breakdown of the two-step process that we use at Mindee to accomplish this task.

  1. First, we use an open-source tool called Mindee docTR to perform OCR (Optical Character Recognition) on the image or scanned PDF. The docTR OCR results are then exported as an XML file in hOCR format.
  2. Then, we convert the hOCR file to a PDF using another open-source tool, OCRmyPDF.

Why would you want to ocrize your PDF?

We ocrize images and scanned documents or PDFs so that we can search for keywords or phrases within them. A few lines of code are all that is needed. With the approach we present, we can also exhaustively ocrize text embedded within images, which is normally left out (logos, watermarks, etc.).

To better understand why we need to ocrize documents, let’s take a look at two use cases, which involve searching through a huge PDF and searching through a folder full of PDFs.

Below is a non-exhaustive list of documents that fall under these two use cases.

Searching through a huge PDF, such as:

  • contracts (terms and conditions, loan contracts, employment contracts, etc.)
  • specifications
  • scientific and technical reports
  • insurance policies
  • requests for information/requests for quotation/requests for proposals

Searching through a folder full of PDFs:

  • resumes (find a specific skill)
  • questionnaires/forms (find a specific answer)
  • invoices/receipts/quotations (find a specific item/customer/supplier)
  • presentations (find any keyword)
  • old scanned news articles (find specific news)

Creating an ocrized PDF makes it easier for non-developers and developers alike to search for a specific keyword in the various use cases listed above while using their favourite PDF reader. 

Why use docTR for ocrizing PDFs?

Quick catch-up with docTR

docTR is one of the best open-source OCR solutions available. It uses state-of-the-art detection and recognition models to process documents for Natural Language Understanding tasks. After installing the package, three lines of code are enough to load a document and extract its text with a predictor:

pip install python-doctr

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Analyze
result = model(doc)
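
Once the predictor has run, we usually want to look at what it found. As a minimal sketch, assuming that `result.export()` returns a nested dictionary of pages, blocks, lines and words (each word carrying a `value` and a `confidence`), the helper below walks such a dictionary. The `iter_words` helper and the `sample_export` dictionary are hypothetical stand-ins written for this illustration; the real export contains more keys, so check the docTR documentation for the exact layout.

```python
def iter_words(exported):
    """Yield (text, confidence) for every word in a docTR-style export dict."""
    for page in exported.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    yield word["value"], word["confidence"]


# Hand-written stand-in for `result.export()` (assumed shape, fewer keys than the real one).
sample_export = {
    "pages": [
        {
            "page_idx": 0,
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Publication", "confidence": 0.98},
                                {"value": "Year", "confidence": 0.61},
                            ]
                        }
                    ]
                }
            ],
        }
    ]
}

words = list(iter_words(sample_export))
# Words below an arbitrary threshold are the ones worth double-checking by eye.
low_confidence = [text for text, confidence in words if confidence < 0.8]
print(words)
print(low_confidence)
```

With a real result, `iter_words(result.export())` would replace the sample dictionary.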

docTR offers pretrained models for both detection and recognition, such as the db_resnet50_rotation detection model. For more information on the available architectures, please refer to the docTR documentation. Another major perk of docTR over other open-source packages is that its models can be trained with small rotations, which makes it more robust for ocrization tasks. The list of supported vocabs can also be found in the docTR documentation.

docTR Performance

Using example datasets, the table below compares docTR against some alternative OCR solutions.

Note: The dataset used for the comparison could not be made public due to the sensitive information included in it.

Dataset                                 Receipts            Invoices            IDs
Architecture                            Recall  Precision   Recall  Precision   Recall  Precision
(docTR) db_resnet50 + master            79      81.42       65.57   69.86       51.34   52.9
(docTR) db_resnet50 + sar_resnet31      78.94   81.37       65.89   70.79       51.78   53.35
AWS textract                            75.77   77.7        70.47   69.13       46.39   43.32
Gvision doc. text detection             68.91   59.89       63.2    52.85       43.7    29.21

We have also included some comparisons of public datasets FUNSD and CORD below.

Dataset                                 FUNSD               CORD
Architecture                            Recall  Precision   Recall  Precision
(docTR) db_resnet50 + master            71.03   76.06       84.49   81.94
(docTR) db_resnet50 + sar_resnet31      71.25   76.29       84.5    81.96
AWS textract                            78.1    83          87.5    66
Gvision doc. text detection             64      53.3        68.9    61.1

The above OCR models have been evaluated using both the training and evaluation sets of FUNSD and CORD. For further information regarding the metrics being used, see Task evaluation.
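
To make the two metrics concrete: recall is the share of ground-truth words that the OCR found, and precision is the share of predicted words that are correct. The sketch below is a simplified illustration and not docTR's evaluation code. It assumes we already know how many predicted words were matched to a ground-truth word (how a match is decided is left out here), and the function name and sample counts are made up for the example.

```python
def recall_precision(num_matched, num_ground_truth, num_predicted):
    """Return (recall, precision) as fractions between 0 and 1."""
    # Guard against empty inputs to avoid dividing by zero.
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    precision = num_matched / num_predicted if num_predicted else 0.0
    return recall, precision


# Hypothetical counts: 100 words on the page, 90 words predicted, 80 of them correct.
recall, precision = recall_precision(
    num_matched=80, num_ground_truth=100, num_predicted=90
)
print(f"recall={recall:.2%} precision={precision:.2%}")
```

The tables above appear to report both values as percentages, so the fractions would be multiplied by 100 for comparison.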

Jumping into the code!

To create lightweight ocrized PDFs using docTR and OCRmyPDF, we will start by installing Mindee docTR and OCRmyPDF.

# installing requirements
!pip install "python-doctr[tf]"
!pip install ocrmypdf

You can use any image or scanned PDF of your choice, but for the sake of this tutorial, we are going to use a sample image.

To download our sample image, you can run the following code:

# sample input image (create the target folder first, otherwise wget cannot write the file)
!mkdir -p ./data/images
!wget https://pbs.twimg.com/media/B_UpX3WU8AA2j3r.jpg -O ./data/images/image.jpg

Alternatively, you can download and save the image on your computer.  

As mentioned earlier, we break the process into two steps:

  1. Define the output folders for the output PDF and the hOCR data related to the docTR results.
import os
# define output folder
output_folder = "./output/"
output_hocr_folder = output_folder + "hocr/"
output_pdf_folder = output_folder + "pdf/"

os.makedirs(output_hocr_folder, exist_ok=True)
os.makedirs(output_pdf_folder, exist_ok=True)

Then load the image. 

Note: if you are using a scanned PDF, you’ll need to use the DocumentFile.from_pdf method instead and run an OCR with docTR.

from doctr.models import ocr_predictor
from doctr.io import DocumentFile

# load image
image_path = "./data/images/image.jpg"

# extracting text from input image using docTR
docs = DocumentFile.from_images(image_path)

# load model
model = ocr_predictor(
            det_arch='db_resnet50',
            reco_arch='crnn_vgg16_bn',
            pretrained=True
)

result = model(docs)

# display ocr boxes
result.show(docs)

Below we can see the docTR result which shows the detected and highlighted text in the image.
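
For scanned PDFs, as noted earlier, only the loading call changes. Here is a small sketch of how the choice between `DocumentFile.from_images` and `DocumentFile.from_pdf` could be made from the file extension. The helper names `loader_name` and `load_document` and the list of image suffixes are our own assumptions for illustration, and docTR is imported lazily so that the dispatch logic can be read on its own. Depending on the docTR release you have installed, the object returned by `from_pdf` may need an extra conversion step, so check the documentation for your version.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # illustrative, not exhaustive


def loader_name(path):
    """Return the name of the DocumentFile constructor suited to `path`."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "from_pdf"
    if suffix in IMAGE_SUFFIXES:
        return "from_images"
    raise ValueError(f"Unsupported input type: {suffix!r}")


def load_document(path):
    """Load an image or a scanned PDF with docTR (imported only when called)."""
    from doctr.io import DocumentFile

    return getattr(DocumentFile, loader_name(path))(path)


print(loader_name("./data/images/image.jpg"))
print(loader_name("scan.PDF"))
```

The rest of the pipeline stays the same: `model(load_document(path))` returns the OCR result.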

  2. For the next step, export the docTR OCR results as an XML file in hOCR format.
# export xml file
# export_as_xml() returns one (xml_bytes, xml_tree) pair per page; we keep the bytes of the first page
xml_outputs = result.export_as_xml()
with open(os.path.join(output_hocr_folder, "doctr_image_hocr.xml"), "w", encoding="utf-8") as f:
    f.write(xml_outputs[0][0].decode())
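
It can help to peek inside an hOCR file before converting it. hOCR is XHTML in which each recognized word is an element of class `ocrx_word` whose `title` attribute holds its bounding box. The sketch below parses a small hand-written fragment with the standard library; the fragment, the `hocr_words` helper and the coordinates are illustrative assumptions, and a real docTR export is longer and may carry additional attributes.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written hOCR fragment for illustration only.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" title="bbox 0 0 600 800">
      <span class="ocr_line" title="bbox 50 40 300 70">
        <span class="ocrx_word" title="bbox 50 40 180 70; x_wconf 98">Publication</span>
        <span class="ocrx_word" title="bbox 190 40 300 70; x_wconf 61">Year</span>
      </span>
    </div>
  </body>
</html>"""

BBOX_PATTERN = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_words(hocr_text):
    """Return [(text, (x0, y0, x1, y1)), ...] for every ocrx_word element."""
    root = ET.fromstring(hocr_text)
    found = []
    for element in root.iter():
        if element.get("class") != "ocrx_word":
            continue
        match = BBOX_PATTERN.search(element.get("title", ""))
        if match:
            bbox = tuple(int(value) for value in match.groups())
            found.append(((element.text or "").strip(), bbox))
    return found


hocr_word_list = hocr_words(SAMPLE_HOCR)
print(hocr_word_list)
```

For the file written in the step above, `ET.parse(path).getroot()` could replace `ET.fromstring` inside the helper.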

After exporting the hOCR result of docTR as XML, we can use OCRmyPDF to convert it to an ocrized PDF.

from ocrmypdf.hocrtransform import HocrTransform

# hOCR file written in the previous step
hocr_path = os.path.join(output_hocr_folder, "doctr_image_hocr.xml")
output_pdf_path = output_pdf_folder + "hocr_output.pdf"

hocr = HocrTransform(
    hocr_filename=hocr_path,
    dpi=300
)

# step to obtain ocrized pdf
hocr.to_pdf(
    out_filename=output_pdf_path,
    image_filename=image_path,
    show_bounding_boxes=False,
    interword_spaces=False,
)

Voilà! We have now created the ocrized PDF.
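
To ocrize a whole folder of images, the same two steps can be wrapped in a function. The following is a sketch under stated assumptions rather than a reference implementation: `output_paths` and `ocrize_image` are hypothetical helper names, the folder layout mirrors the one used in this tutorial, and the docTR and OCRmyPDF calls are the ones shown above, with keyword arguments that may differ between library versions.

```python
from pathlib import Path


def output_paths(image_path, output_folder="./output"):
    """Map an input image to its (hOCR path, PDF path) under `output_folder`."""
    stem = Path(image_path).stem
    root = Path(output_folder)
    return root / "hocr" / f"{stem}.xml", root / "pdf" / f"{stem}.pdf"


def ocrize_image(image_path, model, output_folder="./output", dpi=300):
    """Run both steps for one image; `model` is a docTR ocr_predictor."""
    # Imported lazily so the path helper above works without these packages.
    from doctr.io import DocumentFile
    from ocrmypdf.hocrtransform import HocrTransform

    hocr_path, pdf_path = output_paths(image_path, output_folder)
    hocr_path.parent.mkdir(parents=True, exist_ok=True)
    pdf_path.parent.mkdir(parents=True, exist_ok=True)

    # Step 1: OCR with docTR, then save the first page's hOCR bytes.
    result = model(DocumentFile.from_images(str(image_path)))
    hocr_path.write_bytes(result.export_as_xml()[0][0])

    # Step 2: combine the hOCR text layer and the image into a PDF.
    HocrTransform(hocr_filename=str(hocr_path), dpi=dpi).to_pdf(
        out_filename=str(pdf_path),
        image_filename=str(image_path),
    )
    return pdf_path


hocr_path, pdf_path = output_paths("./data/images/image.jpg")
print(hocr_path.as_posix(), pdf_path.as_posix())
```

A loop such as `for path in sorted(Path("./data/images").glob("*.jpg")): ocrize_image(path, model)` would then process every JPEG in the folder while reusing one loaded model.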

How to search a folder with multiple ocrized PDFs

On Ubuntu, for example, you can use the pdfgrep command in a terminal to search a folder full of digital (natively searchable) or ocrized PDFs.

To do this, let’s first install pdfgrep:

# first let's install pdfgrep
sudo apt-get update
sudo apt-get install pdfgrep

Now we can use pdfgrep to search for any information using a keyword. We can do simple searches with an exact match or use a regex for more flexibility. Let’s look at some examples:

Below, we want to look for year-specific information using the keyword “Year.”

pdfgrep -r "Year"

./hocr_output.pdf: APPLICANTS   ForPublication Year2015-2016

We can also search for a specific time span, say from 2010 to 2019, using a simple regex.

pdfgrep -r -P "\b201[0-9]\b"

./hocr_output.pdf: APPLICANTS   ForPublication Year2015-2016
./hocr_output.pdf:andsubmittotheVarsitarianofficeonorbeforeMARCH:27,2015.

As these examples show, it is easy to search a whole folder of ocrized PDFs with just a few commands.
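
One detail of the regex is worth a closer look: `\b` only matches at a word boundary. The sketch below replays the same pattern with Python's `re` module on the two lines that pdfgrep returned above. We assume here that `\b` behaves the same way in both engines for this simple ASCII pattern.

```python
import re

YEAR_PATTERN = re.compile(r"\b201[0-9]\b")

# The two lines returned by pdfgrep in the example above.
lines = [
    "APPLICANTS   ForPublication Year2015-2016",
    "andsubmittotheVarsitarianofficeonorbeforeMARCH:27,2015.",
]

matches = [YEAR_PATTERN.findall(line) for line in lines]
print(matches)  # [['2016'], ['2015']]
```

The first line is matched through "2016" only: "2015" is glued to "Year" in the extracted text, so there is no word boundary in front of it. When words run together like this in the text layer, a pattern without the leading `\b` may be the safer choice.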

Why OCRmyPDF?

OCRmyPDF is an application and library that adds text “layers” to images in PDFs, making scanned image PDFs searchable. It includes an image-oriented PDF optimizer, which by default runs with safe settings with the goal of improving compression with no loss of quality. Optimizations only occur after OCR and only if OCR succeeds. Optimization levels range from -O0 through -O3, where 0 disables optimization and 3 enables all options. In addition, it comes with tons of other options such as rotation correction, batch processing, selective ocrization, and so on!
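
As a small illustration of the optimization levels, the sketch below builds an OCRmyPDF command line with the `--optimize` option. The helper name `ocrmypdf_command` and the file names are hypothetical, and the default of 1 reflects our understanding of OCRmyPDF's default level. Note that running this command uses OCRmyPDF's built-in OCR engine, not the docTR route described in this article.

```python
def ocrmypdf_command(input_pdf, output_pdf, optimize=1):
    """Build an `ocrmypdf` command line with the given optimization level."""
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be 0, 1, 2 or 3")
    return ["ocrmypdf", "--optimize", str(optimize), input_pdf, output_pdf]


command = ocrmypdf_command("scan.pdf", "scan_ocr.pdf", optimize=3)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where OCRmyPDF is installed.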

This article helps overcome a major limitation of OCRmyPDF: it relies on the Tesseract OCR engine, which is not as accurate as state-of-the-art OCR solutions, so poor-quality scans can produce poor-quality OCR. That is why we used docTR in place of OCRmyPDF's default OCR engine.

Photo credit Canva Photo Collage
