Python | Reading contents of PDF using OCR (Optical Character Recognition) and renaming PDF file based on its contents

Question:

Python | Reading contents of PDF using OCR (Optical Character Recognition) and renaming PDF file based on its contents

The following code converts the pages of the PDF to images, then uses OCR (Optical Character Recognition) to read the content from each image and stores it in a text file.

I then want to rename the PDF file based on its contents. For example, if a PDF file contains the words "cooking", "baking", and "ingredients", I'd want to rename it to "Recipes.pdf".

Any help would be appreciated!

# Import libraries
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

# Path of the PDF
PDF_file = "d.pdf"

'''
Part #1 : Converting PDF to images
'''

# Store all the pages of the PDF in a variable (rendered at 500 DPI)
pages = convert_from_path(PDF_file, 500)

# Counter used to name the image saved for each page of the PDF
image_counter = 1

# Iterate through all the pages stored above
for page in pages:
    # Declare the filename for each page of the PDF as JPG.
    # For each page, the filename will be:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_" + str(image_counter) + ".jpg"

    # Save the image of the page to disk
    page.save(filename, 'JPEG')

    # Increment the counter to update the filename
    image_counter = image_counter + 1

'''
Part #2 - Recognizing text from the images using OCR
'''

# Total number of pages
filelimit = image_counter - 1

# Text file to write the output to
outfile = "out_text.txt"

# Open the file in append mode so that
# the contents of all images are added to the same file
f = open(outfile, "a")

# Iterate from 1 to the total number of pages
for i in range(1, filelimit + 1):
    # Set the filename to recognize text from.
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_" + str(i) + ".jpg"

    # Recognize the text in the image as a string using pytesseract
    text = str(pytesseract.image_to_string(Image.open(filename)))

    # The recognized text is stored in the variable text.
    # Any string processing may be applied to text.
    # Here, basic formatting has been done:
    # in many PDFs, if a word can't be written fully at a line ending,
    # a hyphen is added and the rest of the word is written on the next line.
    # E.g. in this sample text the word GeeksF-
    # orGeeks is half on the first line, the rest on the next.
    # To remove this, we replace every '-' with ''
    # (note that this also removes hyphens that belong to the text).
    text = text.replace('-', '')

    # Finally, write the processed text to the file.
    f.write(text)

# Close the file after writing all the text.

f.close()
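
One caveat about the hyphen handling in the listing above: `text.replace('-', '')` also strips hyphens that are a genuine part of the text, such as "well-known". A minimal sketch of a narrower alternative, assuming the OCR output keeps the original line breaks; the function name `join_hyphenated_line_breaks` is hypothetical and not part of pytesseract:

```python
import re


def join_hyphenated_line_breaks(text: str) -> str:
    """Join words that were split across a line break with a hyphen."""
    # Match: a word character, a hyphen, optional trailing spaces, a newline,
    # optional leading spaces, and the next word character. Only the two word
    # characters are kept, so the hyphen and the line break are dropped.
    return re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)


sample = "This word here GeeksF-\norGeeks is well-known."
print(join_hyphenated_line_breaks(sample))
```

Hyphens in the middle of a line are left alone. The trade-off is that a genuinely hyphenated compound that happens to break at a line ending is joined as well, so this is a heuristic rather than an exact fix.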


Posted Date: Nov 22, 2022 09:28 AM


Expert Answer:

To rename the PDF file based on its contents you can add the f...
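
A minimal sketch of one way to do the renaming, not necessarily what the truncated answer above goes on to describe. It assumes the OCR text is available as a single string. The mapping `CATEGORY_KEYWORDS`, the helper names `choose_name` and `rename_pdf`, and the `min_hits=2` threshold are all hypothetical choices made for illustration:

```python
import os
import re

# Hypothetical mapping: target file name -> keywords that suggest it.
CATEGORY_KEYWORDS = {
    "Recipes": {"cooking", "baking", "ingredients"},
}


def choose_name(text, categories=CATEGORY_KEYWORDS, min_hits=2):
    """Return the category with the most distinct keyword hits, or None."""
    # Lower-case the text and reduce it to a set of alphabetic words.
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_name, best_hits = None, 0
    for name, keywords in categories.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_name, best_hits = name, hits
    # Require at least min_hits distinct keywords (an assumed threshold).
    return best_name if best_hits >= min_hits else None


def rename_pdf(pdf_path, text):
    """Rename pdf_path after the best matching category; return the new path."""
    name = choose_name(text)
    if name is None:
        return pdf_path  # no confident match: leave the file untouched
    new_path = os.path.join(os.path.dirname(pdf_path), name + ".pdf")
    # Refuse to overwrite: os.rename silently replaces files on POSIX.
    if os.path.exists(new_path):
        raise FileExistsError(new_path)
    os.rename(pdf_path, new_path)
    return new_path
```

With the script from the question, this could be called once the output file has been closed, for example `rename_pdf(PDF_file, open(outfile).read())`. Because `out_text.txt` is opened in append mode, it may still hold text from earlier runs, so collecting the page texts in a list during the OCR loop and joining them is likely the safer input.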
