Question: Python | Reading the contents of a PDF using OCR (Optical Character Recognition) and renaming the PDF file based on its contents

The following code converts the pages of the PDF to images, then uses OCR (Optical Character Recognition) to read the content from each image and stores it in a text file.

I then want to rename the PDF file based on its contents. For example, if a PDF file contains the words "cooking", "baking", and "ingredients", I'd want to rename it to "Recipes.pdf".

Any help would be appreciated!

# Import libraries
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

# Path of the PDF
PDF_file = "d.pdf"

'''
Part #1 : Converting PDF to images
'''

# Store all the pages of the PDF in a variable (rendered at 500 DPI)
pages = convert_from_path(PDF_file, 500)

# Counter used to number the image saved for each page
image_counter = 1

# Iterate through all the pages stored above
for page in pages:
    # Declare the filename for each page of the PDF as JPG.
    # For each page, the filename will be:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_" + str(image_counter) + ".jpg"

    # Save the image of the page to disk
    page.save(filename, 'JPEG')

    # Increment the counter to update the filename
    image_counter = image_counter + 1

'''
Part #2 - Recognizing text from the images using OCR
'''

# Variable to get count of total number of pages
filelimit = image_counter - 1

# Creating a text file to write the output
outfile = "out_text.txt"

# Open the file in append mode so that the text of every page
# is added to the same file; the with block closes the file
# after all the text has been written
with open(outfile, "a") as f:
    # Iterate from 1 to total number of pages
    for i in range(1, filelimit + 1):
        # Set filename to recognize text from.
        # Again, these files will be:
        # page_1.jpg
        # page_2.jpg
        # ....
        # page_n.jpg
        filename = "page_" + str(i) + ".jpg"

        # Recognize the text in the image as a string using pytesseract
        text = pytesseract.image_to_string(Image.open(filename))

        # The recognized text is stored in the variable text.
        # Any string processing may be applied to text.
        # Here, basic formatting has been done:
        # In many PDFs, if a word can't be written fully at the end
        # of a line, a hyphen is added and the rest of the word is
        # written on the next line.
        # Eg: This is a sample text this word here GeeksF-
        # orGeeks is half on first line, remaining on next.
        # To undo this, we remove every '-' that is directly followed
        # by a line break; other hyphens are left in place.
        text = text.replace('-\n', '')

        # Finally, write the processed text to the file.
        f.write(text)
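
The comments in the code above describe joining words that were split by a hyphen at the end of a line. A minimal sketch of that cleanup, written so that hyphens inside a line are left alone, could look like the following. The function name join_hyphenated_lines is made up for illustration and is not part of pytesseract; the sketch also assumes the OCR output uses ordinary line breaks.

```python
import re


def join_hyphenated_lines(text):
    """Join words split by a hyphen at a line end; keep all other hyphens."""
    # Match: word character, hyphen, optional spaces, a line break,
    # optional spaces, then the word character that continues the word.
    # Only the two word characters are kept, so the halves are joined.
    return re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)


sample = "this word here GeeksF-\norGeeks is a well-known example"
print(join_hyphenated_lines(sample))
```

One limitation: a compound that really is hyphenated and happens to break at the hyphen (for example "well-" at a line end followed by "known") loses its hyphen too, since the text alone does not say which case applies.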

Step by Step Solution

To rename the PDF file based on its contents you can add the f...
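
Since the answer text is cut off here, the following is only one possible way to do the rename, not the expert's solution. It is a minimal sketch that counts how many keywords of each category appear in the OCR text and renames the file after the best match. The names CATEGORY_KEYWORDS, pick_name, rename_pdf and the min_hits threshold are all hypothetical choices made for this example.

```python
import re
from pathlib import Path

# Hypothetical mapping: target file name -> keywords that suggest it.
CATEGORY_KEYWORDS = {
    "Recipes": {"cooking", "baking", "ingredients"},
}


def pick_name(text, categories=CATEGORY_KEYWORDS, min_hits=2):
    """Return the category matching the most distinct keywords, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_name, best_hits = None, 0
    for name, keywords in categories.items():
        hits = len(words & keywords)  # distinct keywords found in the text
        if hits > best_hits:
            best_name, best_hits = name, hits
    # Require a minimum number of matches (an assumed threshold)
    return best_name if best_hits >= min_hits else None


def rename_pdf(pdf_path, text, categories=CATEGORY_KEYWORDS, min_hits=2):
    """Rename pdf_path after the best-matching category; return the final path."""
    pdf_path = Path(pdf_path)
    name = pick_name(text, categories, min_hits)
    if name is None:
        return pdf_path  # no confident match, leave the file alone
    target = pdf_path.with_name(name + ".pdf")
    if target == pdf_path:
        return pdf_path  # already has the desired name
    counter = 1
    while target.exists():  # avoid overwriting an existing file
        target = pdf_path.with_name(f"{name}_{counter}.pdf")
        counter += 1
    pdf_path.rename(target)
    return target
```

To use it with the question's code, one option is to collect each page's text in a list inside the OCR loop (say page_texts, another made-up name) and call rename_pdf(PDF_file, " ".join(page_texts)) after the loop. Reading the text back from out_text.txt also works, but because that file is opened in append mode it may still hold text from earlier runs, which could skew the keyword match.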