Question:
Python | Reading contents of PDF using OCR (Optical Character Recognition) and renaming PDF file based on its contents
The following code converts the pages of a PDF to images, then uses OCR (Optical Character Recognition) to read the content from each image and stores it in a text file.
I then want to be able to rename the PDF file based on its contents. For example, if a PDF file contains the words "cooking", "baking", and "ingredients", I'd want to rename it to "Recipes.pdf".
Any help would be appreciated!
# Import libraries
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
# Path of the pdf
PDF_file = "d.pdf"
'''
Part #1 : Converting PDF to images
'''
# Store all the pages of the PDF in a variable
pages = convert_from_path(PDF_file, 500)
# Counter used to number the image file saved for each page of the PDF
image_counter = 1
# Iterate through all the pages stored above
for page in pages:
    # Declare the filename for each page of the PDF as a JPG.
    # For each page, the filename will be:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_" + str(image_counter) + ".jpg"
    # Save the image of the page to disk
    page.save(filename, 'JPEG')
    # Increment the counter to update the filename
    image_counter = image_counter + 1
'''
Part #2 - Recognizing text from the images using OCR
'''
# Variable to get count of total number of pages
filelimit = image_counter-1
# Creating a text file to write the output
outfile = "out_text.txt"
# Open the file in append mode so that
# All contents of all images are added to the same file
f = open(outfile, "a")
# Iterate from 1 to total number of pages
for i in range(1, filelimit + 1):
    # Set the filename to recognize text from.
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_" + str(i) + ".jpg"
    # Recognize the text in the image as a string using pytesseract
    text = pytesseract.image_to_string(Image.open(filename))
    # The recognized text is stored in the variable text.
    # Any string processing may be applied to text.
    # Here, basic formatting has been done:
    # In many PDFs, if a word can't be written fully at the end of a line,
    # a hyphen is added and the rest of the word is written on the next line.
    # E.g.: "This is a sample text this word here GeeksF-
    # orGeeks" is half on the first line, the remainder on the next.
    # To undo this, remove a hyphen only when it is directly followed by a
    # line break, so ordinary hyphens inside a line are kept.
    text = text.replace('-\n', '')
    # Finally, write the processed text to the file.
    f.write(text)
# Close the file after writing all the text.
f.close()
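One possible way to do the renaming step is sketched below. It is only a rough sketch, not a definitive solution: the `CATEGORY_KEYWORDS` table, the `pick_category` and `rename_pdf` helpers, and the `min_hits` threshold are all hypothetical names and values made up for illustration. The idea is to count how many words of the OCR text belong to each category's keyword set and to rename the PDF after the best-scoring category, leaving the file alone when nothing matches well enough.

```python
import re
from pathlib import Path

# Hypothetical mapping of target file name -> keywords (an assumption,
# adjust it to your own documents).
CATEGORY_KEYWORDS = {
    "Recipes": {"cooking", "baking", "ingredients"},
    "Invoices": {"invoice", "payment", "total"},
}


def pick_category(text, categories=CATEGORY_KEYWORDS, min_hits=2):
    """Return the category whose keywords occur most often in text, or None."""
    words = re.findall(r"[a-z]+", text.lower())
    best_name, best_hits = None, 0
    for name, keywords in categories.items():
        hits = sum(1 for word in words if word in keywords)
        if hits > best_hits:
            best_name, best_hits = name, hits
    # Require a minimum number of matches to avoid renaming on a stray word
    return best_name if best_hits >= min_hits else None


def rename_pdf(pdf_path, text):
    """Rename pdf_path to '<Category>.pdf' in the same folder.

    Returns the final path (unchanged if no category matched).
    """
    pdf_path = Path(pdf_path)
    category = pick_category(text)
    if category is None:
        return pdf_path
    target = pdf_path.with_name(category + ".pdf")
    if target == pdf_path:
        return pdf_path  # already has the right name
    counter = 1
    while target.exists():  # do not overwrite an existing file
        target = pdf_path.with_name(f"{category}_{counter}.pdf")
        counter += 1
    pdf_path.rename(target)
    return target


# Possible usage with the script above, once the text file has been closed:
# with open(outfile) as f:
#     rename_pdf(PDF_file, f.read())
```

Note that the script above opens `out_text.txt` in append mode, so text from earlier runs stays in the file. If that matters, it may be safer to collect the recognized text of the current PDF in a string and pass that string to `rename_pdf` instead.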