Python | Reading contents of PDF using OCR (Optical Character Recognition) and renaming PDF file based on its contents

The following code converts the pages of the PDF to images, then uses OCR (Optical Character Recognition) to read the content from each image and stores it in a text file.

I then want to be able to rename the PDF file based on its contents. For example, if a PDF file contains the words "cooking", "baking", and "ingredients", I'd want to rename it to "Recipes.pdf".

Any help would be appreciated!

# Import libraries
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

# Path of the PDF
PDF_file = "d.pdf"

'''
Part #1 : Converting PDF to images
'''

# Store all the pages of the PDF in a variable (rendered at 500 DPI)
pages = convert_from_path(PDF_file, 500)

# Counter used to number the image saved for each page of the PDF
image_counter = 1

# Iterate through all the pages stored above
for page in pages:

    # Declaring filename for each page of PDF as JPG
    # For each page, filename will be:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_" + str(image_counter) + ".jpg"

    # Save the image of the page in system
    page.save(filename, 'JPEG')

    # Increment the counter to update filename
    image_counter = image_counter + 1

'''
Part #2 - Recognizing text from the images using OCR
'''

# Variable to get count of total number of pages
filelimit = image_counter - 1

# Creating a text file to write the output
outfile = "out_text.txt"

# Open the file in append mode so that
# all contents of all images are added to the same file
with open(outfile, "a") as f:

    # Iterate from 1 to total number of pages
    for i in range(1, filelimit + 1):

        # Set filename to recognize text from
        # Again, these files will be:
        # page_1.jpg
        # page_2.jpg
        # ....
        # page_n.jpg
        filename = "page_" + str(i) + ".jpg"

        # Recognize the text as string in image using pytesseract
        text = str(pytesseract.image_to_string(Image.open(filename)))

        # The recognized text is stored in variable text
        # Any string processing may be applied on text
        # Here, basic formatting has been done:
        # In many PDFs, at line ending, if a word can't
        # be written fully, a 'hyphen' is added.
        # The rest of the word is written in the next line
        # Eg: This is a sample text this word here GeeksF-
        # orGeeks is half on first line, remaining on next.
        # To remove this, we delete each '-' that is directly followed
        # by a line break; hyphens inside a line are left untouched.
        text = text.replace('-\n', '')

        # Finally, write the processed text to the file.
        f.write(text)

# The file is closed automatically when the with block ends.
