Question: The Python code below does not produce everything the following requirements ask for:

 

  1. Remove all punctuation and non-English words, then count the remaining words in the file.
  2. Use the words left after step 1 to build a word dictionary in which every word is unique (e.g. "But" and "but" should be treated as the same word).
    • Count the number of distinct words in the dictionary.
    • Display the words in the dictionary in alphabetical order.
  3. Select three sentences from the file, then use any POS tagging tool to identify the POS tags in the selected sentences.
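For reference, steps 1 and 2 can be sketched with the standard library alone. This is a minimal sketch, not the asker's method: `clean_words` and `build_dictionary` are hypothetical helper names, and `sample_vocab` is a tiny stand-in for a real English word list (for example a set built from `nltk.corpus.words.words()`). Passing the vocabulary in as a parameter makes it easy to see which words a given word list rejects.

```python
import string
from collections import Counter


def clean_words(text, vocab):
    """Strip punctuation, then keep only tokens whose lowercase form is in vocab."""
    # vocab is assumed to be a set of lowercase English words.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [w.lower() for w in no_punct.split() if w.lower() in vocab]


def build_dictionary(words):
    """Map each distinct (already lowercased) word to its number of occurrences."""
    return Counter(words)


# Illustrative data only; "qwrtz" stands in for a non-English token.
sample_text = "But the rose, but not the qwrtz!"
sample_vocab = {"but", "the", "rose", "not"}

words = clean_words(sample_text, sample_vocab)
word_dict = build_dictionary(words)

print("Total processed words:", len(words))
print("Number of distinct words:", len(word_dict))
for word in sorted(word_dict):
    print(word, word_dict[word])
```

Lowercasing before the dictionary is built is what makes "But" and "but" collapse into one entry. Note that any word missing from the chosen vocabulary is silently dropped, so a word list that lacks plurals or possessives may lower both counts.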

 

Code:

import string
import nltk
from collections import OrderedDict

# Download necessary NLTK data
nltk.download('averaged_perceptron_tagger')
nltk.download('words')

# Define function to check if a word is English
english_vocab = set(w.lower() for w in nltk.corpus.words.words())

def is_english(word):
    return word.lower() in english_vocab

try:
    with open(r'file_path', 'r') as f:
        text = f.read()

    # Preprocess: Remove punctuation and non-English words
    exclude = set(string.punctuation)
    text = ''.join(ch for ch in text if ch not in exclude and ch.isascii())
    words = text.split()
    words = [word for word in words if is_english(word)]

    # Count processed words and print
    total_processed_words = len(words)
    print(f"Total processed words: {total_processed_words}")

    # Build dictionary of unique words
    word_dict = OrderedDict()
    for word in words:
        word_lower = word.lower()
        if word_lower not in word_dict:
            word_dict[word_lower] = 1
        else:
            word_dict[word_lower] += 1

    # Count distinct words and print
    distinct_word_count = len(word_dict)
    print(f"Number of distinct words: {distinct_word_count}")

    # Print words in alphabetical order
    print("\nWords in alphabetical order:")
    for word in sorted(word_dict):
        print(word)

    # Select sentences and POS tag
    # Replace these sentences with ones from your file if necessary
    sentences = [
        "from fairest creatures we desire increase that thereby beautys rose might never die",
        "when forty winters shall besiege thy brow and dig deep trenches in thy beautys field",
        "for where is she so fair whose uneared womb disdains the tillage of thy husbandry"
    ]

    for sentence in sentences:
        pos_tags = nltk.pos_tag(sentence.split())
        print("\nSentence:", sentence)
        print("POS Tags:", pos_tags)

except Exception as e:
    print("An error occurred:", e)
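One likely gap against requirement 3 is that the sentences above are hard-coded and already stripped of punctuation, rather than selected from the file. A minimal sketch of selecting the first three sentences from the raw text (before any punctuation is removed) and tagging them might look like the following. The names `split_sentences`, `tag_first_sentences`, and `dummy_tagger` are hypothetical, the regular-expression splitter is a naive stand-in for a real sentence tokenizer, and the tagger is passed in as a callable so the sketch runs without NLTK data.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tag_first_sentences(text, tagger, n=3):
    """Return (sentence, tags) pairs for the first n sentences of text."""
    results = []
    for sentence in split_sentences(text)[:n]:
        # Keep letters and apostrophes so words such as "beauty's" stay whole.
        tokens = re.findall(r"[A-Za-z']+", sentence)
        results.append((sentence, tagger(tokens)))
    return results


def dummy_tagger(tokens):
    # Placeholder tagger (an assumption for this sketch): labels every token "X".
    return [(tok, "X") for tok in tokens]


# Illustrative text only; in practice this would be the unmodified file contents.
raw_text = "The rose is red. Is it fair? It never dies! A fourth sentence."
for sentence, tags in tag_first_sentences(raw_text, dummy_tagger):
    print("Sentence:", sentence)
    print("POS Tags:", tags)
```

With NLTK installed, `nltk.pos_tag` could be passed as `tagger` in place of `dummy_tagger`. If tagging fails with a LookupError, the message names the data package to download; depending on the NLTK version, that name may differ from `averaged_perceptron_tagger`. Because the original script catches every exception in one block, such an error would stop the run at the point of failure and could be part of why some output is missing.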
