Question: Python 3 Authorship attribution is a system which attempts to determine who wrote a given document, based on analysis of the language used and style

Python 3

"Authorship attribution" is a system which attempts to determine who wrote a given document, based on analysis of the language used and style of that document. We will break the system down into multiple parts, to make it clearer what the different moving parts are, and make it easier for you to test your system.

The first step in our authorship attribution system will be to take a document, separate it out into its component words, and construct/return a dictionary of word frequencies. As we are focused on the English language, we will assume that "words" are separated by whitespace, in the form of spaces (' '), tabs ('\t') and newline characters (' ').

We will also do something slightly unconventional in considering each "standalone" non-alphabetic character (i.e. any character other than whitespace, or upper- or lower-case alphabetic characters) to be a single word. For example, given the document 'Dynamic-typed variables, Python; really?!!', the component words, in sequence, would be 'Dynamic-typed' (noting that '-' here is not considered to be a word despite being non-alphabetic, as it is surrounded by alphabetic characters), 'variables', ',', 'Python', ';', 'really', '?', '!', '!'. Note here that, in the case of the document starting with 'Dynamic--typed', the breakdown into words would instead be 'Dynamic', '-', '-', and 'typed', as both of the hyphens neighbour a non-alphabetic letter. Note also that case should be preserved in the output (i.e. if a word is upper case in the original, it should remain in upper case).

Write a function authattr_worddict(doc) that takes a single string argument doc and returns a dictionary (dict) of words contained in doc (as defined above), with the frequency of each word as an int. Note that, as the output is a dict, the order of those words may not correspond exactly to that indicated below, and that the testing will accept any word ordering within the dictionary.

Here are some example calls to your authattr_worddict function:

>>> authattr_worddict('Dynamic-typed variables, Python; really?!!') {'Dynamic-typed': 1, 'Python': 1, 'really': 1, '!': 2, 'variables': 1, '?': 1, ',': 1, ';': 1}

>>> authattr_worddict('') {}

>>> authattr_worddict("Truly, rooly, rooly, indisputably 'tis ..... Gr00vy") {"'": 1, 'vy': 1, '.': 5, '0': 2, 'tis': 1, 'rooly': 2, 'Truly': 1, 'indisputably': 1, 'Gr': 1, ',': 3}

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!