The Applied Sciences team is seeking to create a sub-word tokeniser to preprocess free- text fields...
Fantastic news! We've Found the answer you've been seeking!
Question:
Transcribed Image Text:
The Applied Sciences team is seeking to create a sub-word tokeniser to preprocess free- text fields for downstream analysis. You've been asked to write a function tokenise that does the following: 1. Accepts two arguments, raw_text and window_size, and returns a list of tuples, where the tuple contains the sub-word tokens for each word in raw_text 2. Splits raw_text into individual words on spaces and punctuation (except apostrophes) 3. Appends angle brackets to either end of a word to delimit the start and end of the word (e.g. <word>) 4. Creates sub-word tokens of window_size length. If a word (including the angle brackets) is shorter than window_size, then no sub-word tokens other than the special token (step 5) should be generated. 5. Appends a single, special token at the end of the sub-word token list containing the entire word with angle brackets NB. The implementation and partial solutions will be assessed too. Please don't panic if you can't pass all of the test cases! Example ### Running your function >>> tokenise (raw_text="hello, world", window_size=3) ### Returns ### NOTE: the output below has been formatted for readability. ### Your function just needs to output the list of tuples [ ] ('<he', 'hel', 'ell', 'llo', 'lo>', '<hello>'), ('<wo', 'wor', 'orl', 'rld', 'ld>', '<world>') # Complete the 'tokenise' function below. # # The type signatures have been completed for you # You may use helper functions to modularise your code def tokenise (raw_text: str, window_size: int) -> List [Tuple [str]]: # Write your code here name__ == ' __main__':- > if 11 The Applied Sciences team is seeking to create a sub-word tokeniser to preprocess free- text fields for downstream analysis. You've been asked to write a function tokenise that does the following: 1. Accepts two arguments, raw_text and window_size, and returns a list of tuples, where the tuple contains the sub-word tokens for each word in raw_text 2. Splits raw_text into individual words on spaces and punctuation (except apostrophes) 3. Appends angle brackets to either end of a word to delimit the start and end of the word (e.g. <word>) 4. Creates sub-word tokens of window_size length. If a word (including the angle brackets) is shorter than window_size, then no sub-word tokens other than the special token (step 5) should be generated. 5. Appends a single, special token at the end of the sub-word token list containing the entire word with angle brackets NB. The implementation and partial solutions will be assessed too. Please don't panic if you can't pass all of the test cases! Example ### Running your function >>> tokenise (raw_text="hello, world", window_size=3) ### Returns ### NOTE: the output below has been formatted for readability. ### Your function just needs to output the list of tuples [ ] ('<he', 'hel', 'ell', 'llo', 'lo>', '<hello>'), ('<wo', 'wor', 'orl', 'rld', 'ld>', '<world>') # Complete the 'tokenise' function below. # # The type signatures have been completed for you # You may use helper functions to modularise your code def tokenise (raw_text: str, window_size: int) -> List [Tuple [str]]: # Write your code here name__ == ' __main__':- > if 11
Expert Answer:
Answer rating: 100% (QA)
To create the tokenise function based on the requirements youve pro... View the full answer
Related Book For
Smith and Roberson Business Law
ISBN: 978-0538473637
15th Edition
Authors: Richard A. Mann, Barry S. Roberts
Posted Date:
Students also viewed these programming questions
-
For Hy-Vee Inc. identify one strategic goal and trace that goal through the tactical and operational levels of planning with one specific example at each level. Specify the appropriate person in the...
-
You are an independent consultant, providing strategy, execution, and project management services to several clients, across a variety of different project types. Your skills are valued by these...
-
Planning is one of the most important management functions in any business. A front office managers first step in planning should involve determine the departments goals. Planning also includes...
-
Randi Corp. is considering the replacement of some machinery that has zero book value and a current market value of $3,700. One possible alternative is to invest in new machinery that costs $30,900....
-
Capriati Corporation commenced operations in early 2010. The corporation incurred $60,000 of costs such as fees to underwriters, legal fees, state fees, and promotional expenditures during its...
-
Explain how the structure of simple squamous epithelium provides its function.
-
Sketch and describe the magnetic poles of a spherical piece of uniformly magnetized material.
-
Pannier Company is the parent company that owns an 80% interest in Jodestar Company. The interest was acquired at book value, and the simple equity method is used to record the ownership interest....
-
1.Discuss the rationale and significance of the Security Market Line (SML) as a representation of the valuation of risky securities. 2.Outline five of the positive characteristics that are...
-
In this question, you are given a Timed Petri net as shown below. The time stamps of the tokens (i.e., black dots in some places) are also provided on the figure below. The delay on the transition T0...
-
The information given below was taken from the payroll records of Clegg Company (Oregon employer) for 20--. Use the information to complete the partially illustrated Form 940 shown below. Assume that...
-
Suppose a political scientist is interested in whether wealth is a determining factor in the individuals propensity to vote. A random sample of 500 people who earned $ 100,000 or more showed that 390...
-
A poll was done to predict the outcome of the upcoming election. Of the 900 potential voters who responded, 500 plan to vote for the incumbent. If a candidate needs 50 % of the votes to win the...
-
A mutual fund manager claims that the returns of stocks in her fund have a variance of no more than .50. A random sample of 25 stocks in her fund has a sample variance of .72. Assuming that the...
-
A muffler manufacturer claims that the variance of its product is no more than 200. A random sample of 25 mufflers has a sample variance of 391. Assuming a normal distribution, test the manufacturers...
-
The IRS is interested in knowing whether people who have an accountant prepare their tax returns have fewer errors than people who prepare their own returns. A random sample of 500 people who had...
-
he 5. Instant cold packs use an endothermic reaction. These types of ice packs are stored at room temperature rather than needing to be physically cooled before use. When one breaks a tube inside the...
-
A glass manufacturer produces hand mirrors. Each mirror is supposed to meet company standards for such things as glass thickness, ability to reflect, size of handle, quality of glass, color of...
-
During the years prior to the passage of the Civil Rights Act of 1964, Duke Power openly discriminated against African Americans by allowing them to work only in the labor department of the plants...
-
This is a stocklist case arising under 220(b) of our [Delaware] General Corporation Law. The issue is whether a shareholder states a proper purpose for inspection under our statute in seeking to...
-
Civil Code 1719, subdivision (a) provides in part that any person who draws a check that is dishonored due to insufficient funds shall be liable to the payee for the amount owing upon the check and...
-
Where does the management accounting function fit into an organizations structure?
-
What are the ethical responsibilities of management accountants?
-
Diana Corporation provides the following information for 2017: Calculate (a) Cost of goods manufactured in 2017 and (b) Cost of goods sold in 2017 Beginning work-in-process inventory, 1/1/2017 Total...
Armstrongs Handbook Of Human Resource Management Practice 1st Edition - ISBN: 0749465506 - Free Book
Study smarter with the SolutionInn App