Question: Task 1: Parallel Corpora
Parallel corpora contain a collection of texts in a given language and their translation to one or more other
languages. In this task, you will build a small parallel corpus using data from OpenSubtitles.org, a
database that allows you to search and download subtitles for various languages. It was previously used to
build the OpenSubtitles corpus, which consists of around billion sentences and covers languages.
Search for the film Monty Python and the Holy Grail on OpenSubtitles.org and download subtitles
for English, German, and a third language of your choosing. Open the files using a text editor (e.g. VS
Code) and familiarise yourself with the format. Your corpus will include sentences from a famous scene
that starts at :: (first English sentence is: "Quiet! There are ways of telling whether she is a witch.")
and ends at :: (last English sentence is: "knight of the Round Table."). Your goal is to clean up the
data, match subtitles in different languages and put the lines together, transforming them into the following
format:
line in English
line in German
line in chosen language
line in English
line in German
line in chosen language
You will see that this manual process is not feasible for greater amounts of data, and you will learn how to
automate a process like this later on in the course.
Save the created corpus as grailcorpus.txt and submit the file together with the assignment.
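The clean-up and interleaving steps described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the assignment's intended method: it assumes the downloaded files are in SubRip (.srt) format, which the task does not state, and that the output has one line per language with no separator between groups. The helper names parse_srt, in_window and interleave are made up for this sketch. Subtitle cues in different languages rarely match one-to-one, so the per-language lists still have to be aligned by hand before interleaving.

```python
import re
from datetime import timedelta

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:05,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)
# Formatting markup such as <i>...</i> or {\an8} (assumed to be unwanted).
TAG = re.compile(r"<[^>]+>|\{[^}]*\}")


def parse_srt(text):
    """Return a list of (start, end, text) tuples from SRT-formatted text."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            m = TIMING.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
                end = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
                # Join multi-line cues into one line and strip markup.
                body = " ".join(part.strip() for part in lines[i + 1:])
                body = TAG.sub("", body).strip()
                if body:
                    cues.append((start, end, body))
                break
    return cues


def in_window(cues, start, end):
    """Keep only cues whose start time lies inside [start, end]."""
    return [cue for cue in cues if start <= cue[0] <= end]


def interleave(*languages):
    """Interleave equal-length lists: en, de, third, en, de, third, ..."""
    counts = [len(lang) for lang in languages]
    if len(set(counts)) != 1:
        raise ValueError(f"line counts differ: {counts}")
    return [line for group in zip(*languages) for line in group]
```

To use it, read each subtitle file, pass its contents to parse_srt, cut the scene out with in_window using the start and end times given in the task, fix the alignment by hand, and write "\n".join(interleave(english, german, third)) to grailcorpus.txt. The ValueError in interleave is a deliberate guard: it refuses to write a corpus when the three lists have drifted out of step.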
