Question: Task 1: Parallel Corpora
Parallel corpora contain a collection of texts in a given language and their translation to one or more other
languages. In this task, you will build a small parallel corpus using data from OpenSubtitles.org, a
database that allows you to search and download subtitles for various languages. It was previously used to
build the OpenSubtitles corpus, which consists of around 2.6 billion sentences and covers 60 languages.
Search for the film Monty Python and the Holy Grail (1975) on OpenSubtitles.org and download subtitles
for English, German, and a third language of your choosing. Open the files using a text editor (e.g. VS
Code) and familiarise yourself with the format. Your corpus will include sentences from a famous scene
that starts at 00:17:48 (first English sentence: "Quiet! There are ways of telling whether she is a witch.")
and ends at 00:20:31 (last English sentence: "...knight of the Round Table."). Your goal is to clean up the
data, match subtitles in different languages and put the lines together, transforming them into the following
format:
line 1 in English
line 1 in German
line 1 in chosen language
line 2 in English
line 2 in German
line 2 in chosen language
...
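The layout above can be sketched as a small interleaving step. This is a minimal sketch, assuming the three languages have already been matched so that each list holds the same number of lines in the same order; the function name `interleave` and the placeholder strings are hypothetical and not part of the assignment.

```python
from itertools import chain


def interleave(english, german, third):
    """Return lines ordered as en1, de1, x1, en2, de2, x2, ..."""
    # Assumption: the lists were aligned beforehand, so index i refers to
    # the same subtitle line in all three languages.
    if not (len(english) == len(german) == len(third)):
        raise ValueError("all three languages need the same number of aligned lines")
    return list(chain.from_iterable(zip(english, german, third)))


# Placeholder strings only, not real subtitle text.
corpus_lines = interleave(
    ["line 1 in English", "line 2 in English"],
    ["line 1 in German", "line 2 in German"],
    ["line 1 in chosen language", "line 2 in chosen language"],
)
print("\n".join(corpus_lines))
```

The length check is there because subtitle files for different languages often split sentences differently, so a mismatch usually means the matching step still needs manual attention.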
You will see that this manual process is not feasible for larger amounts of data, and you will learn how to
automate a process like this later on in the course.
Save the created corpus as grail_corpus.txt and submit the file together with the assignment.
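
The clean-up and scene-selection part of the task could be approached roughly as follows. This is a sketch under stated assumptions, not a definitive solution: it assumes the downloaded files are in SubRip (.srt) format (a numeric index, a `start --> end` timestamp line, then the text), and the names `parse_srt`, `scene_lines`, and the `slack` parameter are hypothetical helpers introduced for illustration. Timings usually differ slightly between subtitle files, so the selected lines still need to be checked and matched by hand.

```python
import re

# Timestamp line of an SRT cue, e.g. "00:17:48,200 --> 00:17:50,000".
TIME_LINE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2})[,.](\d{3})"
)
# Inline markup such as <i>...</i> or {\an8} that should not end up in the corpus.
TAG = re.compile(r"<[^>]+>|\{[^}]*\}")


def to_seconds(h, m, s, ms):
    """Convert timestamp parts to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text):
    """Yield (start_seconds, cleaned_text) for each cue in SRT-formatted text."""
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIME_LINE.search(line)
            if match:
                start = to_seconds(*match.groups()[:4])
                # Join multi-row cues into one line and strip markup.
                body = " ".join(row.strip() for row in lines[i + 1:] if row.strip())
                yield start, TAG.sub("", body).strip()
                break


def scene_lines(text, start="00:17:48", end="00:20:31", slack=1.0):
    """Return cleaned cue texts whose start time falls inside the scene window.

    `slack` (seconds) is an assumed tolerance for small timing differences.
    """
    lo = to_seconds(*start.split(":"), 0) - slack
    hi = to_seconds(*end.split(":"), 0) + slack
    return [cue for t, cue in parse_srt(text) if lo <= t <= hi and cue]


# Placeholder cues only, not real subtitle text.
sample = (
    "1\n00:17:40,000 --> 00:17:42,000\nbefore the scene\n\n"
    "2\n00:17:48,200 --> 00:17:50,000\n<i>first placeholder</i>\nsecond row\n\n"
    "3\n00:20:40,000 --> 00:20:42,000\nafter the scene\n"
)
selected = scene_lines(sample)
print(selected)
```

Once the per-language lists have been matched and interleaved, they can be written out with `open("grail_corpus.txt", "w", encoding="utf-8")`. When reading the downloaded files, the encoding may need to be set explicitly, since older subtitle files are not always UTF-8.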
