Question:

1. Build a 4-gram language model

Each student needs to collect an Arabic corpus of at least 100,000 words; the larger the corpus, the better. A bonus will be given if the corpus contains Arabic dialects. Students cannot use the same corpus, fully or partially. Write a program to tokenize the corpus into tokens/words, then build a 4-gram model for this corpus. That is, your language model is a table that contains the token, the token count, and the token probability. The language model should be saved in CSV format.

2. Develop a plagiarism detection interface

Develop a program (in Java) that uses your language model to compute a plagiarism score for a given sentence. In other words, the user can write a sentence in Arabic, and when clicking "Go", the program will compute the probability of this sentence using the language model. This probability should be tuned to reflect a plagiarism score: the more similar a given sentence is (fully or partially) to sentences in the corpus, the higher the plagiarism score.

Example:

Submission: corpus, language model.csv, source code, and all files used to run the project. During the discussion, students will also be asked theoretical questions related to NLP.
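One possible starting point for both parts is sketched below in Java. It is a minimal sketch under stated assumptions, not the expected solution: the class name `FourGramSketch`, its method names, and the CSV header `ngram,count,probability` are all hypothetical. The tokenizer simply splits on anything that is not a letter, combining mark, or digit, the probability column is assumed to be the conditional estimate count(w1 w2 w3 w4) / count(w1 w2 w3 as a 4-gram prefix), and the plagiarism score is assumed to be the fraction of the sentence's 4-grams that also occur in the corpus, which is only one way of tuning the model output into a score. The "Go" button interface and file I/O are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FourGramSketch {
    static final int N = 4;

    // Split on anything that is not a letter, combining mark, or digit.
    // Keeping \p{M} means Arabic diacritics do not break a word apart.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{M}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Count every run of n consecutive tokens, keyed by the space-joined n-gram.
    static Map<String, Integer> countNgrams(List<String> tokens, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            counts.merge(String.join(" ", tokens.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
    }

    // Render the model as CSV text: n-gram, count, conditional probability.
    // Assumption: P(w4 | w1 w2 w3) = count(4-gram) / total count of 4-grams sharing the prefix.
    static String toCsv(Map<String, Integer> fourGrams) {
        Map<String, Integer> prefixCounts = new HashMap<>();
        for (Map.Entry<String, Integer> e : fourGrams.entrySet()) {
            String prefix = e.getKey().substring(0, e.getKey().lastIndexOf(' '));
            prefixCounts.merge(prefix, e.getValue(), Integer::sum);
        }
        StringBuilder sb = new StringBuilder("ngram,count,probability\n");
        for (Map.Entry<String, Integer> e : fourGrams.entrySet()) {
            String prefix = e.getKey().substring(0, e.getKey().lastIndexOf(' '));
            double p = (double) e.getValue() / prefixCounts.get(prefix);
            sb.append(e.getKey()).append(',').append(e.getValue()).append(',').append(p).append('\n');
        }
        return sb.toString();
    }

    // Assumed scoring rule: share of the sentence's 4-grams that appear in the corpus model.
    // Sentences with fewer than four tokens have no 4-grams and score 0.
    static double plagiarismScore(String sentence, Map<String, Integer> model) {
        int total = 0;
        int matched = 0;
        for (Map.Entry<String, Integer> e : countNgrams(tokenize(sentence), N).entrySet()) {
            total += e.getValue();
            if (model.containsKey(e.getKey())) matched += e.getValue();
        }
        return total == 0 ? 0.0 : (double) matched / total;
    }

    public static void main(String[] args) {
        // Toy one-sentence corpus; a real run would read the collected corpus from a file.
        String corpus = "ذهب الولد إلى المدرسة في الصباح";
        Map<String, Integer> model = countNgrams(tokenize(corpus), N);
        System.out.print(toCsv(model));
        System.out.println(plagiarismScore("ذهب الولد إلى المدرسة مع صديقه", model));
    }
}
```

A score near 1 means almost every 4-gram of the sentence was seen in the corpus, and a score near 0 means almost none were. A version closer to the wording of the task could instead multiply the conditional probabilities from the CSV table, with smoothing for unseen 4-grams, and then rescale the result.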
