Question: How do I create a PySpark program that extracts all the N-grams from a CSV file?
I load the file like this:
DF = spark.read.format('csv').option('header', "true").load("file.csv")
I then convert the DataFrame to an RDD with:
RDD_DF = DF.rdd.map(lambda x: x[0])
I only want the first column. Now how do I create an N-gram filter?
After running the code above, I get something like this:
['This was a good time. Apples.',
'dogs. mason prop',
'There are many cats']
I want code that selects all the n-grams: unigrams, bigrams, trigrams, quadgrams, etc. For example, if the code collected all unigrams, it would return
Apples
dogs
Bigrams would return
mason prop
etc.
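One possible approach, sketched under some assumptions rather than offered as a definitive answer: the expected output above suggests that each period-delimited segment of a cell counts as one n-gram, with n equal to its word count (so "Apples" is a unigram and "mason prop" is a bigram). The helper names below (split_segments, segment_ngrams, sliding_ngrams, extract_with_spark) are hypothetical and not part of PySpark. Only periods are treated as separators, and other punctuation is left untouched. The sliding_ngrams helper covers the more conventional definition, every run of n consecutive words, in case that is what is actually needed.

```python
def split_segments(text):
    """Split a cell on periods; return each non-empty segment as a list of words."""
    return [seg.split() for seg in text.split(".") if seg.strip()]


def segment_ngrams(text):
    """Yield (n, phrase) per period-delimited segment, where n is its word count."""
    for words in split_segments(text):
        yield len(words), " ".join(words)


def sliding_ngrams(text, n):
    """Conventional n-grams: every run of n consecutive words within a segment."""
    return [
        " ".join(words[i:i + n])
        for words in split_segments(text)
        for i in range(len(words) - n + 1)
    ]


def extract_with_spark(path):
    """Hypothetical wiring into Spark; assumes `path` is a CSV file with a header row."""
    # Imported lazily so the pure-Python helpers above can be tried without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ngrams").getOrCreate()
    df = spark.read.format("csv").option("header", "true").load(path)
    # Keep only the first column and skip empty cells.
    first_col = df.rdd.map(lambda row: row[0]).filter(lambda cell: cell is not None)
    # Result is a list of (n, [phrases]) pairs; n == 1 holds the single-word segments.
    return first_col.flatMap(segment_ngrams).groupByKey().mapValues(list).collect()


# Quick local check on the sample rows from the question (no Spark needed).
rows = ["This was a good time. Apples.", "dogs. mason prop", "There are many cats"]
grouped = {}
for row in rows:
    for n, phrase in segment_ngrams(row):
        grouped.setdefault(n, []).append(phrase)
print(grouped[1])  # ['Apples', 'dogs']
print(grouped[2])  # ['mason prop']
```

With the RDD from the question, the same grouping would be RDD_DF.flatMap(segment_ngrams).groupByKey().mapValues(list). If sliding-window n-grams over a tokenized DataFrame column are the goal instead, pyspark.ml.feature.NGram is worth a look.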
