Question:

Topic discovery via k-means. In this problem you will use k-means to cluster 300 Wikipedia articles selected from 5 broad groups of topics. The Julia file wikipedia_corpus.jl contains the histograms as a list of 300 1000-vectors in the variable article_histograms. It also provides the list of article titles in article_titles and the list of the 1000 words used to create the histograms in dictionary.

The file kmeans.jl provides a Julia implementation of the k-means algorithm in the function kmeans. The kmeans function accepts a list of vectors to cluster along with the number of clusters, k, and returns three things: the centroids as a list of vectors, a list containing the index of each vector's closest centroid, and a list of the value of J after each iteration of k-means. Each time the function kmeans is invoked it initializes the centroids by randomly assigning the data points to k groups and taking the k group means as the representatives. (This means that if you run kmeans twice with the same data, you might get different results.)

For example, here is how to run k-means with k = 8 and find the 30th article's centroid:

    include("wikipedia_corpus.jl")
    include("kmeans.jl")
    using Kmeans

    centroids, labels, j_hist = kmeans(article_histograms, 8)
    centroids[labels[30]]

The list labels contains the index of each vector's closest centroid, so if the 30th entry in labels is 7, then the 30th vector's closest centroid is the 7th entry in centroids.

There are many ways to explore your results. For example, you could print the titles of all articles in a cluster:

    julia> article_titles[labels .== 7]
    16-element Array{UTF8String, 1}:
     "Anemometer"
     "Black ice"
     "Freezing rain"
     ...

Alternatively, you could find a topic's most common words by ordering dictionary by the size of its centroid's entries. A larger entry for a word implies it was more common in articles from that topic:

    julia> dictionary[sortperm(centroids[7], rev=true)]
    1000-element Array{ASCIIString, 1}:
     "wind"
     "ice"
     "temperature"
     ...

(a) For each of k = 2, k = 5, and k = 10, run k-means twice and plot J (vertically) versus iteration (horizontally) for the two runs on the same plot. Create your plot by passing a vector containing the value of J at each iteration to PyPlot's plot function. Comment briefly on your results.
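For intuition, here is a minimal sketch of the random initialization described above: each data point is assigned to one of k groups at random, and the group means become the starting representatives. The function name random_init is hypothetical and is not part of kmeans.jl.

    # Hypothetical sketch of the initialization described in the problem:
    # assign each point to one of k groups at random and use the group means
    # as the initial centroids (assumes every group receives at least one point).
    function random_init(x, k)
        assignment = rand(1:k, length(x))        # random group index for each point
        return [sum(x[assignment .== j]) / sum(assignment .== j) for j in 1:k]
    end

Because this assignment is random, repeated calls produce different starting centroids, which is why repeated runs of kmeans can converge to different clusterings.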
include("wikipedia_corpus.ji") include("kmeans.jl") using Kmeans centroids, labels, j_hist = kmeans(article_histograms, 8) centroids[labels[30]] The list labels contains the index of each vector's closest centroid, so if the 30th entry in labels is 7, then the the 30th vector's closest centroid is the 7th entry in centroids. There are many ways to explore your results. For example, you could print the titles of all articles in a cluster. 11 julia> article_titles[labels . == 7] 16-element Array{UTF8String, 1}: "Anemometer" "Black ice" "Freezing rain" ... Alternatively, you could find a topic's most common words by ordering dictionary by the size of its centroid's entries. A larger entry for a word implies it was more common in articles from that topic. julia> dictionary[sortperm(centroids[7],rev=true)] 1000-element Array{ASCIIString, 1}: "wind" "ice" "temperature" ... (a) For each of k = 2, k = 5, and k = 10 run k-means twice, and plot J (vertically) versus iteration (horizontally) for the two runs on the same plot. Create your plot by passing a vector containing the value of J at each iteration to PyPlot's plot function. Comment briefly on your results
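For part (a), here is a minimal sketch of one way to produce the requested plots, assuming the kmeans interface described above and the PyPlot package; it is an illustration under those assumptions, not the required solution.

    include("wikipedia_corpus.jl")
    include("kmeans.jl")
    using Kmeans
    using PyPlot

    for k in [2, 5, 10]
        figure()
        for run in 1:2
            # each call starts from a fresh random initialization, so the two
            # runs may converge to different final values of J
            centroids, labels, j_hist = kmeans(article_histograms, k)
            plot(j_hist, label="run $run")
        end
        xlabel("iteration")
        ylabel("J")
        title("k = $k")
        legend()
    end

Each call to plot receives the vector of J values for one run, so iteration number appears on the horizontal axis and J on the vertical axis, with both runs for a given k overlaid on the same figure.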
