Question:

Topic discovery via k-means. In this problem you will use k-means to cluster 300 Wikipedia articles selected from 5 broad groups of topics. The Julia file wikipedia_corpus.jl contains the histograms as a list of 300 1000-vectors in the variable article_histograms. It also provides the list of article titles in article_titles and the list of the 1000 words used to create the histograms in dictionary.

The file kmeans.jl provides a Julia implementation of the k-means algorithm in the function kmeans. The kmeans function accepts a list of vectors to cluster along with the number of clusters, k, and returns three things: the centroids as a list of vectors, a list containing the index of each vector's closest centroid, and a list of the value of J after each iteration of k-means. Each time the function kmeans is invoked it initializes the centroids by randomly assigning the data points to k groups and taking the k group means as the representatives. (This means that if you run kmeans twice with the same data, you might get different results.)

For example, here is how to run k-means with k = 8 and find the 30th article's centroid:

    include("wikipedia_corpus.jl")
    include("kmeans.jl")
    using Kmeans

    centroids, labels, j_hist = kmeans(article_histograms, 8)
    centroids[labels[30]]

The list labels contains the index of each vector's closest centroid, so if the 30th entry in labels is 7, then the 30th vector's closest centroid is the 7th entry in centroids.

There are many ways to explore your results. For example, you could print the titles of all articles in a cluster:

    julia> article_titles[labels .== 7]
    16-element Array{UTF8String, 1}:
     "Anemometer"
     "Black ice"
     "Freezing rain"
     ...

Alternatively, you could find a topic's most common words by ordering dictionary by the size of its centroid's entries. A larger entry for a word implies it was more common in articles from that topic:

    julia> dictionary[sortperm(centroids[7], rev=true)]
    1000-element Array{ASCIIString, 1}:
     "wind"
     "ice"
     "temperature"
     ...

(a) For each of k = 2, k = 5, and k = 10, run k-means twice and plot J (vertically) versus iteration (horizontally) for the two runs on the same plot. Create your plot by passing a vector containing the value of J at each iteration to PyPlot's plot function. Comment briefly on your results.
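For intuition, here is a minimal sketch of the random initialization described above: each data point is assigned to one of k groups at random, and the group means become the starting representatives. The function name random_init is hypothetical and is not part of kmeans.jl.

    # Hypothetical sketch of the initialization described in the problem:
    # assign each point to one of k groups at random and use the group means
    # as the initial centroids (assumes every group receives at least one point).
    function random_init(x, k)
        assignment = rand(1:k, length(x))        # random group index for each point
        return [sum(x[assignment .== j]) / sum(assignment .== j) for j in 1:k]
    end

Because this assignment is random, repeated calls produce different starting centroids, which is why repeated runs of kmeans can converge to different clusterings.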
include("wikipedia_corpus.ji") include("kmeans.jl") using Kmeans centroids, labels, j_hist = kmeans(article_histograms, 8) centroids[labels[30]] The list labels contains the index of each vector's closest centroid, so if the 30th entry in labels is 7, then the the 30th vector's closest centroid is the 7th entry in centroids. There are many ways to explore your results. For example, you could print the titles of all articles in a cluster. 11 julia> article_titles[labels . == 7] 16-element Array{UTF8String, 1}: "Anemometer" "Black ice" "Freezing rain" ... Alternatively, you could find a topic's most common words by ordering dictionary by the size of its centroid's entries. A larger entry for a word implies it was more common in articles from that topic. julia> dictionary[sortperm(centroids[7],rev=true)] 1000-element Array{ASCIIString, 1}: "wind" "ice" "temperature" ... (a) For each of k = 2, k = 5, and k = 10 run k-means twice, and plot J (vertically) versus iteration (horizontally) for the two runs on the same plot. Create your plot by passing a vector containing the value of J at each iteration to PyPlot's plot function. Comment briefly on your results
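For part (a), here is a minimal sketch of one way to produce the requested plots, assuming the kmeans interface described above and the PyPlot package; it is an illustration under those assumptions, not the required solution.

    include("wikipedia_corpus.jl")
    include("kmeans.jl")
    using Kmeans
    using PyPlot

    for k in [2, 5, 10]
        figure()
        for run in 1:2
            # each call starts from a fresh random initialization, so the two
            # runs may converge to different final values of J
            centroids, labels, j_hist = kmeans(article_histograms, k)
            plot(j_hist, label="run $run")
        end
        xlabel("iteration")
        ylabel("J")
        title("k = $k")
        legend()
    end

Each call to plot receives the vector of J values for one run, so iteration number appears on the horizontal axis and J on the vertical axis, with both runs for a given k overlaid on the same figure.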
