#——————————————

#——————————————

############## Homework #6 ##############

#——————————————

#——————————————

# Directions: cluster a sample of Amazon product

# reviews. Your sample will cover 250 automotive products drawn

# from a population of over 20,000 Amazon product reviews, with

# corresponding product information.

#——————————————

######### Preliminary Code #########

#——————————————

library(tm)

library(cluster)

#——————————————

## Load Functions & Data (.RData File)

#——————————————

# load the data used in this HW

#——————————————

######### Solutions #########

#——————————————

## 1. First, learn about the objects that you loaded into your

# workspace. Next, set your birthday seed before running the code in the

# answer section. In words, describe what this code is doing.

#——————————————

products <- sample(unique(autorevs$asin), 250, replace=FALSE)

docs <- autorevs$doc_id[autorevs$asin %in% products]

#
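# Illustration only: a minimal sketch of what the two lines above do, run on a
# made-up data frame. toy_revs and its values are hypothetical stand-ins for
# autorevs; only the column names asin and doc_id come from the code above.
toy_revs <- data.frame(
  doc_id = paste0("d", 1:8),
  asin   = c("A", "A", "B", "C", "C", "C", "D", "E"),
  stringsAsFactors = FALSE
)
set.seed(101)  # placeholder seed; the assignment asks for your birthday
# Draw 3 distinct product ids (no product can be picked twice)
toy_products <- sample(unique(toy_revs$asin), 3, replace = FALSE)
# Keep the ids of every review written about one of the sampled products
toy_docs <- toy_revs$doc_id[toy_revs$asin %in% toy_products]
length(toy_products)  # number of sampled products
length(toy_docs)      # number of reviews; at least one per sampled product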

#——————————————

## 2. Next, create a TDM and dataframe subsets based on

# the docs and products vectors created in step 1.

# How many documents are in your subsets?

#——————————————
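# Illustration only: one possible way to subset, sketched on made-up objects.
# toy_tdm is a plain base-R matrix (terms in rows, doc ids as column names)
# standing in for as.matrix() of a tm TermDocumentMatrix; all toy_* names and
# counts are hypothetical, and the real objects may be structured differently.
toy_revs <- data.frame(
  doc_id = paste0("d", 1:6),
  asin   = c("A", "A", "B", "C", "C", "D"),
  stringsAsFactors = FALSE
)
toy_tdm <- matrix(1:18, nrow = 3,
                  dimnames = list(c("oil", "filter", "wiper"), toy_revs$doc_id))
toy_products <- c("A", "C")
toy_docs <- toy_revs$doc_id[toy_revs$asin %in% toy_products]
# Keep only the columns (documents) and rows (reviews) that were sampled
toy_tdm_sub <- toy_tdm[, colnames(toy_tdm) %in% toy_docs, drop = FALSE]
toy_df_sub  <- toy_revs[toy_revs$doc_id %in% toy_docs, ]
ncol(toy_tdm_sub)  # documents in the TDM subset
nrow(toy_df_sub)   # documents in the dataframe subset; the two should agree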

#——————————————

## 3. We will first cluster the review text to find clusters of terms.

# Begin by creating the distance matrix. Use the dist() function to create

# a distance matrix for the automotive review terms named rev_tdist.

# Then, perform hierarchical clustering using Ward’s Method.

#——————————————
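# Illustration only: a minimal sketch of the two steps on a made-up matrix.
# toy_tdm (random counts, terms in rows) is a hypothetical stand-in for the
# TDM subset converted with as.matrix(). Because terms are the rows, dist()
# returns term-to-term distances. "ward.D2" is the hclust() option that applies
# Ward's criterion to unsquared Euclidean distances; "ward.D" also exists.
set.seed(1)
toy_tdm <- matrix(rpois(12 * 6, lambda = 2), nrow = 12,
                  dimnames = list(paste0("term", 1:12), paste0("d", 1:6)))
toy_tdist <- dist(toy_tdm, method = "euclidean")  # 12 x 12 term distances
toy_hc <- hclust(toy_tdist, method = "ward.D2")   # agglomerative tree of terms
plot(toy_hc, main = "Toy term dendrogram", xlab = "", sub = "")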

#——————————————

## 4. Evaluate the best number of clusters, k, using plots of the average

# silhouette width and within-cluster SSE across k values to guide your choice.

# Consider k values up to 15. Based on your plots, how many clusters would you choose?

#——————————————
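# Illustration only: one way to build both diagnostic curves by hand, on
# made-up data. All toy_* names are hypothetical; the course materials may
# supply their own helper for these plots, which this sketch does not use.
library(cluster)
set.seed(1)
toy_tdm <- matrix(rpois(30 * 6, lambda = 2), nrow = 30,
                  dimnames = list(paste0("term", 1:30), paste0("d", 1:6)))
toy_tdist <- dist(toy_tdm)
toy_hc <- hclust(toy_tdist, method = "ward.D2")
toy_ks <- 2:15
# Average silhouette width for each k (higher is better)
toy_sil <- sapply(toy_ks, function(k) {
  cl <- cutree(toy_hc, k = k)
  mean(silhouette(cl, toy_tdist)[, "sil_width"])
})
# Within-cluster SSE for each k: squared deviations from each cluster's centroid
toy_sse <- sapply(toy_ks, function(k) {
  cl <- cutree(toy_hc, k = k)
  sum(sapply(split(as.data.frame(toy_tdm), cl),
             function(g) sum(scale(g, scale = FALSE)^2)))
})
plot(toy_ks, toy_sil, type = "b", xlab = "k", ylab = "Average silhouette width")
plot(toy_ks, toy_sse, type = "b", xlab = "k", ylab = "Within-cluster SSE")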

#——————————————

## 5. Using your chosen k, cut the tree into clusters and plot the

# distribution of terms. Are the terms evenly distributed across clusters?

#——————————————
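# Illustration only: a sketch of counting and plotting how many terms fall in
# each cluster, on made-up data. k = 3 is an arbitrary placeholder rather than
# a recommended value, and all toy_* names are hypothetical.
set.seed(1)
toy_tdm <- matrix(rpois(12 * 6, lambda = 2), nrow = 12,
                  dimnames = list(paste0("term", 1:12), paste0("d", 1:6)))
toy_hc <- hclust(dist(toy_tdm), method = "ward.D2")
toy_term_clus <- cutree(toy_hc, k = 3)  # named vector: term -> cluster id
toy_counts <- table(toy_term_clus)      # number of terms per cluster
barplot(toy_counts, xlab = "Cluster", ylab = "Number of terms")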

#——————————————

## 6. Choose one of the clusters and view the terms in that cluster. Do

# they appear to be related? Explain.

#——————————————
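# Illustration only: a sketch of listing the terms assigned to one cluster.
# cutree() returns a vector named by the row labels of the clustered matrix,
# so the names can be filtered directly. Data, k = 3, and the choice of
# cluster 1 are all hypothetical placeholders.
set.seed(1)
toy_tdm <- matrix(rpois(12 * 6, lambda = 2), nrow = 12,
                  dimnames = list(paste0("term", 1:12), paste0("d", 1:6)))
toy_term_clus <- cutree(hclust(dist(toy_tdm), method = "ward.D2"), k = 3)
toy_cluster1_terms <- names(toy_term_clus)[toy_term_clus == 1]
toy_cluster1_terms  # inspect these terms and judge whether they share a theme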

#——————————————

## 7. Next, we will apply kmeans clustering to the documents. First, use the

# plot of the average silhouette width across k values up to 25 to choose

# the optimal k.

#——————————————
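# Illustration only: one way to trace the average silhouette width for kmeans
# across k, on made-up data. toy_dtm (documents in rows) is a hypothetical
# stand-in for the transposed TDM subset; nstart = 5 and the seed are
# arbitrary choices for this sketch, not values taken from the assignment.
library(cluster)
set.seed(1)
toy_dtm <- matrix(rpois(60 * 8, lambda = rep(c(2, 4, 6), each = 20)), nrow = 60,
                  dimnames = list(paste0("d", 1:60), paste0("term", 1:8)))
toy_ddist <- dist(toy_dtm)  # document-to-document distances
toy_ks <- 2:25
toy_doc_sil <- sapply(toy_ks, function(k) {
  km <- kmeans(toy_dtm, centers = k, nstart = 5)
  mean(silhouette(km$cluster, toy_ddist)[, "sil_width"])
})
plot(toy_ks, toy_doc_sil, type = "b", xlab = "k",
     ylab = "Average silhouette width")
toy_best_k <- toy_ks[which.max(toy_doc_sil)]  # k with the widest average silhouette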

#——————————————

## 8. Use your choice of k from answer 7 and perform kmeans clustering. Plot

# the distribution of documents. Then, use the doc_clus_overview() function

# to view the cluster size and the most important terms in each cluster.

# Hint: don’t forget to apply the function to the DTM, not the TDM!

# Hint 2: don’t forget to set your seed!

#——————————————
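# Illustration only: a sketch of the kmeans step on made-up data. The
# signature of doc_clus_overview() is not shown in this file, so it is not
# called here; the last lines are a hand-rolled, hypothetical stand-in that
# reports cluster sizes and the highest-mean terms per cluster. k = 3, the
# seeds, and all toy_* names are placeholders.
set.seed(1)
toy_dtm <- matrix(rpois(30 * 8, lambda = rep(c(2, 4, 6), each = 10)), nrow = 30,
                  dimnames = list(paste0("d", 1:30), paste0("term", 1:8)))
set.seed(101)  # placeholder; kmeans starts are random, so set your own seed
toy_km <- kmeans(toy_dtm, centers = 3, nstart = 10)
barplot(table(toy_km$cluster), xlab = "Cluster", ylab = "Number of documents")
toy_km$size  # documents per cluster
# Top 3 terms per cluster, ranked by the cluster centroid's mean term count
toy_top_terms <- apply(toy_km$centers, 1,
                       function(ctr) names(sort(ctr, decreasing = TRUE))[1:3])
toy_top_terms  # one column per cluster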

#——————————————

## 9. Now that we know a little more about the naturally existing clusters

# of terms and documents, explore your dataframe subset further. Do

# any variables help you to understand the clustering solution?

# Which ones? Explain.

#——————————————
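# Illustration only: a sketch of comparing a dataframe variable across
# document clusters. The rating column and every value below are hypothetical;
# substitute whichever variables your dataframe subset actually contains.
toy_df <- data.frame(
  doc_id  = paste0("d", 1:6),
  rating  = c(5, 4, 1, 2, 5, 1),
  cluster = c(1, 1, 2, 2, 1, 2)
)
table(toy_df$cluster, toy_df$rating)  # cross-tabulate cluster by rating
toy_means <- tapply(toy_df$rating, toy_df$cluster, mean)  # mean rating per cluster
toy_means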