June 12, 2014

How to Find the Most Popular Elements of a Dataset

I needed to find the most common values in a column of an R data frame. A while back, I came across the tm package for text mining. It turns out there is a paper on the package that illustrates how to use it, and it covers practically what I needed. A few tweaks later, it worked. All that's left is to share the code with the lot of you:

library(tm)

# yesterday is a data frame; its Link column holds the values to count
corpus <- Corpus(VectorSource(yesterday$Link))

# one row per term, one column per document
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)

# this was the key line: rowSums totals each term's count across all documents
v <- sort(rowSums(m), decreasing = TRUE)

# print the three most frequent terms, one per line
# (the stray leading space in the earlier output came from cat()'s default
# " " separator, not from names(v), so no sub() cleanup is needed)
writeLines(head(names(v), 3))
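
If the column holds whole values rather than free text, the same counting can be sketched in base R without tm. This is not the approach from the paper, just a minimal alternative under stated assumptions: the `yesterday` data frame below is a made-up stand-in with a `Link` column, and `table()` counts each value as a whole instead of splitting it into terms.

```r
# Hypothetical stand-in for the post's `yesterday` data frame
yesterday <- data.frame(
  Link = c("a.com", "b.com", "a.com", "c.com", "a.com",
           "b.com", "d.com", "a.com", "b.com", "c.com"),
  stringsAsFactors = FALSE
)

# Count how often each distinct value occurs, then sort from most to least frequent
counts <- sort(table(yesterday$Link), decreasing = TRUE)

# Keep the names of the three most frequent values
top3 <- head(names(counts), 3)

# Print them one per line
writeLines(top3)
```

Because nothing is tokenized here, this fits columns such as links or IDs; for free text where individual words matter, the tm version above is the better fit.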
