#7FaveGames - Creating a Word Cloud from 346,859 Tweets

Over the last few days the hashtag #7FaveGames has been trending on
twitter. As you can probably guess you’re supposed to list your top seven games of
all time and put them out to the world to be judged!

Of course, I jumped straight into this trend and submitted my own faves - if
you’re interested, my choices are in the banner of this post.

Whilst trawling my own gaming history, I started to
wonder about the most popular games overall. A great excuse to
pull out the R analytics toolkit and do some amateur data science.

Gathering all those Tweets

The first step was to gather as many #7FaveGames tweets as
possible without crossing the boundaries set by the twitter API (180
requests per 15 minutes).

I decided to store all tweets with their metadata in a CSV file. This
makes the data much easier to explore and reload - specifically for
the numerous cases where I accidently trashed my in-memory copy.

Initialize CSV Store

Initially we want to set some parameters and pull the first
batch. This initial chunk is used to create the CSV file to which we
will append during the rest of the process.

1
2
3
4
5
6
7
8
# define batch size
batch <- 500
sleepPeriod <- 10
# pull first batch and initialize output file
tweets <- searchTwitter("#7FaveGames", n=batch, retryOnRateLimit=batch)
parsed <- do.call("rbind.fill", lapply(tweets, as.data.frame))
write.table(parsed, file="7favegames.csv", sep=",", append=TRUE, col.names=TRUE)

Grab the Rest of the Data

Next we just paginate, in descending chronological order, through
the rest of the data using the maxID key to shift the query
window. The last id in the current batch is used as the maxID in the
next to ensure a contiguous set of tweets.

We just keep looping through until we hit a batch smaller than
expected, or a NA id.

Admittedly it’s a pretty basic approach, but it does the job … in a
few short hours …

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# while we still have tweets to read, pull batches
# (with respect to twitter so I don't get kicked)
pull <- TRUE
while (pull) {
# grab the last id of the previous batch
id <- parsed$id[batch]
print(c("Getting batch with id ", id))
# perform a paginated search on twitter
tweets <- searchTwitter("#7FaveGames", n=batch+1, retryOnRateLimit=batch, maxID=id)
parsed <- do.call("rbind.fill", lapply(tweets, as.data.frame))
# strip the first record to avoid duplicates, as it was the last in the previous batch
parsed <- parsed[-(1:1),]
write.table(parsed, file="7favegames.csv", sep=",", append=TRUE, col.names=FALSE)
pull <- (nrow(parsed) > 1) || (!is.na(id))
# just pause for a moment to avoid hammering the twitter api
# TBH the 'retryOnRateLimit' would probably do the job better anyway
Sys.sleep(sleepPeriod)
}

Building a Word Cloud

Which Libraries?

First we import a bunch of libraries that we will need.

  • tm and SnowballC for text mining and transformation
  • RColorBrewer to pick a color palette
  • wordcloud to plot the cloud itself
1
2
3
4
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

Change Encoding

We read in the data we collected in the last step. For some unknown reason the
data is coming back in an odd
encoding. We need to force it to UTF-8 to avoid the “tm” algorithms
crashing out half-way through.

1
2
3
data <- read.csv("7favegames.csv", row.names=NULL, stringsAsFactors = FALSE)
tweets <- data$text
tweets <- enc2utf8(tweets)

Convert “Odd” Characters and Numbers

Many of the video game titles include roman numerals (e.g. Final
Fantasy X) or accented characters (e.g. Pokémon). We want to simplify
these down to a common form so we don’t end up with duplicates
(such as Pokémon vs Pokemon).

The roman numeral converter is pretty primitive but, I think, fine for this
purpose. The odd ordering of the numbers is so that we’re greedy in
our matching of independent roman numerals (i.e. we don’t match a
substring of a numeral and corrupt the value).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
replaceRoman <- function(vect) {
numerals <- c("xviii", "xvii", "xiii", "viii", "xiv", "xvi", "xix", "vii", "iii", "xii", "ii", "iv", "vi", "ix", "xi", "xv", "xx", "i", "v", "x")
numbers <- c(18, 17, 13, 8, 14, 16, 19, 7, 3, 12, 2, 4, 6, 9, 11, 15, 20, 1, 5, 10)
for (i in c(1:length(numerals))) {
vect <- gsub(paste("\\b", numerals[i], "\\b", sep=""), numbers[i], vect)
vect <- gsub(paste("\\b", toupper(numerals[i]), "\\b", sep=""), numbers[i], vect)
}
return(vect)
}
tweets <- replaceRoman(tweets)
# remove accented characters
tweets <- iconv(tweets, to='ASCII//TRANSLIT')

Perform Corpus Transformations

This is where the bulk of the transformations occur. We first create a
text corpus from our tweets and …

  • Convert to lower case
  • Remove any punctuation
  • Remove stop words, such as “the” and “and”. Or those specific to tweets(“rt” and “lol”), and video games (e.g. “super” to avoid common words in all those SNES games).
1
2
3
4
5
corpus <- Corpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c("7favegames", "amp", "rt", "lol", "follow", "twitter", "series", "new", "super", stopwords("english")))

Plot Word Cloud

Finally, now that we have our simplified corpus, we can plot our word
cloud.

We choose a fairly solid color scheme, avoiding too many of the weaker
shades that would be hard to read. You can see the full selection of palettes by running display.brewer.all()

1
2
3
# plot word cloud
pal <- brewer.pal(8,"Dark2")
wordcloud(corpus, max.words = 200, random.order=TRUE, scale=c(5, .5), colors = pal)

The End Result

And here’s the end result:

7favegames Word Cloud

Clearly Mario and Pokemon are the most popular - no real surprise there -
but it’s interesting to see other titles such as Dark Souls, Bioshock
and Bayonetta pop up. There are a couple of newer titles that make an appearance too, such as
Overwatch and Rocket League, but this may be partially due to some
kind of recency bias.

Admittedly the model does have some fundamental flaws though, including:

  • Not accounting for the difference between a single game and
    series. For example the Final Fantasy saga will get over-represented
    due to the sheer number of releases.
  • Titles with more complicated names, or names prone to acronym, will
    get under-represented due to the algorithm not being able to tie
    those together.
  • Tweets that repeat a single game over and over will add unfairly to
    its weight.

But regardless, it’s a fun experiment to run, and does give us some
idea of the relative popularity of the various titles.

Share Comments