Over the last few days the hashtag #7FaveGames has been trending on
Twitter. As you can probably guess, you’re supposed to list your top seven games of
all time and put them out to the world to be judged!
Of course, I jumped straight into this trend and submitted my own faves - if
you’re interested, my choices are in the banner of this post.
Whilst trawling my own gaming history, I started to
wonder about the most popular games overall. A great excuse to
pull out the R analytics toolkit and do some amateur data science.
The first step was to gather as many #7FaveGames tweets as
possible without exceeding the rate limit set by the Twitter API (180
requests per 15 minutes).
I decided to store all tweets with their metadata in a CSV file. This
makes the data much easier to explore and reload - specifically for
the numerous cases where I accidentally trashed my in-memory copy.
Initially we want to set some parameters and pull the first
batch. This initial chunk is used to create the CSV file to which we
will append during the rest of the process.
# define batch size
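As a rough sketch of that create-then-append pattern: the batch size, file path and sample rows below are all placeholders, and the real first batch would come from the Twitter API (for example via the twitteR package's searchTwitter and twListToDF, which is an assumption on my part).

```r
# Hypothetical parameters; none of these values come from the original run.
batch_size <- 100
csv_path <- file.path(tempdir(), "7favegames.csv")

# Stand-in for the first API call. With twitteR (an assumption) this might be
# twListToDF(searchTwitter("#7FaveGames", n = batch_size)).
first_batch <- data.frame(
  id = c("1003", "1002", "1001"),
  text = c("Zelda, Mario, Doom", "Pokemon, Tetris", "Dark Souls"),
  stringsAsFactors = FALSE
)

# The first write creates the file along with its header row.
write.table(first_batch, csv_path, sep = ",", qmethod = "double",
            row.names = FALSE, col.names = TRUE)

# Later batches are appended without repeating the header.
append_batch <- function(batch, path) {
  write.table(batch, path, sep = ",", qmethod = "double",
              row.names = FALSE, col.names = FALSE, append = TRUE)
}
```

When reloading, reading the id column as character (colClasses = "character") avoids losing precision on very large tweet ids.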
Next we just paginate, in descending chronological order, through
the rest of the data using the maxID key to shift the query
window. The last id in the current batch is used as the maxID in the
next to ensure a contiguous set of tweets.
We just keep looping through until we hit a batch smaller than
expected, or an NA id.
Admittedly it’s a pretty basic approach, but it does the job … in a
few short hours …
# while we still have tweets to read, pull batches
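The loop logic can be sketched without touching the network. Here fetch_batch is a hypothetical stand-in that serves ids from an in-memory vector; a real version might wrap twitteR's searchTwitter with its maxID argument (again an assumption). Because maxID is inclusive in this simulation, the boundary tweet comes back at the top of the next batch and is dropped.

```r
# Simulated timeline of tweet ids, newest first (stand-in for the real API).
all_ids <- 250:1

# Hypothetical fetcher: up to n ids that are <= max_id, newest first.
fetch_batch <- function(n, max_id = NULL) {
  pool <- if (is.null(max_id)) all_ids else all_ids[all_ids <= max_id]
  head(pool, n)
}

batch_size <- 100
batch <- fetch_batch(batch_size)
collected <- batch

repeat {
  last_id <- tail(collected, 1)
  # Stop on a short batch (nothing older left) or an unusable id.
  if (length(batch) < batch_size || is.na(last_id)) break
  batch <- fetch_batch(batch_size, max_id = last_id)
  # The boundary id is returned again, so skip it when storing.
  collected <- c(collected, batch[batch != last_id])
}
```

In the real run each batch would be appended to the CSV inside the loop rather than held in memory.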
First we import a bunch of libraries that we will need.
- tm and SnowballC for text mining and transformation
- RColorBrewer to pick a color palette
- wordcloud to plot the cloud itself
We read in the data we collected in the last step. For some unknown
reason the data comes back in an odd encoding, so we need to force it
to UTF-8 to stop the “tm” algorithms crashing out halfway through.
data <- read.csv("7favegames.csv", row.names=NULL, stringsAsFactors = FALSE)
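One way to force the text to UTF-8 in base R is iconv. This is only a sketch: the sample strings are made up, and dropping unconvertible bytes with sub = "" is my assumption about what is acceptable here. If the real source encoding were known (say latin1), passing it as from would preserve the accented characters instead of discarding them.

```r
# Build a sample containing one byte (0xE9) that is not valid UTF-8 on its own.
bad <- rawToChar(as.raw(c(0x50, 0x6f, 0x6b, 0xe9, 0x6d, 0x6f, 0x6e)))
tweets <- c(bad, "Zelda")

# Re-encode as UTF-8; sub = "" drops any byte that cannot be converted.
clean <- iconv(tweets, from = "UTF-8", to = "UTF-8", sub = "")

# validUTF8() reports which strings are now safe to hand to tm.
validUTF8(clean)
```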
Many of the video game titles include roman numerals (e.g. Final
Fantasy X) or accented characters (e.g. Pokémon). We want to simplify
these down to a common form so we don’t end up with duplicates
(such as Pokémon vs Pokemon).
The roman numeral converter is pretty primitive but, I think, fine for this
purpose. The odd ordering of the numbers is so that we’re greedy in
our matching of independent roman numerals (i.e. we don’t match a
substring of a numeral and corrupt the value).
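A minimal sketch of both simplifications, not the original converter: the numeral range (II to X), the word-boundary anchors and the short accent table are all illustrative choices. Sorting the numerals longest-first is what keeps "VIII" from being consumed as a shorter numeral.

```r
# Illustrative numeral table, sorted so the longest numerals are tried first.
roman <- c(II = 2, III = 3, IV = 4, V = 5, VI = 6, VII = 7, VIII = 8, IX = 9, X = 10)
roman <- roman[order(nchar(names(roman)), decreasing = TRUE)]

replace_roman <- function(text) {
  for (numeral in names(roman)) {
    # \\b keeps the match to standalone numerals, so "VIVA" is left alone.
    pattern <- paste0("\\b", numeral, "\\b")
    text <- gsub(pattern, as.character(roman[[numeral]]), text)
  }
  text
}

# Small, hand-picked accent table; a real run would need more entries.
accented <- c("\u00e9", "\u00e8", "\u00f6")
plain <- c("e", "e", "o")

strip_accents <- function(text) {
  for (i in seq_along(accented)) {
    text <- gsub(accented[i], plain[i], text, fixed = TRUE)
  }
  text
}

replace_roman("Final Fantasy VIII")
strip_accents("Pok\u00e9mon")
```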
This is where the bulk of the transformations occur. We first create a
text corpus from our tweets and …
- Convert to lower case
- Remove any punctuation
- Remove stop words such as “the” and “and”, plus some specific to tweets (“rt” and “lol”) and to video games (e.g. “super”, which appears in so many SNES titles).
corpus <- Corpus(VectorSource(tweets))
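The three steps map onto tm's standard transformations roughly as follows. This is a sketch that assumes the tm package is installed; the clean_corpus wrapper and the extra stop word list are illustrative rather than the exact ones I used.

```r
# Hypothetical helper wrapping the three transformations listed above.
clean_corpus <- function(tweets, extra_stopwords = c("rt", "lol", "super")) {
  corpus <- tm::Corpus(tm::VectorSource(tweets))
  # 1. Lower case, so "Mario" and "mario" count as the same word.
  corpus <- tm::tm_map(corpus, tm::content_transformer(tolower))
  # 2. Strip punctuation such as the commas between listed games.
  corpus <- tm::tm_map(corpus, tm::removePunctuation)
  # 3. Drop English stop words plus the tweet- and game-specific extras.
  tm::tm_map(corpus, tm::removeWords,
             c(tm::stopwords("english"), extra_stopwords))
}
```

Lower-casing has to come before stop word removal, since the stop word list is all lower case.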
Finally, now that we have our simplified corpus, we can plot our word
cloud.
We choose a fairly solid color scheme, avoiding too many of the weaker
shades that would be hard to read. You can see the full selection of
palettes by running display.brewer.all().
# plot word cloud
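A sketch of the plotting step, assuming the RColorBrewer and wordcloud packages are installed. The "Blues" palette, the cut-off at the fifth shade and the 100-word cap are illustrative picks, not necessarily the settings behind the image in this post.

```r
# Keep only the darker half of a sequential palette so every word stays legible.
cloud_palette <- function() {
  RColorBrewer::brewer.pal(9, "Blues")[5:9]
}

# Hypothetical wrapper: draws the cloud with the most frequent words placed first.
plot_cloud <- function(corpus, max_words = 100) {
  wordcloud::wordcloud(corpus, max.words = max_words,
                       random.order = FALSE, colors = cloud_palette())
}
```

With random.order = FALSE the most frequent words are drawn first and end up in the middle of the cloud.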
And here’s the end result:
Clearly Mario and Pokemon are the most popular - no real surprise there -
but it’s interesting to see other titles such as Dark Souls, BioShock
and Bayonetta pop up. A couple of newer titles make an appearance too,
such as Overwatch and Rocket League, but this may be partly down to
recency bias.
Admittedly, the approach does have some fundamental flaws, including:
- Not accounting for the difference between a single game and a
series. For example, the Final Fantasy saga will get over-represented
due to the sheer number of releases.
- Titles with more complicated names, or names prone to acronym, will
get under-represented due to the algorithm not being able to tie
- Tweets that repeat a single game over and over will add unfairly to
But regardless, it’s a fun experiment to run, and does give us some
idea of the relative popularity of the various titles.