What are the most common words in TV shows?

Nowadays, there are so many great shows on television or available for streaming. After binge-watching the first season of Jessica Jones, I knew that I had to post something about this topic. I have to admit that I have never spent much time on text analysis, so I decided to start with the basics. The first thing I did was create simple word clouds based on the subtitles of Jessica Jones, The Blacklist, and The Big Bang Theory (TBBT). Word clouds are an easy way to visualize how frequently certain words are used in a text or TV show.
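To make the counting step concrete, here is a minimal sketch of how word frequencies could be computed from subtitle text. It is written in Python for illustration and is not the R code behind the word clouds in this post. The sample lines, the tiny stop-word list, and the function name `word_frequencies` are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical subtitle lines, made up for illustration only.
subtitle_lines = [
    "I need you to find her.",
    "Find her before they do.",
    "You need to trust me.",
]

# A tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"i", "you", "to", "the", "a", "they", "do", "me", "her", "before"}


def word_frequencies(lines, stop_words=STOP_WORDS):
    """Count how often each non-stop-word appears across all lines."""
    counts = Counter()
    for line in lines:
        # Lowercase the line and keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", line.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


frequencies = word_frequencies(subtitle_lines)
print(frequencies.most_common(3))
```

In a word cloud, counts like these would determine the size of each word: the higher the count, the bigger the word.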

Besides this basic type of word cloud, I also created a slightly different version. For this one, I did not just count the words but also looked at how close they are to a character’s name. This means that the more frequently a word is used in combination with a character’s name, the bigger it is in the word cloud.

I created all those word clouds with the R package wordcloud, which makes this job easy. In addition, I used the findAssocs() function from the tm package to calculate how often a word is used close to a character’s name. For the second type of word cloud, I used the correlation coefficients from this function as a weighting factor.
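The idea of scoring words by their correlation with a character's name can be sketched as follows. This is a rough Python illustration of the concept, not the tm implementation and not the code used for this post. It assumes that each subtitle line is treated as one document and that the score is the Pearson correlation between per-line counts of the name and per-line counts of each other word. The sample lines and the names `pearson`, `associations`, and `min_corr` are hypothetical.

```python
import math
import re
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def associations(lines, name, min_corr=0.0):
    """Score every word by how strongly its per-line counts follow the name's."""
    # One word-count table per line (assumption: one line = one document).
    docs = [Counter(re.findall(r"[a-z']+", line.lower())) for line in lines]
    vocabulary = set().union(*docs) - {name}
    name_counts = [doc[name] for doc in docs]
    scores = {}
    for term in vocabulary:
        term_counts = [doc[term] for doc in docs]
        corr = pearson(name_counts, term_counts)
        # Keep only words at or above the chosen lower correlation limit.
        if corr >= min_corr:
            scores[term] = corr
    return scores


# Hypothetical lines, made up for illustration only.
lines = [
    "jessica drinks whiskey",
    "jessica drinks alone",
    "the city sleeps",
    "trish calls the city",
]
scores = associations(lines, "jessica")
for term, corr in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term}: {corr:.2f}")
```

Words that appear only in lines mentioning the name get a high score, while words that appear only in other lines get a negative score and are dropped. Scores like these could then serve as the weights that decide word size in the second type of word cloud.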

Before doing that, I replaced all the different versions of a name (e.g., Lizzy, Liz, Keen) with the name you can see in the center of the word clouds. In case you are wondering why the clouds do not always contain complete words, it is because of a process called stemming, which reduces words to their word stem. The idea behind it is to group the different forms of a word together instead of treating every single form as a separate term.
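Both preprocessing steps can be illustrated with a short Python sketch. It is only a toy version: the suffix-stripping function below is a crude stand-in for a real stemming algorithm, and it is not the stemmer used for this post. The aliases come from the example above, but choosing "liz" as the common name, as well as the names `ALIASES`, `SUFFIXES`, `normalize_name`, `crude_stem`, and `preprocess`, are assumptions made for illustration.

```python
import re

# The aliases come from the post; picking "liz" as the common form is an
# assumption made for this example.
ALIASES = {"lizzy": "liz", "keen": "liz"}

# Suffixes tried in order; a toy rule set, far simpler than a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def normalize_name(word, aliases=ALIASES):
    """Map every known version of a name to one common form."""
    return aliases.get(word, word)


def crude_stem(word, suffixes=SUFFIXES, min_stem_length=3):
    """Strip the first matching suffix, unless the remaining stem gets too short."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


def preprocess(line):
    """Lowercase and split a line, unify names, then stem each word."""
    words = re.findall(r"[a-z']+", line.lower())
    return [crude_stem(normalize_name(w)) for w in words]


print(preprocess("Lizzy watched while Keen was watching"))
# prints ['liz', 'watch', 'while', 'liz', 'was', 'watch']
```

After this step, "watched" and "watching" are counted as the same term, and both name variants count toward the same character. Rules like these can also cut a word down to something that is not a complete word, which is why some entries in the clouds look truncated.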

I hope you like this post. Please feel free to contact me if you have any questions or suggestions.
