Analyzing Subtitles - Chick Flick or Guy Movie?

Romantic comedy or action movie? While this cliché certainly oversimplifies the differences in movie preferences between men and women, it is common knowledge that certain movie genres primarily target one gender. The colloquial term chick flick refers to movies targeted to a female audience whereas guy movies are mainly aimed at male viewers. For example, a study found that female viewers have a stronger preference for movies with happier themes than their male counterparts (Banerjee et al., 2008).

For this post, my goal is to investigate how chick flicks and guy movies differ in the words spoken. As obtaining the original scripts is rather difficult, I decided to analyze the movies’ subtitles instead, which were available from Amazon Video. At the end of this post, you will find a brief description of how subtitles can be downloaded from the Amazon website. Since subtitles are protected by copyright, please understand that I cannot share the files that I used in this project.

Data

I downloaded the subtitles of the top 1,000 Amazon Video bestsellers (as of 7 March 2017) and gathered additional information from IMDb. To determine whether a film is targeted at a male or female audience, I used the share of its IMDb rating votes that were cast by women. In this sample, women account on average for about one-quarter of all votes with a stated gender.

The 25% of movies with the lowest share of female votes were defined as guy movies, while the 25% with the highest share were defined as chick flicks. Next, I removed movies with fewer than 1,000 total votes to ensure a fair comparison. Considering how vaguely these two terms are defined, I believe that this approach is sufficient for this context. The following table shows the top ten movies for each category:

#   Guy Movies (% female votes)                Chick Flicks (% female votes)
1   Never Back Down: No Surrender (3.8%)       Northanger Abbey (81.7%)
2   Justice League Dark (4.8%)                 Persuasion (72.3%)
3   Dragon Ball Z: Resurrection (5.1%)         The Last Song (65.4%)
4   Batman: Bad Blood (5.2%)                   The Princess Diaries 2 (64.1%)
5   Justice League: Throne of Atlantis (5.6%)  Something Borrowed (62.6%)
6   Kill Command (5.7%)                        Letters To Juliet (62.5%)
7   Justice League vs. Teen Titans (6.3%)      Ever After: A Cinderella Story (60.7%)
8   Headshot (6.4%)                            Little Women (59.7%)
9   We Were Soldiers (6.6%)                    27 Dresses (58.5%)
10  Predator (6.7%)                            The Young Victoria (58.2%)
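
The labeling rule described above can be sketched in a few lines. This is only an illustration in Python, not the code used for this post: the dictionary keys ('title', 'female_votes', 'male_votes', 'total_votes') are assumed names, and the quartile cutoffs are computed with the standard library's default method, which may differ slightly from other tools.

```python
from statistics import quantiles

def label_movies(movies, min_votes=1000):
    """Label movies by the quartile of their female vote share.

    `movies` is a list of dicts with the assumed keys 'title',
    'female_votes', 'male_votes', and 'total_votes'.
    """
    # Share of female votes among all votes with a stated gender.
    shares = {
        m["title"]: m["female_votes"] / (m["female_votes"] + m["male_votes"])
        for m in movies
    }
    # Quartile cutoffs over the whole sample.
    q1, _, q3 = quantiles(list(shares.values()), n=4)

    labels = {}
    for m in movies:
        if m["total_votes"] < min_votes:
            continue  # too few votes for a fair comparison
        share = shares[m["title"]]
        if share <= q1:
            labels[m["title"]] = "guy movie"
        elif share >= q3:
            labels[m["title"]] = "chick flick"
    return labels
```

Movies in the middle two quartiles receive no label and drop out of the comparison, as do movies below the vote threshold.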

Analysis

First, I used the ‘wordcloud’ package in R to plot a comparison word cloud, which highlights the words that were used much more heavily in one of the two categories than in the other. To remove movie-specific words, such as character names, I dropped all words that did not appear in the subtitles of at least three different movies.

Subtitles Wordcloud
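
The filter for movie-specific words comes down to counting, for each word, the number of movies it appears in. A minimal sketch in Python, assuming the subtitles are already tokenized into a dictionary that maps a movie title to its list of words (the function and parameter names are made up for illustration):

```python
from collections import Counter

def filter_rare_words(docs, min_movies=3):
    """Keep only words that occur in at least `min_movies` different movies.

    `docs` maps a movie title to its list of tokens.
    """
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per movie
    keep = {word for word, n in doc_freq.items() if n >= min_movies}
    return {
        title: [word for word in tokens if word in keep]
        for title, tokens in docs.items()
    }
```

A character name that is repeated hundreds of times in a single movie is still removed, because only the number of distinct movies counts.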

The more frequent use of profane language in guy movies and the higher use of positive words in chick flicks are in line with the assumption that women prefer relationship movies while men are more interested in thriller and action genres. A quick sentiment analysis performed with the ‘sentimentr’ R package confirms that movies targeted at a female audience are slightly more positive than movies produced for a male audience.

Subtitles Sentiment
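
To give a rough idea of what a lexicon-based sentiment score looks like, here is a heavily simplified sketch in Python. It is not how ‘sentimentr’ works internally: that package also accounts for valence shifters such as negations, which this sketch ignores. The six-word lexicon is invented purely for illustration.

```python
# Hypothetical mini-lexicon for illustration only;
# real polarity lexicons contain thousands of entries.
LEXICON = {
    "love": 1.0, "happy": 1.0, "wonderful": 1.0,
    "kill": -1.0, "hell": -1.0, "hate": -1.0,
}

def mean_polarity(tokens, lexicon=LEXICON):
    """Average polarity over all tokens; unknown words count as 0."""
    if not tokens:
        return 0.0
    return sum(lexicon.get(token, 0.0) for token in tokens) / len(tokens)
```

Averaging such a score over all subtitle lines of a movie, and then over all movies in a category, gives one number per category that can be compared.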

These results show that the use of certain words differs noticeably between the two categories. Furthermore, these findings suggest that we can use subtitles to determine whether a movie is more likely to target a male or female audience. This is a traditional text classification problem, and a wide variety of machine learning algorithms could be applied to it, such as support vector machines, naive Bayes classifiers, or boosting classifiers.

Even though some of these algorithms would probably be more accurate and robust for this task, I used a classification tree model (the ‘rpart’ R package) because it produces results that are easy to understand and apply. The goal of this method is to derive a set of rules from input variables that can predict which class an item belongs to. In this case, the input variables for each movie are the number of occurrences of each word stem. The target category is either chick flick or guy movie. The following tree is based on a randomly selected training set.

Subtitles Regression Tree
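
To illustrate how a classification tree arrives at a rule such as "‘love’ appears more than 11 times", the following Python sketch searches for the best threshold on a single word count. It uses Gini impurity, a common splitting criterion for classification trees. This is a toy version of one split, not the ‘rpart’ implementation, and the function names are my own.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(counts, labels):
    """Find the threshold t on a word count that minimizes the weighted
    Gini impurity of the two groups `count <= t` and `count > t`."""
    best_t, best_score = None, float("inf")
    # Skip the largest value: it would leave the right-hand group empty.
    for t in sorted(set(counts))[:-1]:
        left = [y for x, y in zip(counts, labels) if x <= t]
        right = [y for x, y in zip(counts, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

A full tree repeats this search over every word stem, picks the best split, and then continues recursively within each of the two resulting groups.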

According to this model, if the word stem ‘love’ appears more than 11 times in the subtitles and ‘gun’ is used only once, the movie is most likely targeted at women. However, if ‘love’ occurs no more than 11 times in the subtitles and ‘hell’ at least once, the movie is probably produced for a male audience. When applied to the test set, this model predicts 82% of the categories correctly. This is considerably higher than the 54% no-information rate, which is the percentage of the largest class in the training set.

Overall, I think it is quite interesting to see how the results support the assumption about typical chick flicks and guy movies. Knowing that gender-specific movies are a controversial topic for some people, I hope that no one is offended by this post or my choice of words, as this is unintentional.
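
Written as code, the two branches described above and the two evaluation measures look roughly like this. The sketch is in Python and makes two assumptions: "used only once" is read as "at most once", and branches of the tree that are not described in the text simply return None.

```python
from collections import Counter

def classify(stem_counts):
    """Apply the two branches described in the text to a dict of
    word-stem counts. Returns None for branches not spelled out there.

    Assumption: 'gun is used only once' is read as 'at most once'.
    """
    love = stem_counts.get("love", 0)
    if love > 11 and stem_counts.get("gun", 0) <= 1:
        return "chick flick"
    if love <= 11 and stem_counts.get("hell", 0) >= 1:
        return "guy movie"
    return None

def accuracy(predicted, actual):
    """Share of predictions that match the true category."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def no_information_rate(labels):
    """Share of the most frequent class, i.e. the accuracy of
    always guessing that class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)
```

A model is only useful if its accuracy clearly exceeds the no-information rate, since that rate can be reached without looking at the subtitles at all.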

Download Subtitles from Amazon Video

Amazon Video offers high-quality subtitles that can be downloaded in the DFXP format using Google Chrome’s Developer Tools. To get the subtitles, you need to start streaming the movie in question. Being a Prime member is therefore very helpful, as it includes free access to a large portion of the video content. Otherwise, gathering the subtitles for a large sample of movies could become rather expensive.

Subtitles Screenshot

  1. Visit the movie’s product page (e.g. amazon.com/dp/B01CUVU7DQ)
  2. Open Developer Tools
  3. Select the Network tab
  4. Click the Watch Now button
  5. Type ‘subtitle’ into the search bar
  6. Select GetPlaybackResources?… and open the Preview tab
  7. Open the URL under [subtitleURLs > 0 > url]
  8. Done!
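
Once a file is downloaded, the spoken text still has to be pulled out of the XML. The following is a minimal sketch in Python, assuming the file follows the usual DFXP/TTML layout in which each subtitle sits in a <p> element. The sample document is made up, and real files may contain additional styling elements.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed DFXP/TTML layout.
SAMPLE = """<tt xmlns="http://www.w3.org/ns/ttml">
  <body><div>
    <p begin="00:00:01.000" end="00:00:03.000">I love you.<br/>I know.</p>
    <p begin="00:00:04.000" end="00:00:05.000">Get the gun!</p>
  </div></body>
</tt>"""

def extract_lines(dfxp_text):
    """Return the text of every <p> element, one string per subtitle."""
    root = ET.fromstring(dfxp_text)
    lines = []
    for elem in root.iter():
        # Tags look like '{namespace}p'; compare only the local name
        # so that different namespace versions are handled alike.
        if elem.tag.rsplit("}", 1)[-1] == "p":
            parts = [text.strip() for text in elem.itertext()]
            lines.append(" ".join(part for part in parts if part))
    return lines

if __name__ == "__main__":
    for line in extract_lines(SAMPLE):
        print(line)
```

Line breaks inside a subtitle (<br/>) are replaced by spaces, so each list entry holds the full text shown on screen at one time.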

If you have any questions or concerns about this post, feel free to write me an email.