What’s driving users of social networks to engage in conversations? More specifically, what’s driving them to discuss the news? That’s what I was wondering. From 8 December until 10 January I collected 1.3 million comments, each posted within 24 hours of the linked article’s initial appearance on the Fox News Facebook page. Although one can argue whether Fox ‘News’ is actually a news network, it is a very important forum where people in the United States go to express their opinion. Are there topics that encourage them to join the discussion? Some trigger words? Here’s what I found.
Between 8 December and 10 January, the average post on the Fox News Facebook page that linked to a news article received 1550 comments. The first graph shows that a disproportionately large share of a post’s comments is generated within the first hours of its appearance – every line is a Facebook post. The number of comments added more than 24 hours after the post’s appearance is negligible. And indeed, while some posts generate more than 5000 comments within 24 hours, most posts don’t, averaging 1550 comments.
If we sum all the lines from the graph above and group the comments per hour, we can see that the first two hours are the most important time frame for a Facebook post to generate comments – or, in marketing terms, ‘to be engaged with’.
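To make the aggregation concrete, here is a minimal sketch of that grouping step – in Python rather than the R I used for the analysis, and with a hypothetical toy schema (`post_id`, `hours_after_post`) standing in for the real data set:

```python
import pandas as pd

# Toy data: each row is one comment, tagged with the post it belongs to and
# how many hours after the post's appearance it was made (hypothetical schema).
comments = pd.DataFrame({
    "post_id": ["a", "a", "a", "b", "b", "c"],
    "hours_after_post": [0.2, 0.8, 5.0, 1.5, 0.4, 23.0],
})

# Bin each comment into the hour (0-23) since its post appeared,
# then count across all posts to get the aggregate hourly profile.
comments["hour_bin"] = comments["hours_after_post"].astype(int)
hourly = comments.groupby("hour_bin").size()

print(hourly.to_dict())  # in this toy example, hour 0 dominates
```

On the real data the same groupby produces the hourly engagement curve shown in the graph.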
Furthermore, for those familiar with network science, we can clearly see the preferential attachment mechanism at work. I log-transformed the number of comments within the first hour of the post’s appearance (on the x-axis) and within the first 24 hours (on the y-axis). What we see here is that the posts that received the most comments in the first hour also generated the most comments in the remaining 23 hours. These posts enjoy disproportionately high exposure: because people respond to them, they appear in more people’s news feeds, triggering even more Facebook users to comment.
That’s all pretty cool, I know, because we can more or less predict the total number of comments on a Facebook post if we track it within the first hour of its existence. Because this is a power law, we can even predict it after 5 minutes. I kid you not.
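As a sketch of that idea: fit a straight line in log–log space, which is equivalent to a power law, and extrapolate a post’s 24-hour total from its first hour. The numbers and the `predict_total` helper below are purely illustrative, not my actual fit:

```python
import numpy as np

# Hypothetical data: comments in the first hour vs. the first 24 hours
# for a handful of posts.
first_hour = np.array([10, 40, 100, 400, 1200])
first_day = np.array([60, 260, 700, 2600, 8000])

# Fit a line in log-log space: log(y) = a*log(x) + b,
# i.e. the power law y = e^b * x^a.
a, b = np.polyfit(np.log(first_hour), np.log(first_day), 1)

def predict_total(comments_first_hour):
    """Extrapolate a post's 24-hour comment count from its first hour."""
    return float(np.exp(b) * comments_first_hour ** a)

print(round(predict_total(200)))
```

The same fit works with a 5-minute window on the x-axis; only the fitted exponent and intercept change.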
However, what we’re really interested in is whether – and how – we can predict the number of comments before the article is posted to the Facebook page.
Before we proceed, you should know that, for the parametrised regression modelling techniques below to function properly, I log-transformed the dependent variable (comment counts) in all of the following graphs. The log transformation made the data normal-ish, as you can see from this quantile-quantile plot.
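A minimal illustration of why this helps, using simulated counts (not my data) that are roughly log-normal like the real ones – the skewness collapses after the transformation:

```python
import numpy as np
from scipy import stats

# Simulated right-skewed comment counts, roughly log-normal like the real data.
rng = np.random.default_rng(0)
comments = np.exp(rng.normal(loc=7, scale=1, size=500)).round()

# Raw counts are heavily right-skewed; the log transform makes them normal-ish.
log_comments = np.log(comments)

print("skew before:", round(float(stats.skew(comments)), 2))
print("skew after: ", round(float(stats.skew(log_comments)), 2))

# The quantile-quantile plot in the post corresponds to:
# stats.probplot(log_comments, dist="norm", plot=matplotlib.pyplot)
```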
Lastly, I transformed all hours to Pacific Standard Time.
What if we were able to determine the sentiment of the article that the Facebook post links to? Wouldn’t that be great? Of course. And not only is it great, it’s also fairly easy. Several databases link words to an associated feeling or connotation. The NRC Sentiment and Emotion Lexicon associates eight emotions and two sentiments with words. I scraped all 650-ish articles behind the Facebook posts and cross-referenced every word of each article with the NRC lexicon to determine its sentiment. Furthermore, I normalized the number of matches for each emotion by the total number of matches across all emotions and sentiments.
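A sketch of that matching-and-normalizing step, using a tiny three-word stand-in for the real NRC lexicon (the entries and the `article_sentiment` helper are hypothetical; the real lexicon has thousands of words):

```python
from collections import Counter

# A tiny stand-in for the NRC lexicon: word -> associated emotions/sentiments.
nrc = {
    "attack": {"anger", "fear", "negative"},
    "celebrate": {"joy", "positive"},
    "fraud": {"anger", "disgust", "negative"},
}

def article_sentiment(text):
    """Count lexicon matches per emotion/sentiment, normalized by total matches."""
    counts = Counter()
    for word in text.lower().split():
        for emotion in nrc.get(word, ()):
            counts[emotion] += 1
    total = sum(counts.values())
    return {emotion: n / total for emotion, n in counts.items()} if total else {}

scores = article_sentiment("senators attack the fraud and celebrate the outcome")
print(scores)  # 'anger' matches twice out of 8 total matches -> 0.25
```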
In the following graph we can see that some sentiments and emotions produce more response than others. For example, posts that are rather positive tend to receive fewer comments than posts that are negative. Furthermore, the more strongly articles are associated with anger, disgust and sadness, the more comments they tend to receive. For emotions such as anticipation, fear and joy, the relation is the other way around; e.g. the more joyful an article is, the fewer comments it receives.
I also clustered the articles by topic. Fox News does not use tags on its website, so there was no straightforward supervised way to classify the articles. Instead I used a very popular unsupervised document classification method called Latent Dirichlet Allocation. Determining the appropriate number of topics proved to be the hardest part. A ready-made package in R allowed me to brute-force the optimal number of topics: using Cao et al.’s density-based method to find a local minimum gave me 17 topics, after a fair amount of patience and processing time.
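I did that sweep in R; as an illustration only, here is a Python sketch of the same sweep-over-topic-counts idea using scikit-learn’s LDA on a few toy documents, with held-in perplexity as a simple stand-in for Cao et al.’s density metric:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A handful of toy documents standing in for the ~650 scraped articles.
docs = [
    "tax bill senate vote tax cuts",
    "senate vote tax reform bill",
    "wildfire california evacuation fire",
    "california fire spreads evacuation ordered",
    "christmas holiday shopping season",
    "holiday season christmas travel",
]

X = CountVectorizer().fit_transform(docs)

# Sweep candidate topic counts and score each fitted model;
# lower perplexity is better (a crude proxy for the density-based metric).
scores = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    scores[k] = lda.perplexity(X)

best_k = min(scores, key=scores.get)
print("chosen number of topics:", best_k)
```

On the real corpus this kind of sweep is what consumed the patience and processing time mentioned above.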
I arbitrarily labeled the topics as: Christmas, the special election in Alabama, Israel & the Middle East, a variety of scandals, women’s rights and the #metoo movement, religion, airport chaos, new year, the tax bill, North Korea, wildfires in California, the FBI investigations (both Trump- & Clinton-related), Trump and his tweets, crime in general, global terrorism, migration and a leftover category.
Articles related to the holidays (Christmas and New Year) receive fewer comments on average. Yet articles related to the Alabama special election, the FBI investigations, President Trump‘s tweets, women’s rights and scandals in general received more comments. Identitarian topics such as religion and migration also seem to have done well in drawing comments.
Are there words, terms or names that trigger the social network’s users to respond to an article? I looked at the top 50 words that appeared in the titles of Facebook posts and arbitrarily selected some that I thought could be of significance.
The words I selected were: trump, opinion, christmas, korea, tax, california, sexual, senate, fire, war, israel, clinton, russia and alabama. Yes, I am aware that many of these trigger words correlate with the topics. While I haven’t taken the time to do it, the Jaccard index could be used to test that hypothesis.
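The Jaccard index itself is trivial to compute; a sketch with hypothetical post-id sets (which posts contain ‘alabama’ in the title, and which posts LDA assigned to the Alabama-election topic):

```python
def jaccard(a, b):
    """Jaccard index between two sets: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical post ids for one trigger word and one topic.
has_word_alabama = {1, 2, 3, 5}
topic_alabama = {2, 3, 5, 8}

print(jaccard(has_word_alabama, topic_alabama))  # 3 shared of 5 total -> 0.6
```

A Jaccard index near 1 would mean the trigger word and the topic mark essentially the same posts, so the two features would be redundant.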
While there is no clear relationship between the presence of most of these words in the post title and the number of comments, one stands out: ‘Clinton‘. It should come as no surprise that the former presidential candidate is the antagonist of Fox News’ mainly Republican audience.
Last but not least, since most comments are generated within the first two hours of a post’s appearance, the time frame in which the article is posted can be very relevant. This is what we see in the following graph. Posts published between 7am and 9am, and between 8pm and 10pm, tend to receive more comments. But the time window that really stands out is around noon, when people take a break at work.
So, the final question: with all these insights, can we completely predict how many comments a post on Fox News’ Facebook page will receive in the first 24 hours?
I tried three closely related modelling techniques. A regression tree with cross-validation produced nothing worth mentioning. A non-cross-validated regression with forward step-wise feature selection produced an R²-value of 0.07. A lasso regression with cross-validation produced an R²-value of 0.12. The determining features in this model are:
- Emotion: fear (negative relation)
- Emotion: sadness (positive relation)
- Sentiment: positive (negative relation)
- Is the article posted at 1pm (positive relation)
- Is the article related to the holidays (negative relation)
- Is the article related to migration (positive relation)
- Is the article related to women’s rights (positive relation)
- Is the article related to the California wildfires (negative relation)
- Does the post title contain the name ‘Clinton‘ (positive relation)
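As a sketch of the modelling step (scikit-learn’s `LassoCV` on simulated data, not my actual R fit – in the real model the columns were the emotion, topic, time and trigger-word features listed above):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulated design matrix: each row is a post, each column a feature.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 10))
# Simulated log comment counts driven by the first two features plus noise.
y = 0.4 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Cross-validated lasso picks the regularisation strength itself.
model = LassoCV(cv=5, random_state=0).fit(X, y)

# Lasso shrinks uninformative coefficients to exactly zero, leaving a
# sparse list of "determining features" like the one above.
kept = [i for i, c in enumerate(model.coef_) if abs(c) > 1e-6]
print("features kept:", kept)
print("R² on training data:", round(model.score(X, y), 2))
```

The sign of each surviving coefficient gives the positive or negative relation reported in the list.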
No: an R²-value of 0.12 tells us that the predictive power of our model (its goodness of fit) is quite low. However, we were able to identify some interesting factors that drive the engagement of Facebook users on Fox News’ page.
In the next graph I compare the real (log-transformed) number of comments in the test set with the number predicted by the lasso regression model. The closer a dot is to the diagonal, the better the prediction. Red dots are overestimated and blue dots underestimated by the model. Residual plots here, here and here.
Conclusion: what’s driving engagement on Fox News’ Facebook page? Finding out has proven not to be an easy task. Topics, sentiment and emotion, time of posting: they all have some impact, but their predictive power has proven to be quite low. What could be done to give the model more predictive power? I think other modelling techniques would only marginally improve the accuracy of the model. The most promising approaches would be:
- Engineering more features. Applying image content analysis such as Google Vision to the post images could be promising. Many articles correlate with President Trump’s Twitter feed; e.g. his Don Quixote-esque battle against the national anthem protests in the NFL. So to predict post engagement, one could include a dummy feature that captures whether Trump tweeted about the subject in the previous 72 hours.
- Gaining insight into the Facebook algorithm. But you and I both know that’s not going to happen :-).
Code can be found here.
Data set on request. Might be added to Kaggle in the future.