Statnik 13: The Sputnik World Cup Sentiment

Published: June 29, 2018


Hello fellow football fans, and welcome to a post that will assign numbers to your footy feelings. On the eve of the 2018 World Cup, I noticed that no one had made a “World Cup Thread”-type list, so I decided to start one. At some point, I realized that I could leverage the comments people were making into some sweet, sweet content. Specifically, I sought to measure the sentiment of each comment (positive or negative), which I could then summarize by World Cup team and by user.

Recently, I was googling sentiment analysis and came upon this post. The post describes, and includes code for, a model that uses the words in a tweet to predict its sentiment (1 being positive, 0 negative). The post is about a year old and uses tweets rather than sputnik comments as training data, so the model may not exactly match the vocabulary and sincerity level of our own sputnik soccer commenters. Regardless, I fit the sentiment model using the code in that post, scraped the comments from the World Cup 2018 list following the conclusion of the group stages, and then applied the model to each comment. The model assigns each comment a value between 0 and 1, with 1 being positive and 0 being negative. Most comments lie somewhere in the middle: ~75% of comments are between .25 and .75, and ~92% are between .1 and .9.
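For the curious, here is a minimal sketch of how those "share of comments in the middle" figures can be computed from a list of scores. The function name and the example scores are made up for illustration; they are not the real comment data.

```python
def share_between(scores, low, high):
    """Return the fraction of scores s with low <= s <= high."""
    if not scores:
        raise ValueError("scores must not be empty")
    inside = sum(1 for s in scores if low <= s <= high)
    return inside / len(scores)


# Hypothetical sentiment scores, NOT the scraped comment scores.
example_scores = [0.05, 0.30, 0.45, 0.55, 0.60, 0.72, 0.80, 0.95]

middle_share = share_between(example_scores, 0.25, 0.75)
wide_share = share_between(example_scores, 0.10, 0.90)
print(middle_share, wide_share)
```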

So, after scoring every comment with that model, I searched the comments for mentions of each team. (Note: I used regular expressions to find them, so if you were playing the pronoun game when referring to a team, I could not detect that the comment was about that country.) Many comments mentioned multiple teams, and some seemingly had multiple sentiments within them, so I decided to assign a team label only to comments that mentioned exactly one country.
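The team-matching step could look roughly like the sketch below. It is an assumption about the approach rather than the actual script: the team list is a small illustrative subset, and the helper names are invented here.

```python
import re

# Illustrative subset of teams; a full version would list all 32.
TEAMS = ["england", "belgium", "argentina", "south korea", "costa rica"]

# One case-insensitive pattern per team, with word boundaries so a team
# name is not matched inside a longer word.
TEAM_PATTERNS = {
    team: re.compile(r"\b" + re.escape(team) + r"\b", re.IGNORECASE)
    for team in TEAMS
}


def teams_mentioned(comment):
    """List every team whose name appears in the comment."""
    return [team for team, pattern in TEAM_PATTERNS.items()
            if pattern.search(comment)]


def single_team_label(comment):
    """Return the team if exactly one is mentioned, otherwise None."""
    found = teams_mentioned(comment)
    return found[0] if len(found) == 1 else None
```

With this, "let’s see what belgium does" gets the label belgium, while a comment naming both England and Belgium, or one that names no team at all, gets no label.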

The model has a fairly good test AUC of .87 (meaning that it performs pretty well at predicting sentiment out of sample), but it isn’t perfect. For instance, here are some high-sentiment comments (>.9) that seem off:

Are there any proud prostitution moments?

let’s see what belgium does

Told ya

come off it will ya

Kane is awesome and thicc as fuck agreed

The model is also not a good lie detector, apparently. Here are some weird low-sentiment comments (<.15):

Modric tho

coutinho god damn

damn that was intense

Another problem: when someone repeats someone else’s comment but adds a “[2]”, the model gives a slightly different sentiment. For instance, “Argentina are shockingly awful” and “Argentina are shockingly awful [2]” get values of .099 and .125, respectively. All models have problems. You know what they say, “Every model is wrong, but some are shockingly awful god damn that was intense tho.”
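As an aside on the AUC figure quoted earlier: one way to read an AUC is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all positive/negative pairs. The labels and scores are toy values, not the tweet test set, and this is not necessarily how the original code computed it.

```python
def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: three positive and two negative examples.
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.5, 0.2]
toy_auc = auc(toy_labels, toy_scores)  # 5 of the 6 pairs are ordered correctly
```

An AUC of 1 would mean every positive outranks every negative, and .5 is what random guessing gets, so .87 sits comfortably on the good side.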

(Code/data for this is here. In order to run the script you will need to download the tweet data, link for which is found here.)

The following is a plot of the sentiment of each comment, in the order the comments were made, with a flag placed on each point whose comment contained the name of a country. (Flag images courtesy of gosquared.)

So much England


After adjusting for uncertainty by using the lower bound of the 95% confidence interval, the following table ranks the countries by sentiment. (The sentiment column contains the average sentiment, which is NOT directly used for the ranking; the ranking is based on the lower bound, which also depends on the number of comments, n_comments.)

l95_Rank team sentiment n_comments
1 england 0.606 65
2 croatia 0.663 8
3 argentina 0.531 31
4 sweden 0.587 13
5 portugal 0.538 21
6 south korea 0.611 9
7 germany 0.504 28
8 senegal 0.626 7
9 spain 0.603 8
10 switzerland 0.631 6
11 belgium 0.565 10
12 tunisia 0.560 10
13 denmark 0.558 9
14 egypt 0.587 7
15 mexico 0.556 9
16 brazil 0.504 14
17 japan 0.536 8
18 panama 0.492 10
19 iceland 0.501 9
20 iran 0.444 11
21 russia 0.546 4
22 peru 0.529 3
23 france 0.351 10
24 uruguay 0.523 2
25 australia 0.514 2
26 colombia 0.352 7
27 morocco 0.614 1
28 nigeria 0.489 2
29 poland 0.335 5
30 saudi arabia 0.298 1
31 costa rica 0.144 1
32 serbia 0.001 1
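The lower-bound ranking behind the table above can be sketched as follows. The details are assumptions, since the post does not spell them out: this uses a normal approximation (mean minus 1.96 standard errors) and, for a group with a single comment, falls back to the standard deviation of all scores. The group names and scores are hypothetical.

```python
import math
import statistics

Z_95 = 1.96  # normal-approximation multiplier for a 95% interval


def lower_bound_95(scores, fallback_sd):
    """Lower end of an approximate 95% confidence interval for the mean."""
    n = len(scores)
    mean = statistics.mean(scores)
    # A single score has no sample standard deviation, so borrow one.
    sd = statistics.stdev(scores) if n > 1 else fallback_sd
    return mean - Z_95 * sd / math.sqrt(n)


def rank_by_lower_bound(scores_by_group):
    """Return group names ordered from highest to lowest lower bound."""
    all_scores = [s for scores in scores_by_group.values() for s in scores]
    fallback_sd = statistics.stdev(all_scores)
    bounds = {group: lower_bound_95(scores, fallback_sd)
              for group, scores in scores_by_group.items()}
    return sorted(bounds, key=bounds.get, reverse=True)


# Hypothetical scores: team_c has the highest average but only one comment.
example = {
    "team_a": [0.60, 0.55, 0.65, 0.60, 0.58, 0.62, 0.60, 0.61],
    "team_b": [0.90, 0.40],
    "team_c": [0.70],
}
ranking = rank_by_lower_bound(example)
```

Here team_a comes out on top despite having the lowest average of the three, because its many consistent comments give it the tightest interval. That is the same effect that puts England (65 comments) above Croatia (8 comments) in the table.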

After the same uncertainty adjustment, the following tables rank the top 20 most positive users, followed by the top 20 most negative. (84 unique users have commented so far.)

l95_Rank user sentiment n_comments
1 osmark86 0.586 85
2 hal1ax 0.659 20
3 zakalwe 0.583 61
4 DoofusWainwright 0.567 85
5 anatelier 0.663 14
6 Egarran 0.554 62
7 Flugmorph 0.558 55
8 deezer666 0.546 68
9 pypypymble 0.533 63
10 RadicalEd 0.562 37
11 Sniff 0.553 41
12 Winesburgohio 0.626 14
13 DDDeftoneDDD 0.589 19
14 jagride 0.665 9
15 Maco097 0.583 17
16 Casavir 0.636 10
17 anarchistfish 0.535 27
18 Demon of the Fall 0.539 25
19 iglu 0.534 26
20 Evreaia 0.609 11

u95_Rank user sentiment n_comments
1 Dewinged 0.434 21
2 rabidfish 0.469 32
3 Sinternet 0.483 29
4 Alastor 0.406 12
5 pypypymble 0.533 63
6 deezer666 0.546 68
7 DoofusWainwright 0.567 85
8 Egarran 0.554 62
9 RunOfTheMill 0.417 10
10 Flugmorph 0.558 55
11 Keyblade 0.334 5
12 osmark86 0.586 85
13 Sniff 0.553 41
14 TheNotrap 0.379 6
15 zakalwe 0.583 61
16 anarchistfish 0.535 27
17 Doctuses 0.466 12
18 iglu 0.534 26
19 RadicalEd 0.562 37
20 Kusangii 0.441 9

Some users appear in both the most positive and most negative user rankings. If that’s you, it’s because you commented a lot and the uncertainty adjustment pushes you to the top of both lists. Or it could be math demonstrating its appreciation for the highs and lows, the drama, and the beauty of the game. Math, it contains multitudes.
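To see why heavy commenters can land on both lists, here is a rough sketch of how the interval narrows as the comment count grows. The means and comment counts are taken from the user tables above, but the shared standard deviation of 0.25 is purely an assumption (per-user spreads are not given), so the bounds are illustrative only.

```python
import math

ASSUMED_SD = 0.25  # hypothetical; the real per-user standard deviations are unknown


def interval_95(mean, n, sd=ASSUMED_SD):
    """Approximate 95% interval for a mean: (lower bound, upper bound)."""
    half_width = 1.96 * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


heavy = interval_95(0.586, 85)  # osmark86: mean and n_comments from the table
light = interval_95(0.665, 9)   # jagride: mean and n_comments from the table
print(heavy, light)
```

Under that assumption, the 85-comment user has both a higher lower bound and a lower upper bound than the 9-comment user, so a middling but prolific commenter can rank well whether you sort by the lower bound (most positive) or the upper bound (most negative).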
