Hello fellow football fans, and welcome to a post that will assign numbers to your footy feelings. On the eve of the 2018 World Cup, I noticed that no one had made a “World Cup Thread”-type list, so I decided to start one. At some point, I realized that I could leverage the comments people were making into some sweet, sweet content. Specifically, I sought to measure the sentiment of each comment (positive or negative) which I could then summarize by World Cup team and by user.
Recently, I was googling sentiment analysis and came upon this post. The post describes, and includes code for, a model that uses the words of tweets to predict the sentiment of each tweet (a sentiment of 1 being positive, 0 negative). The post is about a year old and uses tweets, not sputnik comments, as training data, so the model may not exactly match the vocabulary and sincerity level of our own sputnik soccer commenters. Regardless, I fit the sentiment model from the code in that post, scraped the comments from the World Cup 2018 list following the conclusion of the group stages, and then applied the sentiment model to each comment. The model assigns each comment a value between 0 and 1, 1 being positive and 0 being negative. Most comments lie somewhere in the middle: about 75% of comments are between .25 and .75, and about 92% are between .1 and .9.
So, after scoring every comment with that model, I searched the comments for mentions of each team. (Note: I used regular expressions to find them, so if you were playing the pronoun game when referring to a team, I could not detect which country your comment was about.) Many comments mentioned multiple teams, and some seemingly contained multiple sentiments, so I decided to assign a label only to comments that mentioned exactly one country.
The model has a fairly good test AUC of .87 (meaning that it performs pretty well at predicting sentiment out of sample), but it isn’t perfect. For instance, here are some high-sentiment comments (> .9) that seem off:
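A minimal sketch of that team-matching step might look like the following in Python. It is not the script used for this post: the short team list, the function names, and the choice of case-insensitive word-boundary matching are all illustrative assumptions.

```python
import re

# Hypothetical subset of team names; a full list would cover all 32 teams.
TEAMS = ["england", "croatia", "argentina", "belgium", "south korea"]

# One case-insensitive pattern per team. The word boundaries keep a team
# name from matching when it is only part of a longer word.
TEAM_PATTERNS = {
    team: re.compile(r"\b" + re.escape(team) + r"\b", re.IGNORECASE)
    for team in TEAMS
}


def teams_mentioned(comment):
    """Return every team whose name appears in the comment."""
    return [team for team, pattern in TEAM_PATTERNS.items()
            if pattern.search(comment)]


def single_team_label(comment):
    """Return the team if exactly one is mentioned, otherwise None."""
    found = teams_mentioned(comment)
    return found[0] if len(found) == 1 else None


print(single_team_label("let's see what belgium does"))       # belgium
print(single_team_label("England vs Croatia will be tight"))  # None
print(single_team_label("they were awful today"))             # None
```

The last example is the pronoun game in action: no team name, no label.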
Are there any proud prostitution moments?
let’s see what belgium does
Told ya
come off it will ya
Kane is awesome and thicc as fuck agreed
The model is also not a good lie detector, apparently. Here are some weird low sentiment comments (<.15):
Modric tho
coutinho god damn
damn that was intense
Another problem: when someone repeats someone else’s comment but adds a “[2]”, the model gives a slightly different sentiment. For instance, “Argentina are shockingly awful” and “Argentina are shockingly awful [2]” get values of .099 and .125 respectively. All models have problems. You know what they say, “Every model is wrong, but some are shockingly awful god damn that was intense tho.”
(Code/data for this is here. In order to run the script you will need to download the tweet data, link for which is found here.)
The following is a plot of the sentiment of each comment, in the order the comments were made, with a flag placed on each point whose comment contained the name of a country. (Flag images courtesy of gosquared.)
After adjusting for uncertainty by using the lower bound of the 95% confidence interval, the following table ranks each country by sentiment. (The sentiment column contains the average sentiment, which is NOT used directly for ranking; the ranking also accounts for the number of comments, n_comments.)
l95_Rank | team | sentiment | n_comments |
---|---|---|---|
1 | england | 0.606 | 65 |
2 | croatia | 0.663 | 8 |
3 | argentina | 0.531 | 31 |
4 | sweden | 0.587 | 13 |
5 | portugal | 0.538 | 21 |
6 | south korea | 0.611 | 9 |
7 | germany | 0.504 | 28 |
8 | senegal | 0.626 | 7 |
9 | spain | 0.603 | 8 |
10 | switzerland | 0.631 | 6 |
11 | belgium | 0.565 | 10 |
12 | tunisia | 0.560 | 10 |
13 | denmark | 0.558 | 9 |
14 | egypt | 0.587 | 7 |
15 | mexico | 0.556 | 9 |
16 | brazil | 0.504 | 14 |
17 | japan | 0.536 | 8 |
18 | panama | 0.492 | 10 |
19 | iceland | 0.501 | 9 |
20 | iran | 0.444 | 11 |
21 | russia | 0.546 | 4 |
22 | peru | 0.529 | 3 |
23 | france | 0.351 | 10 |
24 | uruguay | 0.523 | 2 |
25 | australia | 0.514 | 2 |
26 | colombia | 0.352 | 7 |
27 | morocco | 0.614 | 1 |
28 | nigeria | 0.489 | 2 |
29 | poland | 0.335 | 5 |
30 | saudi arabia | 0.298 | 1 |
31 | costa rica | 0.144 | 1 |
32 | serbia | 0.001 | 1 |
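
To see how a lower-bound ranking like the one above can put a team with many comments ahead of a team with a higher average, here is a rough sketch. The exact interval formula behind the table is not spelled out in this post, so this assumes a normal-approximation interval with one standard deviation pooled across all comments (which also lets single-comment teams get a bound). The function name and the toy scores are made up.

```python
import math
from statistics import mean, pstdev


def l95_ranking(scores_by_team, z=1.96):
    """Rank teams by the lower bound of an approximate 95% interval.

    Assumption: every team shares the standard deviation of all comments
    pooled together, so teams with a single comment still get a bound.
    """
    all_scores = [s for scores in scores_by_team.values() for s in scores]
    pooled_sd = pstdev(all_scores)

    rows = []
    for team, scores in scores_by_team.items():
        n = len(scores)
        avg = mean(scores)
        lower = avg - z * pooled_sd / math.sqrt(n)  # fewer comments, wider interval
        rows.append((team, round(avg, 3), n, round(lower, 3)))

    # Highest lower bound first.
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Made-up scores, not the real comment data.
toy_scores = {
    "many_comments": [0.5, 0.7, 0.6, 0.6, 0.5, 0.7, 0.6, 0.6],
    "one_comment": [0.65],
    "low": [0.2, 0.3],
}

ranking = l95_ranking(toy_scores)
for row in ranking:
    print(row)
```

In this toy data, "one_comment" has the highest average, but its single comment leaves so much uncertainty that "many_comments" ranks first.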
After the uncertainty adjustment, the following tables rank the 20 most positive users, followed by the 20 most negative. (84 unique users have commented so far.)
l95_Rank | user | sentiment | n_comments |
---|---|---|---|
1 | osmark86 | 0.586 | 85 |
2 | hal1ax | 0.659 | 20 |
3 | zakalwe | 0.583 | 61 |
4 | DoofusWainwright | 0.567 | 85 |
5 | anatelier | 0.663 | 14 |
6 | Egarran | 0.554 | 62 |
7 | Flugmorph | 0.558 | 55 |
8 | deezer666 | 0.546 | 68 |
9 | pypypymble | 0.533 | 63 |
10 | RadicalEd | 0.562 | 37 |
11 | Sniff | 0.553 | 41 |
12 | Winesburgohio | 0.626 | 14 |
13 | DDDeftoneDDD | 0.589 | 19 |
14 | jagride | 0.665 | 9 |
15 | Maco097 | 0.583 | 17 |
16 | Casavir | 0.636 | 10 |
17 | anarchistfish | 0.535 | 27 |
18 | Demon of the Fall | 0.539 | 25 |
19 | iglu | 0.534 | 26 |
20 | Evreaia | 0.609 | 11 |
…
u95_Rank | user | sentiment | n_comments |
---|---|---|---|
1 | Dewinged | 0.434 | 21 |
2 | rabidfish | 0.469 | 32 |
3 | Sinternet | 0.483 | 29 |
4 | Alastor | 0.406 | 12 |
5 | pypypymble | 0.533 | 63 |
6 | deezer666 | 0.546 | 68 |
7 | DoofusWainwright | 0.567 | 85 |
8 | Egarran | 0.554 | 62 |
9 | RunOfTheMill | 0.417 | 10 |
10 | Flugmorph | 0.558 | 55 |
11 | Keyblade | 0.334 | 5 |
12 | osmark86 | 0.586 | 85 |
13 | Sniff | 0.553 | 41 |
14 | TheNotrap | 0.379 | 6 |
15 | zakalwe | 0.583 | 61 |
16 | anarchistfish | 0.535 | 27 |
17 | Doctuses | 0.466 | 12 |
18 | iglu | 0.534 | 26 |
19 | RadicalEd | 0.562 | 37 |
20 | Kusangii | 0.441 | 9 |
Some users appear in both the most positive and most negative user rankings. If that’s you, it’s because you commented a lot, and the uncertainty adjustment pushes you toward the top of both lists. Or it could be math demonstrating its appreciation for the highs and lows, the drama, and the beauty of the game. Math, it contains multitudes.
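
Here is a small sketch of why that happens. It assumes the positive list ranks by the lower bound of the interval and the negative list by the upper bound (going by the l95_Rank and u95_Rank headers), and it uses a made-up standard deviation of 0.25 with three hypothetical users.

```python
import math

ASSUMED_SD = 0.25  # assumption: rough spread of comment sentiment
Z = 1.96           # normal-approximation multiplier for a 95% interval

# Hypothetical users: (average sentiment, number of comments).
users = {
    "heavy_commenter": (0.55, 80),
    "cheerful_lurker": (0.62, 5),
    "grumpy_lurker": (0.48, 5),
}


def bounds(avg, n):
    """Return (lower, upper) bounds of the approximate 95% interval."""
    half_width = Z * ASSUMED_SD / math.sqrt(n)
    return avg - half_width, avg + half_width


# Most positive: highest lower bound. Most negative: lowest upper bound.
most_positive = max(users, key=lambda u: bounds(*users[u])[0])
most_negative = min(users, key=lambda u: bounds(*users[u])[1])

for name, (avg, n) in users.items():
    lower, upper = bounds(avg, n)
    print(f"{name}: lower={lower:.3f}, upper={upper:.3f}")

print("most positive:", most_positive)
print("most negative:", most_negative)
```

With 80 comments the interval is narrow, so the heavy commenter has both the highest lower bound and the lowest upper bound, and tops both lists despite a middling average.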