Statnik 10: Metacritic

Published: February 25, 2018

Hello, fellow numbers in a weighted average, and welcome to an investigation of how much weight this site really has. It may come as a shock to you, or maybe you’ve just never considered it, but metacritic collects the ratings from the staff reviews of this very website to make their average scores. Specifically, they take the rating from each review and scale it to 100 by multiplying it by 20 (i.e. a 4.6 review becomes a 92 on metacritic). Then, for each particular album with more than 4 scores, they calculate a weighted average. Weights for each publication are assigned “based on their quality and overall stature”, and these weights are not revealed to the public. They include this question in their FAQ:
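For the curious, here is a minimal sketch of that conversion and of a weighted average for a single album. The scores and weights are made-up numbers, and `to_metacritic_scale` is a hypothetical helper name; nothing here reflects metacritic's real weights.

```r
# Convert a 0-5 review rating to the 0-100 scale by multiplying by 20.
to_metacritic_scale <- function(rating) rating * 20

to_metacritic_scale(4.6)  # 92

# Hypothetical scores and weights for one album (made-up numbers).
scores  <- c(92, 80, 70, 60, 85)
weights <- c(1, 2, 1, 1.5, 1)

# Weighted average: sum(w * x) / sum(w).
# This is the same thing base R's weighted.mean(scores, weights) computes.
metascore <- sum(weights * scores) / sum(weights)
```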

CAN YOU TELL ME HOW EACH OF THE DIFFERENT CRITICS ARE WEIGHTED IN YOUR FORMULA?

Absolutely not.

I’m fond of weighted averages. For instance, my user-usage adjusted means are weighted averages. They are simple mathematically and conceptually. “Imagine if you got 2 votes and everyone else had 1.” Boom. Weighted average. “Imagine if you got a vote proportional to your wealth.” Boom. Politics around the globe (and a weighted average). The problem with weighted averages (relative to, say, statistical models) is that assigning weights is an arbitrary exercise. For my user-usage adjusted average ratings, I assign a weight of 3 to the count of a user’s reviews, 2 to lists, and 1 to comments and ratings counts. Were these weights ordained from God? No, they just seemed right to me, and at some point I had to just pick something.
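The “2 votes” idea can be sketched in a couple of lines of R. The ratings below are made up purely for illustration; the point is only that a weight of 2 behaves exactly like counting that rating twice.

```r
# Hypothetical ratings from three voters; the first voter gets 2 votes.
ratings <- c(5, 3, 4)
votes   <- c(2, 1, 1)

# Weighted average using the vote counts as weights.
weighted <- weighted.mean(ratings, votes)

# Equivalent: repeat each rating by its vote count and take a plain mean.
repeated <- mean(rep(ratings, times = votes))
```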

Arbitrary or not, via weighted average or not, review/rating aggregation has become a bogeyman of sorts. I’ve come across a lot of articles pearl-clutching over its negative effects on consumers and especially the movie industry (here, here, and here). To summarize some of the criticisms, good aggregate scores (whether via Rotten Tomatoes, metacritic, imdb, RYM, Sputnik or your local book/knitting/gossiping club) supposedly affect sales and have sometimes been used to set incentives for bonuses for creators. Given the economic importance, aggregators may be ripe for corruption or unethically deceptive in some way. Aggregation possibly affects the risk/reward calculus of creators. If aggregated scores become a target for filmmakers/musicians/game developers, then maybe they are less likely to try making risky, moonshot projects. Aggregated review scores mask the variance or individual difference of the audience and their experiences (i.e. a particular movie is your Rushmore but somebody else’s Birdemic). If the reviewers are not the intended audience and if there is an imbalance in the reviewer pool (for instance, a gender or race disparity), it can lead to scores that discount aspects important to a project’s intended/natural audience. Most devastating of all, sometimes you put out a bad movie; and, by having a big name actor or basing it on pre-existing IP, you try to hide how bad it is and still make some money on it; but those meddling reviewers take turns dunking on you and the ledger of this posterization can be found on a single website.

People in the past have tried to calculate the metacritic weights (this guy, for instance, whose post will become important if you read further). Some researchers published, with errors, the weights (and more importantly the tiers) of video game reviewers. Following that, Metacritic released a Facebook post arguing that the researchers were inaccurate and generally trashing their analysis. In that post they reveal some interesting (possibly smoke screen-ed) information. They share that the weight disparity between publications is never higher than 6 times (meaning the lowest weighted publication is given at least 1/6th the weight of the highest weighted publication) and that they do update their weights if “a publication demonstrates an increase or decrease in overall quality”.

It’s unclear if these rules apply to the music metascores, but if the stakes for being a “high quality” publication net you less than 6 times influence over a “low quality publication”, with so little leverage either way, why even do a weighted average over just a simple average? If the weights aren’t public, and the process for updating them is not transparent, why even change them? More importantly, if the weights are as tepid as metacritic says, then we shouldn’t be “mad on the internet” about their process… unless they’re giving us a low weight.

For the purposes of finding out sputnik’s metacritic weight, I scraped data for 250 of the 1600+ listed albums that sputnik has reviews for on the metacritic sputnik profile page. Since this is all fun and games (everybody is having fun, right?) and I (ultra-powerful blogger, I) don’t want to anger the metacritic people, I anonymized all the other publications. There were 92 other publications that had reviews on those album pages, and instead of revealing them, I gave them an anonymized number — they aren’t even in alphabetical order. Each row of the data is from a particular album, but no identifying information is given for that album page. Also, when a publication had more than one score for an album, I averaged their scores and that is what is included in the dataset. Basically, no one can take this information and get the weights for any other publication. (Dataset here).
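As a rough illustration of that cleanup step, here is a sketch of averaging duplicate scores and swapping publication names for shuffled IDs. The data frame, its column names, and the `pub_` prefix are all hypothetical; the actual dataset and script are the ones linked above.

```r
# Hypothetical long-format scrape: one row per review (made-up data).
reviews <- data.frame(
  album       = c("a1", "a1", "a1", "a2", "a2"),
  publication = c("PubX", "PubX", "PubY", "PubX", "PubY"),
  score       = c(80, 90, 70, 60, 75)
)

# Average duplicate scores from the same publication on the same album.
collapsed <- aggregate(score ~ album + publication, data = reviews, FUN = mean)

# Replace publication names with shuffled numeric IDs,
# so the numbering does not follow alphabetical order.
set.seed(1)
pubs <- unique(as.character(collapsed$publication))
ids  <- setNames(sample(seq_along(pubs)), pubs)
collapsed$publication <- paste0("pub_", ids[as.character(collapsed$publication)])
```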

Because, as far as I know, the scores are a weighted average and not the result of a more complex model, I initially could not come up with a way to calculate the weights. I googled until I stumbled on this blog post that reverse engineered the weights for movie review publications. I almost gave up on this whole thing because it was too much googling and too little doing, but, then, I hit up google some more and figured out how to do some crude optimization in R. (Another reason to not worry that this analysis means anything, metacritic, is that I don’t really know what I’m doing.)

In order to do the optimization I had to come up with a loss function (I know, I barely know what it is either). Essentially, in this particular analysis, weights for a weighted average are given to each publication (at the start, all the weights are set to the same value, which is equivalent to just doing the standard arithmetic mean) and an optimization algorithm perturbs those weights and searches until it finds a set that minimizes some error value of importance.
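To make that less abstract, here is a minimal sketch of the idea on synthetic data. It is not the analysis script: it uses base R's `optim` as a stand-in optimizer, assumes every publication scored every album, and fakes the target scores from a hidden weight vector (`true_w`) that exists only for the demo.

```r
set.seed(42)

# Synthetic stand-in data: 60 albums x 5 publications, scores on a 0-100 scale.
n_albums <- 60
n_pubs   <- 5
scores <- matrix(round(runif(n_albums * n_pubs, 40, 100)), nrow = n_albums)

# Hidden weights, used only to fabricate a target score per album for the demo.
true_w <- c(3, 1, 1, 2, 1)
target <- as.vector(scores %*% true_w) / sum(true_w)

# Weighted average per album (row) for a given weight vector.
weighted_avg <- function(w, scores) as.vector(scores %*% w) / sum(w)

# Loss: RMSE between the target scores and the calculated weighted averages.
rmse_loss <- function(w, scores, target) {
  sqrt(mean((target - weighted_avg(w, scores))^2))
}

# Start from equal weights, which reproduces the plain arithmetic mean.
start <- rep(1 / n_pubs, n_pubs)

# Let the optimizer perturb the weights to shrink the loss.
fit <- optim(start, rmse_loss, scores = scores, target = target,
             method = "L-BFGS-B", lower = 0.001, upper = 1)

start_rmse <- rmse_loss(start, scores, target)  # error of the simple mean
final_rmse <- fit$value                         # error after the search
```

Because the weighted average divides by the sum of the weights, the fitted weights in this sketch are only meaningful relative to one another.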

The value that the algorithm looks to minimize in this analysis is the root mean squared error (rmse) of the actual metacritic average of the albums compared to a set of calculated weighted averages. In order to help guide and constrain this search, I also set some penalties for when the weights have an improper quality. The rmse is increased by 100 times the amount by which the sum of the weights differs from 1 (the weights should add up to 1, and the search is penalized if they do not), and it is increased by 100 times the number of weights that are above 1 or below 0.001 (no weight should be above 1 or below 0.001). I am skeptical that the highest possible ratio between high and low quality publications is 6 given the outcome, for instance, of these scores from mewithoutYou’s Ten Stories page. Assuming there is no outlier/lowest score removal, and that they round up from .5’s, Absolute Punk would have to have a weight of at least 7 times that of each of the other 3 publications (one of them being us).

# R code for a possible weighted average of mewithoutYou - Ten Stories with weights
weighted.mean(c(85, 80, 80, 80), c(.7, .1, .1, .1)) # = 83.5
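Where does the 7 come from? A quick sketch, assuming the four scores above, no score removal, and that the displayed score only needs the weighted average to reach 83.5 before rounding up. The helper name `avg_at_ratio` is made up for this example.

```r
# Scores from the example above: one publication at 85, three at 80.
scores <- c(85, 80, 80, 80)

# Weighted average when the first publication has `ratio` times the weight
# of each of the other three.
avg_at_ratio <- function(ratio) weighted.mean(scores, c(ratio, 1, 1, 1))

# Smallest average that still rounds up (assuming .5 rounds up).
threshold <- 83.5

# Solve avg_at_ratio(ratio) = threshold numerically.
# By hand: (85r + 240) / (r + 3) = 83.5  =>  1.5r = 10.5  =>  r = 7.
min_ratio <- uniroot(function(r) avg_at_ratio(r) - threshold,
                     interval = c(1, 100), tol = 1e-9)$root
```

With equal weights (`avg_at_ratio(1)`) the average is only 81.25, so under these assumptions the 85 needs a lot of help to drag the result up to 83.5.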

All that they have signaled indicates that they are committed to weighted averages with publications having some variance in their weight, and I believe them, so I also set a penalty of 100 if the disparity between the highest and lowest weights is greater than 15 or less than 2. Finally, since I know almost nothing about optimization and which type of algorithm suits this analysis best, I ran multiple types with the optimx library written by (not-Russell Crowe) John Nash. When I was testing this, some algorithms seemed to not search at all, and some suggested that there were too many parameters, so I narrowed it down to the 5 types listed in the optimx function call of my script. All this optimization is done 3-fold cross-validated. (For those of you who don’t know, that means the data is split 3 ways, and each individual chunk is separately withheld from the optimization and used as the data to test the weights on. This is done 3 times, once for each chunk, and the rmse is averaged across the 3 folds.) Why all these specific parameters? They all seemed right to me, and at some point I had to pick something. (Full analysis script here).
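Here is a rough sketch of how the penalties and the 3-fold cross-validation could fit together. It is an illustration under assumptions, not the linked script: the data are synthetic, base R's `optim` with Nelder-Mead stands in for the optimx methods, missing reviews are skipped when averaging, and the function names are made up.

```r
set.seed(7)

# Synthetic stand-in data: 90 albums x 6 publications, with gaps.
n_albums <- 90
n_pubs   <- 6
scores <- matrix(round(runif(n_albums * n_pubs, 40, 100)), nrow = n_albums)
gaps <- matrix(runif(n_albums * n_pubs) < 0.2, nrow = n_albums)
gaps[, 1] <- FALSE            # column 1 reviewed every album
scores[gaps] <- NA

# Weighted average per album, using only the publications that scored it.
weighted_avg <- function(w, scores) {
  present <- ifelse(is.na(scores), 0, 1)
  filled  <- ifelse(is.na(scores), 0, scores)
  as.vector(filled %*% w) / as.vector(present %*% w)
}

# Hidden weights, used only to fabricate the target scores for the demo.
true_w <- c(1, 4, 2, 3, 2, 5)
target <- weighted_avg(true_w, scores)

# RMSE plus the three penalties described in the text.
penalized_rmse <- function(w, scores, target) {
  rmse <- sqrt(mean((target - weighted_avg(w, scores))^2))
  sum_penalty    <- 100 * abs(sum(w) - 1)           # weights should sum to 1
  bounds_penalty <- 100 * sum(w > 1 | w < 0.001)    # count of out-of-range weights
  ratio <- max(w) / min(w)
  ratio_penalty  <- if (ratio > 15 || ratio < 2) 100 else 0
  rmse + sum_penalty + bounds_penalty + ratio_penalty
}

# 3-fold cross-validation: fit on two folds, test on the held-out fold.
folds <- sample(rep(1:3, length.out = n_albums))
start <- rep(1 / n_pubs, n_pubs)   # equal weights, i.e. the simple mean

fold_rmse <- sapply(1:3, function(k) {
  train <- folds != k
  fit <- optim(start, penalized_rmse,
               scores = scores[train, , drop = FALSE], target = target[train],
               method = "Nelder-Mead", control = list(maxit = 2000))
  held_out <- weighted_avg(fit$par, scores[!train, , drop = FALSE])
  sqrt(mean((target[!train] - held_out)^2))   # plain RMSE on the test fold
})

cv_rmse <- mean(fold_rmse)   # test rmse averaged across the 3 folds
```

Note that equal starting weights have a max/min ratio of 1, so in this sketch the search begins with the 100-point ratio penalty already applied and has to work its way out of it.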

So, with all this, what is Sputnikmusic’s weight? It depends, but it’s not good.

Vertical line is the Sputnikmusic weight, weights are scaled relative to the maximum publication weight

Top 3 optimizations according to test RMSE. Vertical line is the Sputnikmusic weight, and weights are scaled relative to the maximum publication weight

After filtering down to the top 3 optimization methods, sputnik is either last (in the L-BFGS-B and newuoa optimizations) or 3rd to last (in the CG optimization) amongst the 93 publications. Sputnik is given either around 23%, 14%, or 29% of the weight of the top publication via the CG, L-BFGS-B, or newuoa optimization methods respectively.
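For reference, “weight relative to the top publication” and the ranking can be computed like this. The weights below are made-up numbers for five hypothetical publications, not output from the analysis.

```r
# Hypothetical fitted weights for five publications (made-up numbers).
w <- c(pub_1 = 0.30, pub_2 = 0.12, pub_3 = 0.25, sputnik = 0.06, pub_5 = 0.27)

# Scale so the top publication sits at 1; everything else is a fraction of it.
relative_w <- w / max(w)

# Rank from highest to lowest weight (1 = most influential).
rank_desc <- rank(-w, ties.method = "min")
```

In this toy example `relative_w["sputnik"]` is 0.2, meaning 20% of the top publication's weight, and its rank is last of five.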

There are pretty big limitations of this analysis. This is all heavily assumption based, and metacritic has an incentive to be misleading about their process for creating these averages. There could be more to it than they let on. Even if it were just a simple weighted average, it’s also possible that they do change their weights significantly from time to time, perhaps penalizing/rewarding publications for producing few/many reviews, penalizing/rewarding for giving outlier/inlier scores, or factoring in social media followers or Alexa ranks, etc. Maybe the weights differ per publication by the genre of music or some other feature of an album. The testing RMSEs for the 3 top optimizations are 6.7, 7.2, and 7.6, whereas the RMSE for the simple arithmetic mean has a value of around 11. These weights only modestly improved the accuracy of the averages, so this all may be gobbledygook.

And who cares anyways? It’s all arbitrary, right?

… Right?