Statnik 1

Published: July 19, 2017

Hello world, and welcome to a first-of-its-kind staff blog, one written by someone with no reviews and pedestrian/almost non-existent music taste. I joined the site when I was trying to find something to fall deeply into, and I thought being the only person I knew that liked Led Zeppelin meant that I could become a SERIOUS music listener. Of course, I failed and, besides a real weak stream of bands I like, I don’t listen to much music. As a result, I stopped regularly visiting the site after maybe 6 months of being a regular commenting member. Nevertheless, I returned to the site because I found a different interest.

Now, I’m not an expert in this kind of stuff. I didn’t actually get a degree in the kind of thing that would make one said expert. More than anything I’m a diligent and creative googler, the level at which you can fake expertise. I like data. Since I decided I liked data, I have done SERIOUS data guy stuff. I started to keep track of my stats in video games like COD: BO and Rocket League, and I analyzed this here website. (That’s it. There’s not really a third thing. I tried making a tool to help me with a fantasy football draft once, but people were drafting so fast that it was probably more costly than useful.)

So, as best I can tell, these blogs will go something like this: I’ll write some kind of description of some cool thing I’ve done/am doing with the data on this site, maybe there will be a story of some sort, and then I’ll present code for how to do it. You’ll cheer, you’ll cry, and you’ll learn stuff. I don’t know how regularly I’ll make you cheer/cry/learn, but it will not be never!

The first thing I had to learn before I could make any of my lists was how to grab data (sometimes referred to as data scraping or munging or back-alley mugging) from the website. I initially tried copying the ratings from soundoff pages by right-clicking in Google Chrome, choosing Inspect, and copying the table objects into Excel. But that was tedious and unscalable, so I found out how to do it with my current go-to language, R.

(It’s free! To install R, go to this website https://cran.r-project.org/mirrors.html, download from any mirror you like, and then download this handy IDE https://www.rstudio.com/ to make working with R a breeze rather than a blast-from-the-past-trembling-fear-inducing chore caused by the standard RGUI.)

R is a statistics-focused language. Packages and online code examples are often written for and by college math/stats/information sciences department people. It’s relatively simple to use and follow, and there are a lot of free books, MOOCs, and blog posts on how to use R. But what makes R especially good is that it’s free. It doesn’t cost thousands of dollars like other comparable languages do (Matlab… SAS… Stata). End paid advertisement.

R has a package that lets you load webpage html code as text-like objects. (Some websites don’t let you read their html code with R and presumably with any other language, and I don’t know why or how, but it happens.) It has functions to let you interact with html code, such as finding specific html tags, external html links found on pages, and (a third thing!), importantly, html tables. The rating data, as well as a lot of things on sputnikmusic, are stored in html tables. If you install the packages dplyr (a very cool data manipulation package) and XML (the package to read web html) with the following code:

install.packages(c("dplyr","XML"))

and then run the following code in the R console,


library(dplyr)
library(XML)

# Scrape one soundoff page into a data frame of ratings.
# `obj` is either a soundoff URL (character string) or an already parsed
# HTML document; in the second case `link` must be supplied as well,
# because it fills the albumlink column.
scrape_soundoff <- function(obj, link) {
  doc_classes <- c("HTMLInternalDocument", "XMLInternalDocument",
                   "XMLAbstractDocument")
  if (!any(class(obj) %in% doc_classes)) {
    if (!is.character(obj)) {
      stop(paste0('Input is not an html object or a character string. ',
                  'If calling this function directly, ',
                  'ensure that your input is a character ',
                  'string that is the URL of a sputnikmusic soundoff page.'))
    }
    link <- obj
    if (!grepl('/soundoff.php', obj)) stop('Character string provided is not a soundoff page')
    obj <- htmlParse(obj)
  }

  links <- getHTMLLinks(obj)
  user_links <- grep('/user/', links, value = TRUE)

  if (length(grep('/best/albums/', links)) > 1) {
    # If there's more than one link to best albums, the release year is the
    # final path segment of the last of those links.
    best_links <- grep('/best/albums/', links, value = TRUE)
    release.year <- as.numeric(tail(unlist(strsplit(tail(best_links, 1), '/')), 1))
  } else {
    # Otherwise take it from the second <b> element on the page.
    release.year <- as.numeric(tail(unlist(strsplit(unlist(lapply(xpathSApply(obj, "//b"), xmlToList)[[2]]), '/')), 1))
  }

  # The ratings live in the first html table on the page.
  dat <- readHTMLTable(obj, which = 1)
  if (any(grepl('https:/', user_links))) user_links <- user_links[-grep('https:/', user_links)]
  dat$V2 <- as.character(dat$V2)
  names(dat)[1] <- 'Rating'
  dat <- dat[!is.na(dat$V2), ]
  dat <- dat[c(1, 2)]
  dat$Rating <- as.numeric(substr(dat$Rating, 1, 3))  # keep the leading "4.5"
  dat <- dat[!is.na(dat$Rating), ]
  dat <- dat[2:nrow(dat), ]

  # Split one "user | date" cell into its two parts.
  split_rating <- function(cell) {
    parts <- strsplit(x = cell, split = ' | ', fixed = TRUE)
    data.frame(user = parts[[1]][1], date = parts[[1]][2],
               stringsAsFactors = FALSE)
  }
  dat <- data.frame(dat, bind_rows(lapply(dat$V2, split_rating)))

  # Drop zero ratings, then keep only Rating, user, and date.
  user_links <- user_links[dat$Rating > 0]
  dat <- dat[dat$Rating > 0, ]
  dat <- dat[, c(1, 3, 4)]

  # Convert the site's date strings (month, day with a two-letter suffix,
  # two-digit year, as the parsing below assumes) into Date objects.
  sputdate <- function(dates) {
    if (!is.character(dates)) stop('Not readable')
    s <- character(length(dates))
    for (i in seq_along(dates)) {
      tmpdate <- unlist(strsplit(dates[i], ' '))
      if (length(tmpdate) == 1) {
        s[i] <- NA
      } else {
        mo <- grep(tmpdate[1], month.name)
        da <- as.numeric(substr(tmpdate[2], 1, nchar(tmpdate[2]) - 2))
        ye <- paste0('20', tmpdate[3])
        s[i] <- paste(ye, mo, da, sep = "/")
      }
    }
    as.Date(s, "%Y/%m/%d")
  }
  dat$date <- sputdate(dat$date)

  # Strip trailing whitespace from the listed user names.
  trim.trailing <- function(x) sub("\\s+$", "", x)
  dat$user <- trim.trailing(dat$user)
  dat$userlinks <- substr(user_links, 7, nchar(user_links))  # drop the leading "/user/"
  dat$albumlink <- link
  dat <- data.frame(release.year, dat)
  dat <- dat[order(dat$date, decreasing = TRUE), ]
  return(dat)
}

sputurl <- "https://www.sputnikmusic.com/soundoff.php?albumid=14363"
dat <- scrape_soundoff(sputurl)
print(head(dat))

… you will have read the soundoff page for the critically un-thought-of (seriously, its wiki page is empty) My Fruit Psychobells… A Seed Combustible album by the sputcore band “maudlin of the Well”. Your object, which is named “dat”, short for “data”, will be a table-like object (a data frame) containing the release year of the album, every rating on the soundoff page, the listed name of every user (the one you can edit), their official name (the one that has to be unique and appears in the URL of their profile page), the date of every rating, and the soundoff page link. You can print its contents by typing “dat”, sans quotation marks, into the R console and hitting enter. You can replace the link in the “sputurl” string with whatever soundoff page you want and get the ratings data for that album instead (i.e. sputurl <- "https://www.sputnikmusic.com/soundoff.php?albumid=xxxxx").
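If you want to see the XML functions used in that big function on their own, here is a minimal sketch that runs on a tiny hand-written page instead of the live site. The page contents, the user names, and the helper name read_page are all made up for illustration. The helper reads the page with base R's readLines and hands the text to htmlParse with asText = TRUE. As far as I can tell, htmlParse may fail to open some https addresses directly, so fetching the text first is one possible workaround, not necessarily the only one.

```r
library(XML)

# A tiny hand-written page standing in for a soundoff page (invented data).
toy_html <- '
<html><body>
<a href="/user/exampleuser">Example User</a>
<a href="/best/albums/1999">1999</a>
<table>
<tr><td>4.5 superb</td><td>Example User | July 19th 17</td></tr>
<tr><td>3.0 good</td><td>Another User | June 2nd 16</td></tr>
</table>
</body></html>'

# Hypothetical helper: read a file or URL as text with base R,
# then let htmlParse work on the text rather than on the address.
read_page <- function(src) {
  txt <- paste(readLines(src, warn = FALSE), collapse = "\n")
  htmlParse(txt, asText = TRUE)
}

# Write the toy page to a temporary file so the example needs no network.
tmp <- tempfile(fileext = ".html")
writeLines(toy_html, tmp)
doc <- read_page(tmp)

# Every link on the page, then only the ones pointing at user profiles.
links <- getHTMLLinks(doc)
user_links <- grep("/user/", links, value = TRUE)

# The first html table as a data frame; with no header row the
# columns come back named V1 and V2.
tab <- readHTMLTable(doc, which = 1, header = FALSE, stringsAsFactors = FALSE)

print(links)
print(tab)
```

With a real page you would call read_page(sputurl) instead of read_page(tmp), assuming the site allows the download.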

You can then do things like make a histogram of all the ratings, like the one found on a review page,

hist(dat$Rating,xlab = 'Rating',ylab='Count')
[Figure: histogram of the ratings]
or you can plot each rating as a time series,

with(dat,plot(date,Rating,'l',main='Timeseries'))
[Figure: time series of the ratings]
or a time series with a smoothed trend line.

with(dat,plot(date,Rating,'l',main='Smoothed Timeseries'))
lines(dat$date[!is.na(dat$date)],
loess(Rating~as.numeric(date),dat,span = .5)$fitted,'l',col = 'red')
[Figure: time series of the ratings with a smoothed trend line]
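The smoothed line above relies on a detail that is easy to miss: under R's default settings, loess drops rows with a missing date before fitting, so its fitted values line up with the non-missing dates only. Here is a small sketch on made-up data (the toy data frame and its values are invented, not scraped) that shows the two lengths matching.

```r
set.seed(1)

# Invented ratings on 40 distinct, sorted dates.
toy <- data.frame(
  date   = as.Date("2015-01-01") + sort(sample(0:900, 40)),
  Rating = sample(seq(1, 5, by = 0.5), 40, replace = TRUE)
)
toy$date[c(5, 17)] <- NA  # pretend two dates failed to parse

# Fit the smoother; rows with an NA date are left out of the fit.
fit <- loess(Rating ~ as.numeric(date), data = toy, span = 0.5)
ok <- !is.na(toy$date)

# One fitted value per non-missing date, so the x and y vectors agree.
print(c(fitted = length(fit$fitted), dates = sum(ok)))

plot(toy$date[ok], toy$Rating[ok], type = "l", xlab = "date", ylab = "Rating")
lines(toy$date[ok], fit$fitted, col = "red")
```

A smaller span follows the ratings more closely and a larger one gives a flatter line, so it is worth trying a few values.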
And that’s just the start of what you can do when you have a big imagination… and google.
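As one more example of what the table allows, here is a hedged sketch that counts the ratings and averages them per year with dplyr. It uses a made-up stand-in for dat (the toy rows are invented) with the same Rating, user, and date columns, so it runs without touching the site.

```r
library(dplyr)

# Invented stand-in for the scraped table, with the same column names.
toy <- data.frame(
  Rating = c(4.5, 3.0, 5.0, 4.0, 3.5, 4.5),
  user   = c("a", "b", "c", "d", "e", "f"),
  date   = as.Date(c("2015-03-01", "2015-08-12", "2016-01-20",
                     "2016-06-05", "2017-02-14", "2017-07-19")),
  stringsAsFactors = FALSE
)

# Pull the year out of each date, then count and average within each year.
by_year <- toy %>%
  mutate(year = as.numeric(format(date, "%Y"))) %>%
  group_by(year) %>%
  summarise(n_ratings = n(), mean_rating = mean(Rating))

print(by_year)
```

Swapping toy for the real dat should give the same kind of summary for any album.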

P.S. GitHub link is here for a version of this code that will also add band, album, and genre tag information to the “dat” table.
