Analyse One Year of Radio Station Songs Aired with Apache Spark, Spark SQL, Spotify, and Databricks
This is a guest blog from Paul Leclercq, a data engineer and sports and music lover. It was originally published on his personal blog.
Try this notebook in Databricks
Whenever I drive or code, I listen to music. As this happens a lot, and in order to find new songs, I listen to the radio or to Spotify’s Discover Weekly playlist, which made me like Mondays (because it is released every Monday).
An old-school French institute called Mediamétrie analyzes radio stations’ songs. Ever since I saw their study (which I can’t find anymore) some years ago, I have been obsessed with creating my own.
This article will present the year 2016 for 4 major French radio stations through fun SQL queries. Then we will connect each song to the Spotify API to create the radio stations’ musical profiles.
We will use the Databricks community version to visualize our data. All SQL queries and all results are available in this notebook. It’s the “backstage” of this article, where the magic happens, so to speak.
Protip: don’t miss the bonuses at the end of the article.
Radio stations introduction
We all have a favorite radio station; mine is Radio Nova for its diversity, its humor, and as a hip hop fan, this is the only national radio where we can hear listenable hip hop songs.
Radio Nova had 1.4% of the audience in September 2016 (PDF to download from Mediametrie).
In order to see how a radio becomes number 1, we are also going to analyze NRJ, the number 1 music radio, which has 10.8% of the audience, and 2 others: Virgin (5%), which, as we’ll see, sounds like NRJ, and Skyrock (6%). Don’t mind the name, it’s a rap radio… haha
The main question is: after comparing these radios, should we give Radio Nova tips on how to become number one, based on our analysis of NRJ? What do you say, Nova? Learn from the best, right?!
Getting the Radio’s songs data
In order to extract the song lists (artist, song title and timestamp), we are going to parse each radio’s “What was this song?” HTML pages, except for Skyrock, which has a handy RESTful web service.
Every song extracted will be converted into this Song class, to query them easily with (Spark) SQL:
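The original class definition did not survive on this page. As a rough idea only, here is a minimal Python sketch of such a record. The field names (`radio`, `artist`, `title`, `timestamp`) are my assumption, taken from the values the text says were extracted, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Song:
    """One broadcast of one song (field names are assumptions)."""
    radio: str           # station name, e.g. "nova"
    artist: str
    title: str
    timestamp: datetime  # when the song was aired


# Placeholder values, not real broadcast data.
song = Song("nova", "Some Artist", "Some Title", datetime(2016, 7, 1, 14, 30))
```

Keeping one flat record per broadcast is what makes the later questions (per day, per radio, per song) simple GROUP BY queries.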
In 2016, about 300K broadcasts were collected:
- Nova: 95K broadcasts of 5000 different songs
- NRJ: 50K broadcasts of 800 different songs
- Virgin: 60K broadcasts of 1200 different songs
- Skyrock: 100K broadcasts of 1000 different songs
Every song is stored in Parquet format, so that the data is extracted only once (you’re welcome, radio servers :p) and SparkSQL queries run faster. Btw, if you are interested in the file, I can export it for you in CSV or Parquet.
Remember that if you have to use the same SQL table (or Dataset/DataFrame) again and again, the best way to speed up queries (the Spark doc says often by more than 10x) is to cache the table in memory with the dataframe.cache() method. (Thanks Databricks for the 6 GB RAM server!)
Let’s dive into our analysis now!
How many songs by day?
Some days were not recorded by the radios’ history system, so the real numbers should be a bit higher.
Fun to see that both radio stations broadcast more songs during summer (if we do not take into consideration the one-week bug of Radio Nova, in blue, in August). This is certainly due to summer holidays. They do a good job all year long, so it’s OK to take some days off, I guess!
We can see that Skyrock and Nova broadcast about the same number of songs each day, whereas NRJ and Virgin broadcast a bit fewer, certainly due to more talk shows or untracked DJ night shows.
How many different songs by day?
The real difference comes from the number of different songs played; see for yourself the number of different tracks per day:
More mainstream radios such as NRJ, Virgin and Skyrock top out at 100 to 120 different songs a day, whereas Nova plays about 280. If you want to discover more songs, Nova is clearly the place.
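To give an idea of what the "songs by day" and "different songs by day" queries can look like, here is a small sketch. It uses Python's built-in sqlite3 as a stand-in for Spark SQL, so the dialect differs slightly, and the table name, column names and rows are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (radio TEXT, artist TEXT, title TEXT, aired_at TEXT)")
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [  # made-up rows
        ("nova", "Artist A", "Song 1", "2016-07-01 08:00:00"),
        ("nova", "Artist B", "Song 2", "2016-07-01 09:00:00"),
        ("nova", "Artist A", "Song 1", "2016-07-01 18:00:00"),
        ("nrj", "Artist C", "Song 3", "2016-07-01 08:00:00"),
        ("nrj", "Artist C", "Song 3", "2016-07-01 10:00:00"),
    ],
)

# Total broadcasts vs. distinct (artist, title) pairs, per radio and per day.
query = """
SELECT radio,
       date(aired_at) AS day,
       COUNT(*) AS broadcasts,
       COUNT(DISTINCT artist || ' - ' || title) AS different_songs
FROM songs
GROUP BY radio, date(aired_at)
ORDER BY radio, day
"""
per_day = conn.execute(query).fetchall()
```

The gap between `broadcasts` and `different_songs` is exactly what separates Nova from the more mainstream radios.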
How many different songs by month?
If we look at the number of different songs per month, the gap between radios is even bigger.
Top 10 played titles by each radio station
It’s interesting to see how “hits” are played through the year.
We can notice summer hits: Kaytranada for Nova, Enrique Iglesias for NRJ, Kent Jones and Drake for Skyrock, and Imany and Kungs for Virgin. Also, the most broadcast songs are mostly aired during summer.
Radios tend to broadcast more songs during summer. So artists play smart here and release their songs between February and June to have a better chance of becoming number one, or of having more people hate their music because they heard it too many times?
Nova
NRJ
Skyrock
Virgin
Percentage of music by day
If we take the average number of songs broadcast per day and the mean duration of a song, 3.30 minutes, we can estimate the percentage of music per day. The remaining percentage is likely to be talk shows, advertising or untracked songs.
To better understand these percentages, we should see what a normal day looks like for our analyzed radio stations.
What is a typical Monday for our radio stations?
Let’s have a look at the average number of songs played on Mondays for all radio stations.
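The arithmetic behind this estimate is short enough to sketch. I am assuming here that "3.30 minutes" means 3 minutes 30 seconds (3.5 minutes), and the songs-per-day figure is illustrative, not a number from the dataset.

```python
MINUTES_PER_DAY = 24 * 60
AVG_SONG_MINUTES = 3.5  # assumption: "3.30 minutes" read as 3 min 30 s


def music_share(songs_per_day: float, song_minutes: float = AVG_SONG_MINUTES) -> float:
    """Estimated percentage of a day filled with tracked music."""
    return 100 * songs_per_day * song_minutes / MINUTES_PER_DAY


# 288 songs a day is a made-up input, chosen only to show the calculation.
share = music_share(288)
```

With these assumptions, 288 songs a day would mean music for about 70% of the day.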
We can distinguish 2 gaps during the morning and evening shows for every radio station. Amazing. More seriously, no discovery here; it’s a known fact that most radios have morning and evening shows during which there is less music and more talk.
Advertising time
If we recalculate the average percentage of music at noon, when there are no shows for all radio stations, we can estimate the percentage of advertising by radio by hour. We estimate that the radio hosts speak 5 minutes during the whole hour. We have to note that radios may advertise more during prime time when they have a larger audience.
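Put as a formula, the advertising estimate is what is left of the hour once music and host talk are removed. A minimal sketch, again assuming a 3.5-minute average song (my reading of "3.30 minutes") and using the 5 minutes of talk estimated above; the songs-per-hour input is hypothetical.

```python
def ad_minutes_per_hour(songs_per_hour: float,
                        song_minutes: float = 3.5,    # assumption: 3 min 30 s per song
                        talk_minutes: float = 5.0) -> float:  # host talk, as estimated in the text
    """Minutes of an hour not explained by music or host talk."""
    music_minutes = songs_per_hour * song_minutes
    return max(0.0, 60 - music_minutes - talk_minutes)


# 12 songs in the noon hour is an illustrative value.
ads = ad_minutes_per_hour(12)
```

Anything not tracked by the radios' history pages (jingles, untracked songs) ends up counted as advertising, so this is an upper bound rather than a measurement.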
For 60 minutes, we get from 7 minutes of advertising time, for Skyrock, to 15 minutes, for Virgin. In detail, we have this table:
Radios brainwashing?
An annoying feeling we sometimes have with radios is that we keep hearing the same songs over and over. As we are men and women who believe in science and not in our instinct, we are going to use basic statistics to verify this weird feeling.
How many times is the same song aired on the same day?
The pie charts below tell us a lot about radio stations’ habits: more mainstream radios such as Virgin, NRJ or Skyrock are more likely to broadcast the same song multiple times.
When is the next time we will listen to the same song during the same day?
Again, the most mainstream radios, NRJ, Skyrock and Virgin, most often rebroadcast the same song 2 to 3 hours after it was first aired. For Nova, it is more like 7 to 8 hours.
While the distributions differ, the average for our 4 radios is between 7 and 8 hours.
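One way to measure this is to group broadcasts by radio, song and day, then look at the time between consecutive airings. The sketch below does that in plain Python. The tuple layout is an assumption, and measuring consecutive airings (rather than always counting from the first airing of the day) is my choice, not necessarily what the notebook does.

```python
from collections import defaultdict
from datetime import datetime


def same_day_gaps_hours(broadcasts):
    """broadcasts: iterable of (radio, artist, title, aired_at) tuples.

    Returns the hours between consecutive airings of the same song
    on the same radio and the same calendar day.
    """
    by_song_and_day = defaultdict(list)
    for radio, artist, title, aired_at in broadcasts:
        by_song_and_day[(radio, artist, title, aired_at.date())].append(aired_at)

    gaps = []
    for times in by_song_and_day.values():
        times.sort()
        for previous, current in zip(times, times[1:]):
            gaps.append((current - previous).total_seconds() / 3600)
    return gaps


# Made-up broadcasts: one song aired three times on one day, once the next day.
sample = [
    ("nrj", "Artist C", "Song 3", datetime(2016, 7, 1, 8, 0)),
    ("nrj", "Artist C", "Song 3", datetime(2016, 7, 1, 10, 30)),
    ("nrj", "Artist C", "Song 3", datetime(2016, 7, 1, 18, 30)),
    ("nrj", "Artist C", "Song 3", datetime(2016, 7, 2, 9, 0)),
]
gaps = same_day_gaps_hours(sample)
```

The length of each per-day group also answers the previous question (how many times the same song is aired on the same day).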
How many new songs are added and when?
“New songs” means songs that had not yet been broadcast in 2016.
If we look at the average after April 2016, we see that Nova is ahead, but don’t forget that Nova plays 2500 different songs each month, so it’s normal, statistically speaking.
New songs are distributed evenly across the week for all radios.
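With this definition, finding new songs comes down to keeping the first airing of each song per radio and counting those first airings per month. A minimal sketch, with an assumed tuple layout and invented rows:

```python
from collections import Counter
from datetime import date


def new_songs_per_month(broadcasts):
    """broadcasts: iterable of (radio, artist, title, day) tuples, in any order.

    A song counts as new for a radio in the month of its first airing.
    """
    first_seen = {}
    for radio, artist, title, day in broadcasts:
        key = (radio, artist, title)
        if key not in first_seen or day < first_seen[key]:
            first_seen[key] = day
    return Counter(
        (radio, day.year, day.month) for (radio, _, _), day in first_seen.items()
    )


# Made-up rows: "Song 1" is new in January, "Song 2" in February.
sample_new = [
    ("nova", "Artist A", "Song 1", date(2016, 2, 1)),
    ("nova", "Artist A", "Song 1", date(2016, 1, 5)),
    ("nova", "Artist B", "Song 2", date(2016, 2, 3)),
]
new_counts = new_songs_per_month(sample_new)
```

Since the data starts in 2016, nearly every song looks new in the first months, which is presumably why the average above is only taken after April.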
Common songs between radio stations
In the table below, we can see that NRJ shares 25% of its songs with Virgin and 12% with Skyrock.
Virgin shares 18% of its songs with NRJ, while Skyrock shares 9% with NRJ.
Nova has a few songs in common with the other radios; they are mostly from legendary artists such as Bob Marley, Daft Punk, Aloe Blacc, Kavinsky, Beyoncé… If you are interested in the full list, look for the “Similar songs between radios” cell in the “backstage” AKA the blog article’s Databricks notebook.
Our 4 radio stations are different, for sure, but do they have common songs among them? Surprisingly the answer is yes.
I would classify these songs as songs that everybody likes; you can play them at your party without any stress of being booed.
If we use a visualization for our previous table, it will look like this: the blue bar is the similar songs, the orange and the green bar are the total of different songs.
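The percentages are not symmetric (NRJ to Virgin differs from Virgin to NRJ), which is what you would expect if each one is computed relative to that radio's own catalogue. Here is a small sketch under that assumption, with songs represented as made-up (artist, title) pairs:

```python
def shared_percentage(songs_a: set, songs_b: set) -> float:
    """Percentage of radio A's distinct songs that radio B also played."""
    if not songs_a:
        return 0.0
    return 100 * len(songs_a & songs_b) / len(songs_a)


# Invented catalogues: 4 songs vs. 5 songs, with 1 song in common.
radio_a = {("artist 1", "song 1"), ("artist 2", "song 2"),
           ("artist 3", "song 3"), ("artist 4", "song 4")}
radio_b = {("artist 1", "song 1"), ("artist 5", "song 5"), ("artist 6", "song 6"),
           ("artist 7", "song 7"), ("artist 8", "song 8")}

a_with_b = shared_percentage(radio_a, radio_b)
b_with_a = shared_percentage(radio_b, radio_a)
```

The same shared song weighs more for the radio with the smaller catalogue, so the two directions give different numbers.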
What are the secrets to being #1?
We have analyzed 4 radio stations based on the artist name, the title, and the day and time the songs were broadcast. Besides being letters and numbers, these 3 values mean nothing. If we want to make a deeper analysis, we have to learn more about the songs played: How popular is the song right now? What is the genre of the song? How many followers does the artist have?
Fortunately, by connecting each song to the Spotify API we will get a lot of data we can play with:
In 2016, we have collected 8000 different songs from the radios, so to get the artist, the track and the tracks’ audio features from the Spotify API we have to make:
That’s a lot. Plus, Spotify rate-limits requests, so we have to go slowly: 20 requests every 2 seconds, why not.
BUT, one thing I didn’t plan for with this slow rate is that an artist’s number of followers could change between two requests. As most artists have multiple songs being broadcast, the artist information was requested from 2 to 10 times. No problemo, right? No… This messed up our SQL join between artist and track data later, because the DISTINCT on artist information no longer removed the duplicates: the rows differed in followers.total.
I have to say this drove me crazy, because I had more songs after my join than before haha
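Pacing requests at this rate can be done with a few lines. The sketch below only handles the batching and the pauses; it makes no network call, and the function name and the injectable `sleep` parameter are my own choices for illustration, not part of any Spotify client.

```python
import time


def throttled_batches(items, batch_size=20, pause_seconds=2.0, sleep=time.sleep):
    """Yield items in batches of batch_size, pausing between batches."""
    for start in range(0, len(items), batch_size):
        if start > 0:
            sleep(pause_seconds)  # wait before every batch except the first
        yield items[start:start + batch_size]


# Dry run with a fake sleep that records pauses instead of waiting.
pauses = []
batches = list(throttled_batches(list(range(45)), sleep=pauses.append))
```

In real use, each batch would be handed to whatever function performs the API calls, and the default `time.sleep` would do the actual waiting.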
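To make the problem concrete: a DISTINCT over whole rows treats two snapshots of the same artist as different rows as soon as the follower count moves. One possible fix, sketched here with invented rows and assumed field names, is to deduplicate on the artist id alone before joining.

```python
def dedupe_artists(artist_rows):
    """Keep one row per artist id, ignoring differences in other fields."""
    latest = {}
    for row in artist_rows:
        latest[row["id"]] = row  # a later response overwrites an earlier one
    return list(latest.values())


# Made-up API responses: artist "a1" was fetched twice, with a different follower count.
rows = [
    {"id": "a1", "name": "Artist A", "followers_total": 1000},
    {"id": "a1", "name": "Artist A", "followers_total": 1002},
    {"id": "a2", "name": "Artist B", "followers_total": 50},
]

# What a DISTINCT over all columns sees: three different rows.
naive_distinct = {tuple(sorted(row.items())) for row in rows}

# Deduplicating on the id leaves one row per artist, so a join no longer multiplies songs.
deduped = dedupe_artists(rows)
```

Keeping the last response is arbitrary; keeping the first, or the maximum follower count, would work just as well for the join.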
Songs Popularity By Radio
Definition by Spotify
The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays were.
No surprises here, mainstream radios NRJ, Virgin or Skyrock, tend to play more popular songs; that’s why I use the term mainstream, clever, right?
But the real question is: was the song popular before it was broadcast on the radio?
Audio features
The Spotify API gives audio features extracted from the song’s sound waves. Thanks to these, we can display a musical profile of each radio:
In my opinion, the most meaningful audio features are:
- danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
- energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
- valence describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
Let’s see their average and their distribution among the radios’ tracks, as an average alone can sometimes be misleading. As Nova has more different songs than the others, we are going to use percentages to compare our radios and add more context to our stats.
If you have read my Facebook Interview Journey, you know this is where I failed during my SQL interview. This code is specially for you, dear Mr. Interviewer, no hard feelings though :p
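The notebook's own query is not reproduced on this page. As a rough stand-in, here is how a "percentage of each radio's tracks per energy bucket" query can be written. It runs on Python's built-in sqlite3 rather than Spark SQL, and the table, columns and values are all invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (radio TEXT, title TEXT, energy REAL)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?)",
    [  # made-up energy values
        ("nova", "t1", 0.35), ("nova", "t2", 0.38),
        ("nova", "t3", 0.72), ("nova", "t4", 0.91),
        ("nrj", "t5", 0.74), ("nrj", "t6", 0.78),
    ],
)

# Bucket energy into 0.1-wide bins, then divide each bin's count by the
# radio's total number of tracks to get a percentage.
query = """
SELECT t.radio,
       CAST(t.energy * 10 AS INTEGER) / 10.0 AS bucket,
       ROUND(100.0 * COUNT(*) / totals.n, 1) AS pct
FROM tracks t
JOIN (SELECT radio, COUNT(*) AS n FROM tracks GROUP BY radio) totals
  ON totals.radio = t.radio
GROUP BY t.radio, totals.n, bucket
ORDER BY t.radio, bucket
"""
distribution = conn.execute(query).fetchall()
```

Dividing by each radio's own total is what makes a radio with 5000 songs comparable to one with 800.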
Energy
Mainstream radios tend to play more energetic songs. I guess they are easier to listen to? Some examples of songs with a lot of energy are We Are Your Friends - JUSTICE, Steppin’ Stone - Davy Jones, and of course, the classic Jerk It Out - Caesars, which I first heard while playing SSX3 on GameCube 8)
Danceability
This chart tells us both radios broadcast the same kind of danceable songs. Some examples of danceable songs are Trick Me - Kelis, Around the World - Daft Punk or Anaconda - Nicki Minaj.
Valence / Positiveness
As with danceability, both radios broadcast the same kind of positive tracks. Some examples: September - Earth Wind & Fire, Ska-Boo-Da-Ba - The Skatalites, or Hey Ya! - OutKast.
Two other interesting data points, which are not Spotify (Echo Nest) specific, are the BPM (beats per minute) and the song duration.
Tempo / Beats per minute
Duration
Nova seems to be a bit different from the other radios by playing shorter or longer tracks. Virgin, NRJ and Skyrock are really into 3-minute tracks.
When I first saw this graph, I couldn’t help thinking about the Hocus Pocus song called “Voyage Immobile” (motionless journey) and this sentence about our un-diversified musical environment:
“Je ne voyais que blocs longs de 3 minutes taillé dans le roc et dans le même but”
“I could only see 3-minute blocks from the same base with the same goal”
Music genres
Spotify has some pretty weird music genres. Have you noticed “post-teen pop” or “pop christmas”? Pop songs you listen to during Christmas, I guess? haha
We can clearly see that NRJ and Virgin, which are very alike, are more about pop/dance/electro music; their top 3 genres are: pop, dance pop and tropical house. Nova is about soul, funk and indie music, and Skyrock is more about rap, dance and pop.
Hip hop genres
Skyrock is famous for its motto “1st on Rap”. Let’s compare Hip hop/Rap genres (genres with “rap”, “hip” or “hop” inside the name) with the other radios.
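The filter described above is a plain substring match on the genre name. A minimal sketch (the function name is mine):

```python
HIP_HOP_MARKERS = ("rap", "hip", "hop")


def is_hip_hop_genre(genre: str) -> bool:
    """True if the genre name contains "rap", "hip" or "hop"."""
    name = genre.lower()
    return any(marker in name for marker in HIP_HOP_MARKERS)


hip_hop_genres = [g for g in ["french hip hop", "trap francais", "tropical house", "dance pop"]
                  if is_hip_hop_genre(g)]
```

Note that a substring match also catches "trap" (it contains "rap"), which is probably wanted here, but it would equally catch any unrelated genre name that happens to contain those letters.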
OK, that’s a close match between Skyrock and Nova. Let’s compare the internal hip hop genres now.
I don’t really care about genres, but there is a lot of confusion between Hip hop, which is a culture, and rap, which is the actual act of rapping. If you want to learn more, check this Wikipedia Chapter. I also recommend the excellent Netflix documentary “Evolution of Hip Hop”.
Nova, in orange, is more about indie/alternative/underground hip hop music, and Skyrock, in blue, is really more into French rap/trap/hiphop and also popular rap. So let’s fix Skyrock’s motto to “1st on French rap” haha.
Music classifier for Radios’ selection idea
In my last article, I explained how to create your own music recommendation system thanks to these audio features.
A fun project (the link is a tribute to the Scala Guru Martin Odersky, who tends to say too many times that his Scala exercises are fun whereas they are brain-melting haha) would be to create an algorithm that helps music selectors find songs that fit a radio’s style.
Spotify recommendation system
Spotify’s system is not only based on the audio features we saw earlier. It also analyzes what other similar users listen to. This slide contains a nice schema that explains their whole system.
What’s next?
Thanks to this project, I have built solid foundations to query the Spotify API in Scala, process the data with Spark SQL, and visualize it with Databricks. I think more projects are about to come. Plus, in March 2017 Spotify released a new “Recently Played Tracks” endpoint, and ideas are coming.
Databricks pros and cons
Pros
- Free community edition with a 6 GB RAM server
- Awesome and easy-to-use Data Viz
Cons (or rather, what could be better)
- Can only visualize a maximum of 10 elements when using a GROUP BY; the other elements go into one category called “Others”
- Not possible to choose the color of an entity, so a radio can be blue on one graph and red on another, which can sometimes be confusing
- Cannot export a graph as an iframe, so we have to export pictures from the interactive graphs
- Being able to modify SQL from the Data Viz interface
Thanks
Databricks, for their awesome platform.
Spotify, for their easy-to-use API and their human-readable documentation
Radio Nova for being a top music selecta, I would not listen to the same music that I listen to today without you.
Marc H’LIMI, Radio Nova’s advisor, for our exchanges
Pierre Trussart, engineer and DJ, Benjamin Thuillier, Scala rockstar, Nicolas Duforet, data science master, Justine Mouron, engineer.
My friends, for listening to me talk about this project too often.
Bonus — Spotify Playlists
To thank you for reading, I created 4 playlists of the ~200 most broadcast songs, sorted by number of broadcasts, for: