Visualizing vernacular variety in various vinyls

Hello, I’m Jack, a developer at Datawrapper. It’s Thursday again, so here comes a new musical Weekly Chart!

This week I took inspiration from two sources: my colleague Vivien’s chart from a couple of weeks ago, where she analyzed lyrics from the band TEMMIS, and Matt Daniels’ superb article “The Largest Vocabulary In Hip Hop” where he ranked rappers by the number of unique words used in their lyrics.

I wanted to try to visualize not only the absolute size of my favorite musicians’ vocabulary, but also how unique their lyrics are relative to other artists’.

Getting the data

As a big fan of concept albums, I decided to analyze the lyrics on a per-album basis. My first step was to find my 200 most listened to albums using the LastFM API. I then filtered out singles, instrumental albums, and albums in languages other than English. Next, using the Genius API, I fetched the full lyrics for each remaining album.

Processing the data

The lyrics then needed to be processed:

  • Convert to lowercase: To avoid The and the being counted as different words.
  • Remove Genius metadata: The lyrics from Genius contain information like the number of contributors, placeholders for embeds, and section headings ([Verse 1], [Chorus], …).
  • Undrop the ‘G’: Depending on the flow and accent of the artist, the ‘G’ will sometimes be dropped at the end of verbs and replaced with an apostrophe (ex: “formulating” becomes “formulatin’”). To avoid counting these as different words, words ending in “-in’” are replaced with “-ing”.
  • Remove stopwords and punctuation: To limit the size of the dataset, I removed the most commonly occurring words (“the”, “to”, …), single-character words, and punctuation.
  • Convert numbers to words: To avoid “100” and “one hundred” being counted as different words.
  • Lemmatization and stemming: To reduce the words to their roots (ex: “studying” and “studies” both become “study”).
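
A rough Python sketch of a few of these steps is shown here. It is not the exact code behind the chart: the function name clean_lyrics and the tiny STOPWORDS set are made up for illustration, and number conversion and lemmatization are left out because they would typically rely on extra libraries.

```python
import re

# Tiny illustrative stopword list (an assumption; a real list is much longer).
STOPWORDS = {"the", "to", "a", "and", "of", "in", "it", "is"}


def clean_lyrics(raw: str) -> list[str]:
    """Return a list of cleaned word tokens from raw lyrics text."""
    # Lowercase so that "The" and "the" count as the same word.
    text = raw.lower()
    # Drop section headings such as [Verse 1] or [Chorus].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Normalize curly apostrophes to straight ones.
    text = text.replace("’", "'")
    # Undrop the G: a word ending in "in'" gets "ing" instead.
    text = re.sub(r"(?<=[a-z])in'(?![a-z])", "ing", text)
    # Keep runs of letters/digits (with an optional in-word apostrophe);
    # everything else, including punctuation, is discarded.
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)
    # Remove stopwords and single-character words.
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]


words = clean_lyrics("[Verse 1]\nThe cat is formulatin’ a plan, the plan!")
print(words)
```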

Analyzing the data

Now that the data is clean, I can start analyzing it.

First, I counted the number of non-repeated words in each album, only counting a word if it hadn’t already appeared in that album. Dividing this by the total number of words in the album shows how varied the lyrics are within the album. Let’s call this measure variety; albums with higher variety spend less time repeating themselves.
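
As a minimal sketch, assuming an album has already been reduced to a flat list of cleaned words, variety could be computed like this (the function name is just for illustration):

```python
def variety(words: list[str]) -> float:
    """Share of an album's words that are distinct (0.0 for an empty album)."""
    if not words:
        return 0.0
    # A set keeps each word once, i.e. the "non-repeated" words.
    return len(set(words)) / len(words)


# Three distinct words out of four in total.
print(variety(["cat", "formulating", "plan", "plan"]))  # 0.75
```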

I then counted all the words that appeared only in that album (and not in any other from my top 200) and divided that by the number of non-repeated words. Let’s call this uniqueness; albums with higher uniqueness spend more time saying things no one else has said. I plotted the albums with variety on the vertical axis and uniqueness on the horizontal. (To keep the chart from getting too crowded, only the top 100 albums are actually shown.)
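
Uniqueness can be sketched in a similar way. The example assumes a dictionary mapping each album title to its list of cleaned words; the function name and the toy data are hypothetical, not taken from the real dataset.

```python
from collections import Counter


def uniqueness(albums: dict[str, list[str]]) -> dict[str, float]:
    """For each album: words found in no other album / its distinct words."""
    # Distinct words per album.
    vocab = {name: set(words) for name, words in albums.items()}
    # For every word, the number of albums it appears in.
    album_count = Counter(w for v in vocab.values() for w in v)
    scores = {}
    for name, v in vocab.items():
        only_here = sum(1 for w in v if album_count[w] == 1)
        scores[name] = only_here / len(v) if v else 0.0
    return scores


toy = {
    "Album A": ["cat", "plan", "plan", "kirby"],
    "Album B": ["cat", "money"],
}
print(uniqueness(toy))
```

In the toy data, "cat" appears in both albums, so it counts toward neither album's uniqueness.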

What I found

Interestingly, the albums from a given artist tend to be close to each other on both axes, showing consistency in their lyrical style. Hip-hop albums (the diamond-shaped symbols) tended to be higher on both metrics than other genres, with "Get Rich Or Die Tryin’" by 50 Cent being the only rap album with no unique lyrics.

As I expected, Aesop Rock ends up at the top of both rankings. With his famously verbose and high-concept albums (including songs about politics, murders in the 80s, therapy, his cat Kirby, and everything in between), it’s unsurprising that those lyrics are so unique. Lupe Fiasco and Sa-Roc, with their dense political lyrics, are similarly close to the top-right. Surprisingly, although 2Pac’s albums "All Eyez on Me" and "Better Dayz" have the most total lyrics, they have the lowest variety of any rap albums in the dataset.

Pink Floyd's "The Wall" ended up with a much higher uniqueness score than I expected; I took a closer look and realized this was because of one song, "Empty Spaces." It contains a dialog with backwards speech, which makes for some very unique words…

I hope you enjoyed visualizing some vernacular variability with me. Check in next week for a new Weekly Chart from Julian.