Using Machine Learning to Attempt to Determine a Song’s Genre by the Lyrics

By Michael Smith

The Basics

Machine learning has many uses throughout the cybersecurity field: spam detection, malware detection, and analyzing network traffic to determine whether an attack is occurring. Research into machine learning and artificial intelligence continues steadily, both to improve existing algorithms and to find new uses for these tools.

Within the scope of machine learning, there are a variety of methods for analyzing a dataset in order to learn from previous data and label new incoming data. The list includes, but is not limited to, the following:

  • Linear Regression
  • Logistic Regression
  • Decision Trees
  • Random Forests
  • Support Vector Machines
  • Naïve Bayes
  • K-Nearest Neighbors
  • K-Means

Each of these algorithms takes a different approach to analyzing previously obtained data (the training set), then uses the knowledge gained from that data to infer information about new incoming data. Each algorithm has its own strengths and weaknesses, such as handling anomalous data well or struggling with data at the extremes. For example, given a dataset containing both spam and ham messages, one can train a Naïve Bayes classifier on the words contained in those messages, and the classifier will be able to determine whether a new incoming message is spam or ham based on the training set.
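To make the spam/ham idea concrete, here is a minimal sketch of that kind of word-frequency classifier. The tiny training set and the equal class priors are made up purely for illustration; the real classifier used later in this post works the same way but on song lyrics.

```python
# Minimal sketch of the spam/ham idea: count word frequencies per class,
# then score a new message by multiplying per-class word probabilities.
from collections import Counter

train = [("spam", "win a free prize now"),
         ("spam", "free prize click now"),
         ("ham", "are we still meeting tomorrow"),
         ("ham", "see you tomorrow at lunch")]

counts = {"spam": Counter(), "ham": Counter()}
for label, text in train:
    counts[label].update(text.split())

vocab = set(counts["spam"]) | set(counts["ham"])

def score(text, label):
    total = sum(counts[label].values())
    p = 0.5  # equal class priors, an assumption for this toy set
    for w in text.split():
        # Laplace smoothing so an unseen word doesn't zero the product
        p *= (counts[label][w] + 1) / (total + len(vocab))
    return p

msg = "free prize tomorrow"
label = "spam" if score(msg, "spam") > score(msg, "ham") else "ham"
print(label)  # "free" and "prize" pull the score toward spam
```

The key mechanic, word counts per class multiplied together with smoothing, is exactly what the lyrics classifier below does with Rock and Pop in place of spam and ham.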

The Question

The question I asked myself when trying to work machine learning into this research was: “What is something I would not ordinarily associate with machine learning, but that might be able to use it?” From this question, I landed on the following: how accurate would a machine learning algorithm, in this case a Naïve Bayes classifier, be when attempting to determine the genre of a song based solely on its lyrics? Can Naïve Bayes take a song’s lyrics and correctly determine the genre from them alone?

The Experiment

To begin this test, it is important that the dataset used for training has the correct type of data. For this we perform what is called “feature selection”: the process of including only the features, or variables, that we determine to be important. The dataset used for this experiment is the Million Song Dataset. When all of the data is combined, we are given a set that includes the track ID, the lyrics list (in popularity order), the genre, and finally the name of the song. Using the aforementioned feature selection, we cut these features down to the absolutely necessary parts: lyrics and genre. In a real application, the song name could be added back as a way to identify songs outside of the experiment, but for this test I mainly needed the accuracy, so it was left out. To begin with, I will check whether songs from two different genres can be distinguished. Each set of lyrics is entered into a dictionary based on the song’s genre, and those dictionaries are used in the frequency calculation for the Naïve Bayes classifier.
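The feature selection step described above can be sketched as follows. The row layout and the sample values here are assumptions for illustration only, not the actual dataset format:

```python
# Hypothetical illustration of the feature selection step: each combined
# row carries (trackID, lyrics, genre, song name), and we keep only the
# two features the classifier needs.
rows = [
    ("TR0001", "love baby love tonight", "Pop", "Some Pop Song"),
    ("TR0002", "fire road night ride", "Rock", "Some Rock Song"),
]

# Keep only (genre, lyrics); drop trackID and song name
selected = [(genre, lyrics) for (_, lyrics, genre, _) in rows]

# Group lyrics into per-genre word dictionaries for the classifier
genre_words = {}
for genre, lyrics in selected:
    counts = genre_words.setdefault(genre, {})
    for word in lyrics.split():
        counts[word] = counts.get(word, 0) + 1

print(genre_words["Pop"]["love"])  # "love" appears twice in the Pop row
```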

The Results

[Figure: word-frequency charts for the top words in each genre; image and source link not preserved]

The dataset was split into smaller sections before being used. While the frequencies in the charts above use different values than the ones I obtained, the words included are largely the same.

The important thing to note in the charts above is that some words appear almost exclusively in certain genres; for example, “diddley” only appears in R&B, while “love” appears in almost every genre. This means that words like “diddley” carry higher weight than words like “love”. This is shown through the Naïve Bayes equation:
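The equation itself was embedded as an image that has not survived; the standard form of the Naïve Bayes decision rule it refers to is:

```latex
P(C \mid w_1, \dots, w_n) \;\propto\; P(C) \prod_{i=1}^{n} P(w_i \mid C)
```

Here $C$ is the class (genre), $w_1, \dots, w_n$ are the words in the song, $P(C)$ is the class prior, and $P(w_i \mid C)$ is the per-class word probability estimated from the training frequencies. A word like “diddley” has a large $P(w_i \mid C)$ for one genre and a near-zero one elsewhere, so it moves the product far more than a word like “love” that is common in every class.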

In my final version of the code, I ended up comparing only Rock vs. Pop songs, as they were two of the most populous genres. While I could have created a classifier covering all of the included genres, my knowledge of machine learning is still at a basic level, and building a classifier at that scale would have taken much longer. I will go into more detail on this in the conclusion.

The first test ran a total of 1000 songs through the experiment. The results for the first run were as follows (Rock was considered the positive class, and Pop the negative):

TP (True Positive):  Rock marked as Rock                    311
FP (False Positive): Pop marked as Rock                     283
TN (True Negative):  Pop marked as Pop                      125
FN (False Negative): Rock marked as Pop                     281
Accuracy:  (TP+TN)/(TP+FP+TN+FN), overall correct labels    43.6%
Precision: TP/(TP+FP), accuracy of positive IDs             52.36%
Recall:    TP/(TP+FN), accuracy of identifying Rock         52.53%

Here we can see the results, with each term defined as follows. True Positives (TP) are Rock songs accurately labeled as Rock. False Positives (FP) are Pop songs inaccurately labeled as Rock. True Negatives (TN) are Pop songs accurately labeled as Pop. False Negatives (FN) are Rock songs inaccurately labeled as Pop. Accuracy is the percentage of all labels the classifier got right. Precision is the fraction of songs labeled Rock that actually were Rock. Recall is the fraction of actual Rock songs that were labeled Rock.
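As a sanity check, the three metrics follow directly from the four counts in the table above:

```python
# Recompute the reported metrics from the confusion-matrix counts
tp, fp, tn, fn = 311, 283, 125, 281

accuracy = (tp + tn) / (tp + fp + tn + fn) * 100
precision = tp / (tp + fp) * 100
recall = tp / (tp + fn) * 100

print(round(accuracy, 2))   # 43.6
print(round(precision, 2))  # 52.36
print(round(recall, 2))     # 52.53
```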

Obviously, these results aren’t comforting. An accuracy of 43.6% means that fewer than half of the songs are being labeled correctly. I tried running the program with varying parameters, such as the Laplace smoothing value, but the accuracy stayed below 50% in most cases, only peaking above it in a select few scenarios.


Overall, my main thought as to why this particular experiment did not work is the similarity in lyrical composition between Rock and Pop. If you compare the top 10 word choices of R&B to something like indie, you get standout keywords such as “diddley”; with Rock and Pop, however, the top 10 words are largely the same. This means that when a song is input for the test, the scores for the positive and negative classes end up extremely close, which often leads to incorrect outcomes.
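The overlap problem can be made concrete: when the most frequent words of the two classes are nearly identical, they contribute almost the same factor to both scores and carry little signal. The counts below are hypothetical, purely for illustration:

```python
# Hypothetical top-word counts; real values would come from the dataset
from collections import Counter

rock = Counter({"love": 90, "baby": 70, "night": 60, "yeah": 50})
pop = Counter({"love": 95, "baby": 80, "night": 55, "oh": 45})

rock_top = {w for w, _ in rock.most_common(3)}
pop_top = {w for w, _ in pop.most_common(3)}

overlap = rock_top & pop_top
print(sorted(overlap))  # the shared top words dominate both genres
```

When the intersection covers most of each top list, as it does here, the per-song products for Rock and Pop stay close together, and the classifier’s decision comes down to noise.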

While I only tested Rock vs. Pop, it could well be that a classifier using all of the genres at once would produce much more acceptable values. That is a potential direction for future research in this field. With the results shown here, however, it is likely that either a Naïve Bayes classifier is not the best tool for the job, or that classifying music solely by lyrics is inherently inaccurate due to the similarity in word choice between genres.

The Code

Github Link:

# Naive bayes classifier to determine songs
# Initially built by Michael Smith and Joshua Niemann for Machine Learning Spam Classifier
# Repurposed and changed by Michael Smith in order to accommodate the lyrics of songs to determine Genre

# imports
import math
import re

# Define variables
training = .7
rockprobability = .6 # Found by distribution of song genre
popprobability = .4

def remove_special_characters(line):
    return (re.sub('[^A-Za-z0-9 ]+', '', line))

def import_data(filename):
    output = []
    for i in open(filename, 'r'):
        split_data = i.split("\t")
        classification_tmp = (0 if split_data[0] == 'pop' else 1)
        data_tmp = (split_data[1])
        output.append([classification_tmp, data_tmp])
    return output

def process(line):
    # lowercase, strip punctuation, split on spaces, and drop empty strings
    data = remove_special_characters(line.lower()).split(" ")
    return [word for word in data if word != '']

def generate_probability_table(pop_wordlist, rock_wordlist):
    # expects pop and rock wordlists with per-word frequencies;
    # builds a table mapping each word to [P(word|pop), P(word|rock)]
    freq_table = {}
    total_pop = len(pop_wordlist)
    total_rock = len(rock_wordlist)
    for i in pop_wordlist.keys():
        freq_table[i] = [pop_wordlist[i], 1]  # 1 = Laplace count for rock
    for i in rock_wordlist.keys():
        if freq_table.get(i) is not None:
            freq_table[i][1] = rock_wordlist[i]
        else:
            freq_table[i] = [1, rock_wordlist[i]]  # 1 = Laplace count for pop
    final_table = {}
    for i in freq_table.keys():
        final_table[i] = [freq_table[i][0] / (total_pop * len(freq_table)),
                          freq_table[i][1] / (total_rock * len(freq_table))]
    return final_table

def train(data):
    # expects an import_data-processed 2D array: [0 or 1 (pop/rock), lyrics string];
    # builds per-genre word-frequency dictionaries and the probability table
    pop_wordlist = dict()
    rock_wordlist = dict()
    for i in data:
        processed_words = process(i[1])
        for word in processed_words:
            if i[0] == 0:
                if pop_wordlist.get(word) is None:
                    pop_wordlist[word] = 2  # 1 occurrence + 1 for Laplace smoothing
                else:
                    pop_wordlist[word] = pop_wordlist[word] + 1
            elif i[0] == 1:
                if rock_wordlist.get(word) is None:
                    rock_wordlist[word] = 2
                else:
                    rock_wordlist[word] = rock_wordlist[word] + 1
    table = generate_probability_table(pop_wordlist, rock_wordlist)
    return table

def calculate(data, trained_table):
    # multiplies per-word probabilities for each class, starting from the priors;
    # returns 1 for rock, 0 for pop
    rocktotal = rockprobability
    poptotal = popprobability

    processed_words = process(data[1])
    for word in processed_words:
        if trained_table.get(word) is not None:
            if trained_table[word][1] != 0:
                rocktotal *= trained_table[word][1]
            if trained_table[word][0] != 0:
                poptotal *= trained_table[word][0]

    if rocktotal >= poptotal:
        return 1
    return 0

def calculate_pop(data, trained_table, ppop):
    # returns the pop score for each song in data (currently unused by main)
    poptotal = []
    for i in data:
        total = ppop
        processed_words = process(i[1])
        for word in processed_words:
            if trained_table.get(word) is not None:
                if trained_table[word][0] != 0:
                    total *= trained_table[word][0]
        poptotal.append(total)
    return poptotal

def main():
    filedata = import_data('SMSrockCollection')
    split = math.floor(training * len(filedata))
    traindata = filedata[:split]
    testdata = filedata[split:]  # test on the remaining 30%
    table = train(traindata)

    tpcount = 0 # Correctly labeled rock
    tncount = 0 # Correctly labeled pop
    fpcount = 0 # pop labeled as rock
    fncount = 0 # rock labeled as pop

    answers = []
    for i in testdata:
        answers.append(calculate(i, table))

    counter = 0

    for i in testdata:
        if i[0] == 1 and answers[counter] == 1:
            tpcount += 1  # rock correctly labeled rock
        elif i[0] == 1 and answers[counter] == 0:
            fncount += 1  # rock labeled as pop
        elif i[0] == 0 and answers[counter] == 1:
            fpcount += 1  # pop labeled as rock
        elif i[0] == 0 and answers[counter] == 0:
            tncount += 1  # pop correctly labeled pop
        counter += 1

    accuracy = (tpcount + tncount) / (tpcount + fpcount + tncount + fncount) * 100
    precision = tpcount / (tpcount + fpcount) * 100
    recall = tpcount / (tpcount + fncount) * 100

    print('TP: ', tpcount)
    print('FP: ', fpcount)
    print('TN: ', tncount)
    print('FN: ', fncount)
    print('Accuracy: ', accuracy, '%')
    print('Precision: ', precision, '%')
    print('Recall: ', recall, '%')

    # sample inputs left over from the original spam classifier
    test1 = [0, "dude! dude! look!"]
    test2 = [1, "winner babe! click for prize"]
    print(calculate(test2, table))


if __name__ == '__main__':
    main()

