By Tyler Diamond
Natural Language Processing can be considered one of the earliest driving forces behind machine learning, and perhaps even behind the computer revolution itself. Over the decades, research and increased computational power have allowed us to make some rather impressive strides in NLP. In this blogpost, I will talk about the process of creating a rap lyric generator using Markov chains and neural networks.
Like nearly all machine learning tasks, we need a fairly large amount of data, known as a corpus when dealing with textual data, in order to train our models. Thankfully the company Genius (formerly Rap Genius) has built the leading platform for storing lyrics. They accept lyrics for all genres of music, and allow the community to annotate the lyrics and provide explanations. This community-driven model means a huge number of songs, old and new, are included in their records. As such, I utilized their API and website in order to scrape lyric data. Unfortunately, Genius provides all info EXCEPT lyrics in their API, certainly an interesting choice.
In addition to downloading the lyric data, we must also clean it up (known as preprocessing) so that our corpus is consistent across all songs. Genius embeds some annotations in the lyrics themselves, such as which lyrics belong to the first verse, the chorus, etc. Most of these annotations are in brackets (e.g. [Verse 1]). I also normalize the data by removing punctuation and lowercasing all the letters.
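To give a concrete picture, here is a minimal sketch of that cleanup step (the exact regex and punctuation handling here are illustrative, not my production code):

```python
import re
import string

def preprocess(raw_lyrics: str) -> list:
    """Clean raw Genius lyrics: drop section annotations, punctuation, and case."""
    lines = []
    for line in raw_lyrics.splitlines():
        # Drop section annotations such as "[Verse 1]" or "[Chorus]"
        line = re.sub(r"\[.*?\]", "", line)
        # Strip punctuation and lowercase everything
        line = line.translate(str.maketrans("", "", string.punctuation)).lower().strip()
        if line:
            lines.append(line)
    return lines

song = "[Verse 1]\nI'm on the Grind, every day!\n[Chorus]\nEvery day, every way."
print(preprocess(song))  # ['im on the grind every day', 'every day every way']
```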
I also tagged the ends of lines and verses, following the intuition explained in the research paper GhostWriter: Using an LSTM for Automatic Rap Lyric Generation (Potash et al.).
In total, about 100,000 lines were used to generate raps, from 4 artists and approximately 1800 songs. These artists are some of my favorites: Lupe Fiasco, Killer Mike, Kendrick Lamar and Eminem.
A diagram of my workflow for retrieving and preprocessing data can be seen below:
Markov chains have found widespread use in text generation, due to their simple structure and their ability to generate realistic sentences given a large enough corpus. Many summary bots, spam bots and other automatic text generation bots rely on Markov chains at their core.
Markov chains satisfy the Markov property: the next state can be predicted from knowledge of the present state alone, independent of the history of states that came before it.
As seen above, if one is in State A there is either a 40% chance of moving to State E, or a 60% chance of looping back into State A. This is a very simple chain, and the chains used by the rap lyric generator are much more complex.
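For illustration, the two-state chain above can be simulated in a few lines of Python (the diagram does not show State E's outgoing transitions, so I assume here that it simply returns to State A):

```python
import random

random.seed(7)

# Transition probabilities from the diagram: from A, 60% stay in A, 40% move to E.
# (E's transitions aren't shown above; assume it returns to A.)
transitions = {
    "A": [("A", 0.6), ("E", 0.4)],
    "E": [("A", 1.0)],
}

def step(state: str) -> str:
    states, probs = zip(*transitions[state])
    return random.choices(states, weights=probs)[0]

# Walk the chain: each next state depends only on the current one (Markov property)
state, walk = "A", ["A"]
for _ in range(10):
    state = step(state)
    walk.append(state)
print("".join(walk))
```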
I use the easy-to-use Python library Markovify, which lets you build a Markov chain from your corpus in a single line and generate sentences in a single line.
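To show what that looks like under the hood, here is a stripped-down bigram version of what such a library does; markovify wraps both steps (building the chain from a corpus and sampling a sentence) into single calls:

```python
import random
from collections import defaultdict

random.seed(1)

corpus = [
    "i came up from the bottom",
    "i came up with a plan",
    "up from the streets i ran",
]

# Build the chain: map each word to every word observed directly after it
chain = defaultdict(list)
for line in corpus:
    words = line.split()
    for cur, nxt in zip(words, words[1:] + ["<end>"]):
        chain[cur].append(nxt)

def make_sentence(start: str, max_words: int = 10) -> str:
    """Random-walk the chain from a start word until an end token or length cap."""
    words, cur = [start], start
    while len(words) < max_words:
        cur = random.choice(chain[cur])
        if cur == "<end>":
            break
        words.append(cur)
    return " ".join(words)

print(make_sentence("i"))
```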
For an introduction to neural networks, please see my previous blogpost on Quick n Dirty Facial Detection
Recurrent Neural Networks (RNNs) are a class of neural networks in which the outputs from previous inputs affect the output for the current input. This is useful for natural language processing tasks that rely on a sequence of data, such as text prediction or speech recognition. Vanilla RNNs suffer from a problem in which data seen many time-steps ago loses its effect on the current output. To solve this, the Long Short-Term Memory (LSTM) network was developed.
I will not go into the low-level details of how LSTMs work; however, two images comparing a vanilla RNN and an LSTM can be seen below. These images are taken from Colah's tutorial, which explains LSTMs very well.
Figure 2: RNN
Figure 3: LSTM
A simpler image of how LSTM and RNN are used can be seen below, with less detail than the above images:
As can be seen, LSTMs exhibit more complex behavior, which allows both short-term and long-term dependencies to remain influential over time. Due to this increased complexity, LSTM networks require a fairly large amount of computational power to train.
The initial model
After following a fantastic tutorial on text generation with an LSTM using Alice in Wonderland as the training corpus, I figured I could attempt to port it to my rap corpus. After training the network on my corpus for a day on a 1080 Ti, I attempted to generate a rap.
Finally, the moment I’ve been waiting for. The promise of dope bars was within my reach and I was excited to see what my network would generate. Perhaps it wouldn’t rhyme, but hopefully it would generate some fun lyrics! Let’s see the results…
Well that was…disappointing. There is no denying the influence of “the streets” in the rap game, and how often rappers attribute their pitfalls and successes to “the streets”. However, I’d argue most people would not consider someone who repeats “the streets” over and over again a dope MC.
With this failure, I had to go back to the drawing board and contemplate how I could change my model to produce some actual lyrics.
Since Markov chains can generate random-ish text pretty well and have low computational overhead, why not generate the lyrics using a Markov chain alone?
What about rhyming, then? We can extract the last syllables from the Markov-generated lines and then find rhyming words using the CMU Pronouncing Dictionary.
For the last word of each generated line, we ask the CMU Pronouncing Dictionary which words rhyme with it. We then extract the last syllable of each rhyming word, as this is usually a pretty good indicator of rhyming (although not perfect; this should be expanded to include multi-syllabic rhymes). The last syllable that appears most often among a word’s rhymes is what we consider its “rhyming” ending.
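Sketched in code, with a toy stand-in for the CMU Pronouncing Dictionary lookup (the real dictionary works on phonemes; the hardcoded rhyme lists and the three-character “syllable” here are deliberate simplifications):

```python
from collections import Counter

# Toy stand-in for the CMU Pronouncing Dictionary rhyme query
RHYMES = {
    "street": ["beat", "heat", "elite", "repeat"],
    "mind": ["find", "grind", "behind", "blind"],
}

def last_syllable(word: str) -> str:
    # Crude approximation: the last few characters stand in for the last syllable
    return word[-3:]

def rhyme_ending(word: str) -> str:
    """Most common last-syllable ending among the word's rhymes."""
    endings = Counter(last_syllable(r) for r in RHYMES.get(word, [word]))
    return endings.most_common(1)[0][0]

print(rhyme_ending("street"))  # 'eat'
print(rhyme_ending("mind"))    # 'ind'
```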
After getting all the rhyming endings, we create our dictionary of raps, a corpus of sorts. We create an entry for each sentence generated by the Markov model, which includes the index of the line’s rhyme ending and the line’s syllable count. Since syllable consistency affects how smooth a rap is, we want lines with similar counts.
Now we create our input and output datasets (since supervised neural networks require an input and output to train on). The structure of the input (X) and output (Y) are seen below:
Our input and output are simply 2×1 vectors. Each training example holds the rhyme and syllable count of the current line, and we use this information to predict the rhyme and syllable count of the next line. These numbers are divided by their maximum values (the number of rhyme endings for rhyme, and a generous syllable cap such as 30 for syllables), as neural networks train best when all data is between 0 and 1.
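A rough sketch of building these normalized training vectors (the line features, the number of rhyme endings, and the syllable cap below are made up for illustration):

```python
import numpy as np

# Each line is (rhyme_ending_index, syllable_count), pulled from the rap dictionary
lines = [(3, 12), (3, 11), (7, 14), (7, 13), (2, 9)]

NUM_RHYME_ENDINGS = 10   # total distinct rhyme endings found in the corpus
MAX_SYLLABLES = 30       # large syllable cap used for normalization

def to_vector(line):
    rhyme, syllables = line
    # Scale both features into [0, 1] so the network trains well
    return [rhyme / NUM_RHYME_ENDINGS, syllables / MAX_SYLLABLES]

# Each line's features (X) predict the next line's features (Y)
X = np.array([to_vector(l) for l in lines[:-1]])
Y = np.array([to_vector(l) for l in lines[1:]])
print(X.shape, Y.shape)  # (4, 2) (4, 2)
```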
I actually created two separate models to generate text with. The first was a non-recurrent neural network with 2 lines per input and output (size 4×1), so we could train on pairs of lines. The second model was an LSTM, which only needs the 2×1 vectors for each line since LSTMs keep data over time. LSTM training matrices, however, must include the previous sequences in their training data.
Therefore the two X sizes were either:
1. Data size x 4 x 1
2. Data size x sequence length x 2
And the two network architectures:
As can be seen above, the LSTM differs from the normal deep neural network in a few ways. First, it takes a 3-dimensional matrix as its input, in order to handle the previously seen sequences. Additionally, it applies 50% dropout while training, randomly zeroing half of its units to avoid overfitting.
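For the curious, the two architectures might look roughly like this in Keras (the layer sizes and sequence length here are illustrative guesses, not my exact configuration):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Model 1: plain feed-forward net on line pairs (two lines flattened to 4 values)
dense_model = keras.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(4),  # rhyme + syllable count for the next two lines
])

# Model 2: LSTM over a sequence of (rhyme, syllable) pairs, with 50% dropout
SEQ_LEN = 3
lstm_model = keras.Sequential([
    layers.Input(shape=(SEQ_LEN, 2)),  # 3-D input: (batch, sequence, features)
    layers.LSTM(32),
    layers.Dropout(0.5),
    layers.Dense(2),  # rhyme + syllable count for the next line
])

print(dense_model(np.zeros((1, 4))).shape)
print(lstm_model(np.zeros((1, SEQ_LEN, 2))).shape)
```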
I trained both networks with a batch size of 32. The batch size is simply the number of training examples processed at once. Even though all training examples are seen each epoch, they are processed separately, split into batches, as this helps the network generalize across all the data.
After training the network, we predict which lines to use after choosing a random starting line from the Markov sentences. We then go through the dictionary of all our lines, and decide which lines to use in our rap based on which most closely match the syllable count and rhyme output by our neural network.
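A simplified sketch of that selection step (the candidate lines and their feature values are made up for illustration):

```python
# Candidate Markov lines with (rhyme_ending_index, syllable_count) features
candidates = {
    "pockets full of dreams and schemes": (4, 8),
    "rolling through the city streets": (4, 7),
    "never looking back at all": (1, 7),
}

def pick_line(predicted, used):
    """Choose the unused line whose features sit closest to the network's prediction."""
    best, best_dist = None, float("inf")
    for line, (rhyme, syllables) in candidates.items():
        if line in used:
            continue
        # Simple distance between predicted and actual (rhyme, syllable) features
        dist = abs(rhyme - predicted[0]) + abs(syllables - predicted[1])
        if dist < best_dist:
            best, best_dist = line, dist
    return best

# Suppose the network predicted rhyme ending 4 with 7 syllables for the next line
print(pick_line((4, 7), used={"pockets full of dreams and schemes"}))
# 'rolling through the city streets'
```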
This method, compared to “the streets” method, performs much faster. All in all it takes about 4 minutes to train and generate sentences, depending on corpus size.
So after all this training and text generation, do we have any results? Well let’s see…
You may have noticed a few things that make this rap appear not all that good. First, it repeats the last word a fair amount. This could be manually prevented by not including a line if its last word has already been seen N times. Additionally, this example and the one at the top of this post do not have very good flow; the syllable counts are all over the place! For some examples I turned off the syllable-count constraint, as enforcing it requires generating a much greater number of Markov lines, and sometimes very few lines have a consistent syllable count AND rhyme scheme.
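The repeated-last-word fix could be as simple as the following filter (the value of N and the sample lines are illustrative):

```python
from collections import Counter

MAX_REPEATS = 2  # N: reject a line once its last word has ended this many lines

def filter_repeats(lines):
    """Drop lines whose last word has already closed MAX_REPEATS earlier lines."""
    seen = Counter()
    kept = []
    for line in lines:
        last = line.split()[-1]
        if seen[last] < MAX_REPEATS:
            kept.append(line)
            seen[last] += 1
    return kept

lines = [
    "im out here living in the streets",
    "nothing ever sweet about the streets",
    "another story told about the streets",
    "found another way up out the streets",
]
print(filter_repeats(lines))  # keeps only the first two "streets" lines
```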
However, here is an example with relatively similar syllables:
Both of these raps could definitely use improvement in their rhyming, such as adding multi-syllabic rhymes and working at the phoneme level instead of the character level; just because the last letters of two words match does not mean they rhyme.
As stated above, improving the rhyming is the top priority for future work on this project. Additionally, hardcoding support for different rhyme schemes would be cool. I would also like to figure out how to remove the Markov generation step, as shown in other research papers, and rely solely on neural networks.