• Son Luu

Predictive Text with Python

Updated: Apr 16, 2019

Class: Reading and Writing Text

Professor: Allison Parrish

This week, we are teaching the computer to generate text on behalf of humans. I've been looking forward to this part of the course.

  • Use predictive models to generate text: either a Markov chain or an RNN, or both.

  • How does your choice of source text affect the output?

  • Try combining predictive text with other methods we’ve used for analyzing and generating text: use RNN-generated text to fill Tracery templates, or train a Markov model on the output of parsing parts of speech from a text, or some other combination.

  • What works and what doesn’t?

  • How does RNN-generated text “feel” different from Markov-generated text?

  • How does the length of the n-gram and the unit of the n-gram affect the quality of the output?


Since we are using an "old, classic, first-generation" method of text generation, the Markov chain, I thought of using a source text that is also a "classic."

For some reason, the Broadway show "The Phantom of the Opera" came to mind as a classic masterpiece, in both its storyline and its beautiful music.

I decided to use a song in the musical that I really liked and see if I would be able to manipulate the lyrics and observe the changes in its mood and emotions.

The song is "Think of Me", written by Andrew Lloyd Webber, Charles Hart, and Richard Stilgoe.

I also chose this text because of the repeated phrase "think of", which could offer an interesting range of probabilistic continuations for the newly generated lyrics.


We started by defining the generative text function, using the Markov chain model.

We instructed the computer to pick at random from all the words that follow the phrase "think of" in the original text, and then to generate a new excerpt using the Markov model.
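The steps described above can be sketched roughly as follows. This is a minimal illustration, not the exact code used in class: the function names `build_model` and `generate` are made up for this sketch, and the sample text is an invented placeholder rather than the actual song lyrics.

```python
import random
from collections import defaultdict


def build_model(words, order=2):
    """Map each run of `order` words to the list of words that follow it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        model[key].append(words[i + order])
    return model


def generate(model, seed, max_words=12, rng=random):
    """Start from `seed` and repeatedly append a randomly chosen follower."""
    output = list(seed)
    order = len(seed)
    for _ in range(max_words):
        followers = model.get(tuple(output[-order:]))
        if not followers:  # dead end: this n-gram only occurs at the very end
            break
        output.append(rng.choice(followers))
    return " ".join(output)


# Placeholder text invented for this sketch (NOT the song lyrics).
sample = "think of rain and think of snow and think of all the roads we know"
words = sample.split()
model = build_model(words, order=2)

# Every word that was seen right after "think of" in the placeholder text.
print(model[("think", "of")])

# A new excerpt that begins with "think of".
print(generate(model, ("think", "of")))
```

Because a follower is stored once for every time it occurs, a word that follows "think of" more often in the source is proportionally more likely to be picked.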

Reflection on the result

Based on the result, using the Markov chain model:

  • How does your choice of source text affect the output? First of all, the source text was short and simple, so the generated text was not wildly unexpected. Second, because I used only one repeated phrase, "think of", the generated text stayed close to the original context. Here and there, the new text produced some interesting contexts, but for the most part I think it remained easy to understand and kept the overall emotional feel of the original text.

  • What works and what doesn’t? Because the Markov model doesn't take into account the meaning or the context of the words, the generated text sometimes doesn't fit the overall context.

  • How does the length of the n-gram and the unit of the n-gram affect the quality of the output? I didn't experiment with the n-gram length, but I would expect changing it to further complicate the generated text, which might then not make logical sense in this specific case. Limiting the generated text to what follows the phrase "think of", in such a short source, kept the output well defined and constrained. As a result, there was little room for the generated text to stray far from the overall mood and emotions of the original text.
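One possible way to explore the n-gram question without judging the output by eye is to count how many contexts actually leave the model a choice. The sketch below is only an illustration under stated assumptions: the helper names `ngram_followers` and `branching` are hypothetical, and the sample text is an invented placeholder, not the song lyrics. On a short text, longer n-grams generally leave fewer contexts with more than one follower, so the output tends to track the source more closely.

```python
from collections import defaultdict


def ngram_followers(units, order):
    """Map each n-gram of length `order` to the set of units seen after it."""
    table = defaultdict(set)
    for i in range(len(units) - order):
        table[tuple(units[i:i + order])].add(units[i + order])
    return table


def branching(units, order):
    """Fraction of n-gram contexts that have more than one possible follower."""
    table = ngram_followers(units, order)
    if not table:  # text is too short for this order
        return 0.0
    choices = sum(1 for followers in table.values() if len(followers) > 1)
    return choices / len(table)


# Placeholder text invented for this sketch (NOT the song lyrics).
sample = "think of rain and think of snow and think of all the roads we know"

# Compare two units (words vs. characters) at several n-gram lengths.
for unit_name, units in (("word", sample.split()), ("character", list(sample))):
    for order in (1, 2, 3):
        print(unit_name, order, round(branching(units, order), 2))
```

A value near zero means the model can almost only copy the source, while a higher value means more room for surprising combinations.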