I recently read this article by Will Kurt in which he generates diagrams to show how closely related the lines of a poem are to each other.[1] The diagrams look something like this:

Each square shows the relation between a pair of lines: a brightly-colored square represents two lines that have a lot of words in common, while a dim square represents lines with few or no words in common. The square in the first row and second column shows the relationship between the first and second lines; the square in the fourth row and seventh column shows the relationship between the fourth and seventh lines, and so on. The diagonal is so bright because each line is identical to itself.

I thought this was a neat way to visualize a poem, and when I heard Daft Punk’s “Get Lucky” I had to create my own visualization. “Get Lucky” consists almost entirely of repeated hooks and I thought it would be fun to illustrate that. Along the way I’ll make Kurt’s process more explicit and rigorous. This article will be accessible to anyone who’s taken first-semester linear algebra.

## Songs into vectors

Our goal is to be able to compare two lines of lyrics by quantifying how many words they have in common. To formalize this let’s view each line as a vector.

Vectors are often used to represent directions on a plane or in a space, in which case each component refers to a spatial direction: $x$, $y$, or $z$. In our case, each component will represent a word in the song’s lyrics, and so each vector will tell us which words the line contains and how often.[2] Formally, a line of lyrics is a vector whose components are nonnegative integers, at least one of which is nonzero. (We can’t have a line with no words!) The basis for our vectors is the set of all words used in the song. The set needs to be ordered somehow; the order doesn’t matter as long as we pick one and stick with it.

A song is a finite sequence of lines. (It’s a sequence instead of a set because lines may be duplicated and because their order is important.) It’s tempting to think of a song as being a vector space, now that each line is a vector, but actually a song doesn’t have the form of any algebraic structure I can think of. For one thing, our vectors are a sequence, not a set. Even if we pretend they’re a set, there is no meaningful binary operation that is closed over that set: If we combine two vectors by adding their corresponding components, we will not (in general) get a third vector that also happens to be in the set. That is, if we combine the words of two of the lines, we’re probably not going to end up with another one of the lines of the song.

## “Eclipse”

In case these abstract definitions are giving you brain damage, let’s take a simple example from Pink Floyd’s “Eclipse”:

All that you touch
And all that you see
All that you taste
All you feel

Ignoring capitalization and punctuation, we count how many times each word occurs in each line:

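| Line | all | and | feel | see | taste | that | touch | you |
|------|-----|-----|------|-----|-------|------|-------|-----|
| 1    | 1   | 0   | 0    | 0   | 0     | 1    | 1     | 1   |
| 2    | 1   | 1   | 0    | 1   | 0     | 1    | 0     | 1   |
| 3    | 1   | 0   | 0    | 0   | 1     | 1    | 0     | 1   |
| 4    | 1   | 0   | 1    | 0   | 0     | 0    | 0     | 1   |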

The words along the top of the table are the basis of our vectors. Relative to this basis, the first line can be represented as the vector

$\mathbf{v}_1 = (1, 0, 0, 0, 0, 1, 1, 1) ,$

indicating that this line contains (for example) one occurrence of “all”, one of “that”, and no occurrences of “see”. If the line had contained “you” twice then the last component would have been 2 instead of 1.
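The counting above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script mentioned later in the article; the helper names (`tokenize`, `build_basis`, `line_to_vector`) are made up for this example, and sorting the words alphabetically is just one convenient way to fix the order of the basis.

```python
import re
from collections import Counter

LINES = [
    "All that you touch",
    "And all that you see",
    "All that you taste",
    "All you feel",
]

def tokenize(line):
    """Lowercase a line and split it into words, ignoring punctuation.

    Assumption: words consist of ASCII letters and straight apostrophes.
    """
    return re.findall(r"[a-z']+", line.lower())

def build_basis(lines):
    """Return every distinct word in the lyrics, in alphabetical order."""
    return sorted({word for line in lines for word in tokenize(line)})

def line_to_vector(line, basis):
    """Count how many times each basis word occurs in the line."""
    counts = Counter(tokenize(line))
    return [counts[word] for word in basis]

basis = build_basis(LINES)
vectors = [line_to_vector(line, basis) for line in LINES]

print(basis)       # ['all', 'and', 'feel', 'see', 'taste', 'that', 'touch', 'you']
print(vectors[0])  # [1, 0, 0, 0, 0, 1, 1, 1]
```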

## Comparing lines

Now we need a way to compare two lines to see how similar they are. One obvious way to compare two vectors is with the dot product,

$\mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^N A_i B_i ,$

where $N$ is the number of elements in the basis (eight in our example). As we would expect, the dot product of two vectors is zero when they have no words in common and it grows as they have more and more words in common. The problem with the dot product is that not all of our vectors have the same length. For example, $\mathbf{v}_2 \cdot \mathbf{v}_2 = 5$, but $\mathbf{v}_4 \cdot \mathbf{v}_4 = 3$. Ideally, our method of comparing vectors would always give the same result when a vector is compared to itself, regardless of the vector’s length.
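As a quick check of those numbers, here is a small sketch of the plain dot product applied to the second and fourth lines of the example. The function name `dot` is just a label chosen for this illustration.

```python
def dot(a, b):
    """Sum of the products of corresponding components."""
    return sum(x * y for x, y in zip(a, b))

v2 = [1, 1, 0, 1, 0, 1, 0, 1]  # "And all that you see"
v4 = [1, 0, 1, 0, 0, 0, 0, 1]  # "All you feel"

print(dot(v2, v2))  # 5
print(dot(v4, v4))  # 3
print(dot(v2, v4))  # 2 ("all" and "you" are shared)
```

A line compared with itself scores 5 in one case and 3 in the other, which is exactly the inconsistency the scaling is meant to remove.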

We can accomplish this by scaling the dot product by the lengths of the two vectors. The length of one of our vectors is

$|\mathbf{A}| = \sqrt{\sum_{i=1}^N A_i^2} .$

The “scaled dot product” of $\mathbf{A}$ and $\mathbf{B}$, which we’ll call $d(\mathbf{A}, \mathbf{B})$, is

$d(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|} = \frac{\sum_{i=1}^N A_i B_i}{\sqrt{\sum_{i=1}^N A_i^2 \cdot \sum_{i=1}^N B_i^2}} .$

You might recognize the middle term in this equation. In geometrical contexts, the dot product of two vectors can be written as

$\mathbf{A} \cdot \mathbf{B} = |\mathbf{A}| |\mathbf{B}| \cos \theta ,$

where $\theta$ is the angle between two vectors. The notion of “angle” doesn’t quite translate into our context (unless you can visualize 11-dimensional space) but the concept is still useful. Putting the last two equations together,

$d(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|} = \cos \theta_{AB} ,$

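where $\theta_{AB}$ is the “angle” between $\mathbf{A}$ and $\mathbf{B}$. Strictly speaking $d$ measures similarity rather than distance, since it grows as the lines become more alike; it is commonly known as cosine similarity. It only yields values between zero (no common words) and one (the same words in the same proportions).[3] This makes it possible for us to compare all of the lines in our song pairwise.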
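A minimal sketch of $d$ in Python might look like the following. The names `dot`, `length`, and `scaled_dot` are chosen for this illustration, and the code relies on the earlier requirement that every line has at least one word, so neither length is zero.

```python
import math

def dot(a, b):
    """Sum of the products of corresponding components."""
    return sum(x * y for x, y in zip(a, b))

def length(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def scaled_dot(a, b):
    """Dot product divided by both lengths, i.e. the cosine of the angle.

    Assumes neither vector is all zeros (every line has at least one word).
    """
    return dot(a, b) / (length(a) * length(b))

v1 = [1, 0, 0, 0, 0, 1, 1, 1]  # "All that you touch"
v3 = [1, 0, 0, 0, 1, 1, 0, 1]  # "All that you taste"

print(scaled_dot(v1, v1))  # 1.0
print(scaled_dot(v1, v3))  # 0.75
```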

(One other thing Will Kurt did in his article was to take the inverse document frequency (IDF) into account in the matrix. Roughly speaking, this scales down each word’s contribution according to how often that word appears in the song, so common words have less of an impact on the matrix. For example, if one pair of lines has only the word “and” in common and another pair has only the word “eclipse” in common, then the latter pair is considered to be more alike, since “eclipse” is a less common word than “and”. This is relatively easy to add to the script, but I chose not to.)
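For the curious, here is one way such a weighting could be sketched. This is an assumption-laden illustration and not necessarily what Kurt’s article does: it treats each line as a “document”, uses the variant $1 + \ln(N / n_w)$, where $N$ is the number of lines and $n_w$ is the number of lines containing word $w$, and the names `idf_weights` and `apply_weights` are invented here. The added 1 is a choice made so that a word appearing in every line is played down rather than erased.

```python
import math

def idf_weights(vectors):
    """One weight per basis word: 1 + ln(lines / lines containing the word).

    Assumes every basis word occurs in at least one line.
    """
    n_lines = len(vectors)
    weights = []
    for i in range(len(vectors[0])):
        lines_with_word = sum(1 for v in vectors if v[i] > 0)
        weights.append(1 + math.log(n_lines / lines_with_word))
    return weights

def apply_weights(vectors, weights):
    """Scale each word count by that word's weight."""
    return [[count * w for count, w in zip(v, weights)] for v in vectors]

# Basis order: all, and, feel, see, taste, that, touch, you
VECTORS = [
    [1, 0, 0, 0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1],
]

weights = idf_weights(VECTORS)
weighted = apply_weights(VECTORS, weights)
# "all" is in every line, so it gets the smallest weight;
# "and" is in only one line, so it gets the largest.
```

The weighted vectors can then be compared with the scaled dot product exactly as before.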

## “Eclipse” again

Let’s go back to our “Eclipse” example, which had the vectors

$\begin{gather} \mathbf{v}_1 = (1, 0, 0, 0, 0, 1, 1, 1) , \\ \mathbf{v}_2 = (1, 1, 0, 1, 0, 1, 0, 1) , \\ \mathbf{v}_3 = (1, 0, 0, 0, 1, 1, 0, 1) , \\ \mathbf{v}_4 = (1, 0, 1, 0, 0, 0, 0, 1) . \end{gather}$

We create a matrix in which the $i, j$ element is $d(\mathbf{v}_i, \mathbf{v}_j)$:

$\begin{bmatrix} \cos \theta_{11} & \cos \theta_{12} & \cos \theta_{13} & \cos \theta_{14} \\ \cos \theta_{21} & \cos \theta_{22} & \cos \theta_{23} & \cos \theta_{24} \\ \cos \theta_{31} & \cos \theta_{32} & \cos \theta_{33} & \cos \theta_{34} \\ \cos \theta_{41} & \cos \theta_{42} & \cos \theta_{43} & \cos \theta_{44} \end{bmatrix} .$

The actual values are

$\begin{bmatrix} 1 & 0.67 & 0.75 & 0.58 \\ 0.67 & 1 & 0.67 & 0.52 \\ 0.75 & 0.67 & 1 & 0.58 \\ 0.58 & 0.52 & 0.58 & 1 \end{bmatrix} .$

All of the values on the main diagonal are ones because every line of lyrics is identical to itself. The matrix is also symmetric because $\cos \theta_{ij} = \cos \theta_{ji}$.
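These values can be reproduced with a short sketch like the one below, which compares every pair of the four vectors and rounds to two decimal places. The function name `scaled_dot` is an illustrative choice, not taken from the script discussed in the next section.

```python
import math

# Basis order: all, and, feel, see, taste, that, touch, you
VECTORS = [
    [1, 0, 0, 0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1],
]

def scaled_dot(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

# Element (i, j) compares line i with line j.
matrix = [[round(scaled_dot(a, b), 2) for b in VECTORS] for a in VECTORS]

for row in matrix:
    print(row)
# [1.0, 0.67, 0.75, 0.58]
# [0.67, 1.0, 0.67, 0.52]
# [0.75, 0.67, 1.0, 0.58]
# [0.58, 0.52, 0.58, 1.0]
```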

## Visualization

Finally we’re ready to turn our songs into images. I wrote a Python script (which you can find on GitHub) to convert a song into a matrix and a matrix into an SVG image. Here’s the image for our section of “Eclipse”:

Lighter colors indicate similarity; darker colors indicate dissimilarity. For comparison, here’s all of “Eclipse”:

The last two lines (“And everything under the sun is in tune / But the sun is eclipsed by the moon”) aren’t very similar to the rest of the song, so the rightmost two columns and the bottom two rows are pretty dark. Since most of the other lines start with something like “and all that you…”, though, the rest of the area is pretty light.
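To give a rough idea of the matrix-to-image step, here is a minimal sketch that draws one colored square per matrix entry. It is not the script on GitHub: the function name `matrix_to_svg`, the `cell` size, and the color scale (dark blue for 0, white for 1) are all assumptions made for this illustration.

```python
def matrix_to_svg(matrix, cell=20):
    """Return an SVG document (as a string) with one square per matrix entry."""
    size = len(matrix) * cell
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{size}" height="{size}">'
    ]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            # Guard against tiny floating-point overshoots outside [0, 1].
            value = min(1.0, max(0.0, value))
            # Assumed color scale: 0 is dark blue, 1 is white.
            red = green = round(255 * value)
            blue = round(80 + 175 * value)
            parts.append(
                f'<rect x="{j * cell}" y="{i * cell}" '
                f'width="{cell}" height="{cell}" '
                f'fill="rgb({red},{green},{blue})"/>'
            )
    parts.append("</svg>")
    return "\n".join(parts)

svg = matrix_to_svg([
    [1.0, 0.67, 0.75, 0.58],
    [0.67, 1.0, 0.67, 0.52],
    [0.75, 0.67, 1.0, 0.58],
    [0.58, 0.52, 0.58, 1.0],
])
```

Row $i$ of the matrix becomes row $i$ of squares, so the bright diagonal runs from the top left to the bottom right.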

A song that I’d consider a little more free-form is “At the Bottom of Everything” by Bright Eyes:

The phrase “we must” is used pretty frequently in this song, but other than that there aren’t many repeated elements, so the diagram is much darker than the one for “Eclipse”. Even darker is “Change Your Mind” by The Killers:

There is a couplet that occurs four times with slight variations (“If the answer is no / Can I change your mind?”). But most of the other words in the song are only used once, so there’s a lot of dark blue in this one.[4]

On the opposite end of the spectrum, here’s “Get Lucky”:

“Get Lucky” has 25 unique lines, repeated for a total of 112 lines. The most repeated line, “We’re up all night to get lucky”, occurs 45 times. No wonder the damn song always gets stuck in my head.
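Counts like these are easy to get with a sketch along the following lines. The function name `repetition_stats` and the toy lyrics are made up for illustration (they are not the lyrics of “Get Lucky”), and lines are treated as equal when they match after lowercasing and collapsing whitespace.

```python
from collections import Counter

def repetition_stats(lines):
    """Return (unique lines, total lines, most repeated line, its count)."""
    normalized = [" ".join(line.lower().split()) for line in lines]
    counts = Counter(normalized)
    most_repeated, occurrences = counts.most_common(1)[0]
    return len(counts), len(normalized), most_repeated, occurrences

# Hypothetical toy lyrics, just to exercise the function.
toy_lines = ["La la la", "Hey now", "la la la", "La  la la"]

print(repetition_stats(toy_lines))  # (2, 4, 'la la la', 3)
```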

1. The author was also active in the Reddit discussion about his article.

2. One disadvantage of this approach is that we discard all the information about the relative positions of the words: “all that” and “that all” are treated the same. If you wanted to retain this information you could take the basis to be the text’s n-grams instead of its words.

3. Proving that $0 \leq \cos \theta_{AB} \leq 1$ is left as an exercise for the reader.

4. I should point out that how you choose to break the lyrics into lines can have a big effect on the structure of the matrix. Splitting up the lines is usually straightforward, but I thought that “Change Your Mind” in particular required some judgment calls.