This post explores my undergraduate thesis on analyzing Shakespeare’s sonnets through their sounds rather than their meanings. By converting poems to phonemes and applying TF-IDF, I arrive at a new way to visualize the sonic relationships between works of poetry using phylogenetic trees.

The Core Idea

What if we could visualize poems not by their meanings, but by their sounds?

In my undergraduate thesis, I worked on analyzing Shakespeare’s 154 sonnets through their phonemic content - the fundamental sounds that make up spoken language. The key insight was treating poems as 39-dimensional vectors (one dimension per English phoneme) and using computational techniques borrowed from bioinformatics to reveal their sonic relationships.

Why Sound Matters

Poetry is fundamentally a sonic art form. While we often analyze poems for their meaning, metaphors, and structure, the sounds themselves create an aesthetic experience that transcends semantic content. Consider how alliteration, assonance, and consonance shape our experience of a poem - these devices work purely through sound, not meaning.

By converting Shakespeare’s words into phonemes using the CMU Pronouncing Dictionary, I could capture these sonic patterns computationally. Each sonnet became a frequency distribution of sounds, revealing its unique acoustic fingerprint.
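
As a rough sketch of this conversion step, the snippet below maps a line of verse to ARPAbet phonemes using NLTK's copy of the CMU Pronouncing Dictionary. It is only illustrative: the thesis also hand-added pronunciations for archaic words, whereas this helper simply skips anything the dictionary does not know.

import re
from nltk.corpus import cmudict  # requires a one-time nltk.download("cmudict")

PRONUNCIATIONS = cmudict.dict()  # maps word -> list of ARPAbet pronunciations

def line_to_phonemes(line):
    """Convert one line of verse into a flat list of ARPAbet phonemes."""
    phonemes = []
    for word in re.findall(r"[a-z']+", line.lower()):
        if word in PRONUNCIATIONS:
            # Take the first listed pronunciation; stress digits (e.g. AH0) are kept.
            phonemes.extend(PRONUNCIATIONS[word][0])
        # Unknown (often archaic) words are skipped here; the thesis added them by hand.
    return phonemes

print(line_to_phonemes("Shall I compare thee to a summer's day?"))
# ['SH', 'AE1', 'L', 'AY1', 'K', 'AH0', 'M', 'P', 'EH1', 'R', 'DH', 'IY1', ...]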

The Technical Approach

The methodology combined techniques from natural language processing and bioinformatics:

  1. Phoneme Conversion: Using ARPAbet (39 phonemes represented as ASCII characters), I converted each sonnet from text to a sequence of sounds. This required manually adding pronunciations for archaic words not in the CMU dictionary.

  2. TF-IDF on Phonemes: Adapting TF-IDF from information retrieval, I weighted phonemes by their distinctiveness. Just as TF-IDF identifies important words by downweighting common ones like “the”, my phonemic TF-IDF identified which sounds were deliberately chosen versus grammatically necessary (a sketch of this step follows the list).

  3. Phylogenetic Trees: Borrowing from evolutionary biology, I used the neighbor-joining algorithm to build trees showing phonemic relationships between sonnets. The algorithm iteratively joins the closest poems based on cosine distance between their phoneme vectors (also sketched below).
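
Here is a minimal sketch of step 2, assuming scikit-learn and treating each sonnet’s space-separated phoneme string as a “document”. The thesis adapted the weighting for dense phoneme counts (see the sparsity note below), so this textbook formulation is only an approximation.

from sklearn.feature_extraction.text import TfidfVectorizer

# Each "document" is one sonnet rendered as a space-separated phoneme string
# (the output of the conversion sketch above); two truncated toy examples here.
sonnet_phoneme_strings = [
    "SH AE1 L AY1 K AH0 M P EH1 R DH IY1",  # opening of Sonnet 18
    "L EH1 T M IY1 N AA1 T T UW1 DH AH0",   # opening of Sonnet 116
]

# Treat every whitespace-separated ARPAbet symbol as a token and keep its case.
vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
tfidf_matrix = vectorizer.fit_transform(sonnet_phoneme_strings)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))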

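A sketch of step 3, continuing from the TF-IDF matrix above and assuming Biopython’s neighbor-joining implementation plus scikit-learn for cosine distances; the thesis’s own implementation may differ in its details.

from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor
from sklearn.metrics.pairwise import cosine_distances

# `tfidf_matrix` is the (n_sonnets x n_phonemes) matrix from the sketch above.
distances = cosine_distances(tfidf_matrix)

# Biopython expects names plus a lower-triangular matrix, including the zero diagonal.
names = [f"sonnet_{i + 1}" for i in range(distances.shape[0])]
lower_triangle = [[float(x) for x in distances[i, : i + 1]] for i in range(len(names))]

tree = DistanceTreeConstructor().nj(DistanceMatrix(names, lower_triangle))
Phylo.draw_ascii(tree)  # text rendering of the phonemic "phylogeny"
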
Key Findings

The analysis revealed several fascinating patterns:

Phonemic Sparsity: Unlike words (where only a small fraction of the English vocabulary appears in any given poem), phonemes are dense - nearly all 39 phonemes appear in every sonnet. Standard IDF weights collapse toward zero when a term appears in every document, so TF-IDF had to be adapted away from its traditional application.

Linguistic Clustering: By grouping phonemes by manner of articulation (plosives like P/B/T/D, fricatives like F/V/S/Z, etc.), we reduced the 39-dimensional space to 5 dimensions. This revealed that Sonnet 18 (“Shall I compare thee…”) has 45% plosives while Sonnet 116 (“Let me not to the marriage…”) has only 28% - creating very different sonic textures.
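
As a sketch of that grouping, the mapping below collapses the ARPAbet consonants into five manner-of-articulation classes and reports their proportions. The exact class assignment used in the thesis is not reproduced here, so treat this particular grouping as an illustrative assumption.

from collections import Counter

# One plausible five-way grouping of the ARPAbet consonants by manner of
# articulation; the thesis's exact assignment may differ.
MANNER_CLASSES = {
    "plosive":     {"P", "B", "T", "D", "K", "G"},
    "affricate":   {"CH", "JH"},
    "fricative":   {"F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH"},
    "nasal":       {"M", "N", "NG"},
    "approximant": {"L", "R", "W", "Y"},
}
PHONEME_TO_MANNER = {ph: manner for manner, members in MANNER_CLASSES.items()
                     for ph in members}

def manner_profile(phonemes):
    """Collapse an ARPAbet phoneme sequence into proportions over the five classes."""
    counts = Counter(PHONEME_TO_MANNER.get(ph.rstrip("012")) for ph in phonemes)
    counts.pop(None, None)  # drop vowels and anything else outside the mapping
    total = sum(counts.values()) or 1
    return {manner: counts.get(manner, 0) / total for manner in MANNER_CLASSES}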

Surprising Uniformity: The top 5 phonemes by TF-IDF score were nearly identical across all sonnets:

  • AH0 (schwa sound, as in “about”)
  • T (as in “time”)
  • N (as in “not”)
  • S (as in “sweet”)
  • L (as in “love”)

This uniformity was unexpected - it seems Shakespeare maintained a consistent sonic palette, with variation coming from the middle-frequency phonemes rather than the extremes.

Future Directions: Word2Vec on Phonemes

The natural next step is incorporating sequential information. While TF-IDF captures frequency, it misses the crucial aspect of phoneme ordering and context. This is where Word2Vec becomes exciting.

Training Word2Vec on phoneme sequences could reveal:

  • Phoneme transition probabilities (which sounds naturally follow others)
  • Context-dependent sound choices (how surrounding phonemes influence selection)
  • Comparison with prose phoneme embeddings to quantify what makes poetry “sound poetic”

The implementation would treat each phoneme as a “word” and each line as a “sentence”:

Original: "When in disgrace with fortune and men's eyes"
Phonemes: W EH1 N IH0 N D IH0 S G R EY1 S W IH1 DH F AO1 R CH AH0 N AE1 N D M EH1 N Z AY1 Z
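
A minimal sketch of that setup, assuming gensim’s Word2Vec and sonnets already converted to per-line phoneme lists (as in the conversion sketch earlier); the hyperparameters here are placeholders rather than tuned values.

from gensim.models import Word2Vec

# Each "sentence" is one line of verse as a list of ARPAbet phonemes, e.g. the
# line quoted above; in practice there is one list per line of every sonnet.
phoneme_lines = [
    ["W", "EH1", "N", "IH0", "N", "D", "IH0", "S", "G", "R", "EY1", "S",
     "W", "IH1", "DH", "F", "AO1", "R", "CH", "AH0", "N",
     "AE1", "N", "D", "M", "EH1", "N", "Z", "AY1", "Z"],
    # ... more lines ...
]

# Small vectors and a narrow window: the phoneme "vocabulary" is tiny compared to words.
model = Word2Vec(sentences=phoneme_lines, vector_size=16, window=3,
                 min_count=1, sg=1, epochs=50, seed=42)

print(model.wv.most_similar("S"))  # phonemes that occur in similar sonic contexts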

Limitations and Learnings

This approach has clear limitations:

  • Reliance on modern pronunciations rather than Elizabethan English
  • Loss of prosodic information (stress, timing, pauses)
  • Non-determinism in the neighbor-joining step, which meant running it multiple times to find stable clusters

That said, the project demonstrates how computational methods can reveal patterns in literature that traditional analysis might miss.

Slide presentation below

Full thesis below