# Algorithms
phonemenal provides four complementary phonetic similarity algorithms. Each captures a different aspect of how words sound alike, and all return normalized scores between 0.0 (completely different) and 1.0 (identical).
## Overview
| Algorithm | What It Measures | Strengths | Weaknesses |
|---|---|---|---|
| PPC-A | Positional phoneme overlap | Captures structural similarity; sensitive to phoneme position | Can overweight shared padding in short words |
| PLD | Syllable-level edit distance | Mirrors perceptual syllable grouping; stress-aware | Requires CMU dict; coarser than phoneme-level |
| PED | Phoneme-level edit distance | Strong on short words and one-phoneme substitutions | Less syllable-aware than PLD |
| LCS | Longest common subsequence | Robust to insertions; good fallback | Order-sensitive; ignores phoneme position |
| Composite | Weighted average of PPC, edit, and LCS | Balanced; configurable | Only as good as its components |
## PPC-A — Positional Phoneme Correlation (Absolute)
PPC-A measures how much of the positional phoneme structure two words share.
### How It Works
- Retrieve pronunciations from the CMU Pronouncing Dictionary
- Build positional combinations — for each phoneme at index `i`, create pairs with its neighbors: `(phoneme_i, phoneme_i+1)` and `(phoneme_i, phoneme_i-1)`, with boundary padding
- Compute the set intersection between the two words' combination sets
- Normalize by the size of the union
```python
from phonemenal import similarity

score = similarity.ppc("crowd", "crown")

# With details
score, details = similarity.ppc("crowd", "crown", raw=True)
# details includes pronunciations, combo sets, intersection info
```
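The combination-and-overlap steps can be sketched in plain Python. This is an illustrative simplification, not phonemenal's implementation; the phoneme lists are hand-written stand-ins for CMU dict lookups, and `"#"` stands for boundary padding:

```python
def positional_combos(phonemes):
    """Pair each phoneme with its right and left neighbors, padding the boundaries."""
    padded = ["#"] + list(phonemes) + ["#"]
    combos = set()
    for i in range(1, len(padded) - 1):
        combos.add((padded[i], padded[i + 1]))  # (phoneme_i, phoneme_i+1)
        combos.add((padded[i], padded[i - 1]))  # (phoneme_i, phoneme_i-1)
    return combos

def ppc_sketch(phonemes1, phonemes2):
    """Intersection over union of the two combination sets."""
    c1, c2 = positional_combos(phonemes1), positional_combos(phonemes2)
    return len(c1 & c2) / len(c1 | c2)

# Hand-written CMU-style pronunciations, stress stripped
crowd = ["K", "R", "AW", "D"]
crown = ["K", "R", "AW", "N"]
clown = ["K", "L", "AW", "N"]
print(ppc_sketch(crowd, crown))  # higher: same structure, only the final consonant differs
print(ppc_sketch(crowd, clown))  # lower: a mid-word change disturbs several neighbor pairs
```

Note how a single mid-word substitution ("crowd" vs "clown") invalidates every pair that touches it, which is what makes the score position-sensitive.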
### When to Use
- Best for words of similar length where positional overlap matters
- Good at distinguishing "crowd" vs "crown" (differ only in final consonant) from "crowd" vs "clown" (differ in structure)
## PLD — Phoneme Levenshtein Distance
PLD computes edit distance at the syllable level rather than individual phonemes, reflecting how humans perceive speech as syllable groups.
### How It Works
- Retrieve pronunciations from CMU dict
- Split into syllables using stress markers (vowels with 0/1/2 stress digits mark syllable nuclei)
- Compute Levenshtein distance between syllable sequences using RapidFuzz, where each syllable is an atomic token
- Normalize to 0.0–1.0: `1 - (distance / max_syllables)`
```python
from phonemenal import similarity

score = similarity.pld("elastic", "fantastic")

# With details
score, details = similarity.pld("elastic", "fantastic", raw=True)
# details includes syllable breakdowns for both words
```
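A minimal sketch of the pipeline, with a plain-Python Levenshtein standing in for RapidFuzz and hand-written pronunciations standing in for CMU dict lookups (the syllable grouping here is a deliberately simple heuristic, not the library's exact rule):

```python
def syllabify(phonemes):
    """Group phonemes into syllables; a stress digit (0/1/2) marks a nucleus."""
    sylls, current = [], []
    for ph in phonemes:
        current.append(ph)
        if ph[-1] in "012":       # vowel nucleus closes the current syllable
            sylls.append(tuple(current))
            current = []
    if current and sylls:         # trailing consonants join the final syllable
        sylls[-1] = sylls[-1] + tuple(current)
    elif current:
        sylls.append(tuple(current))
    return sylls

def levenshtein(a, b):
    """Classic DP edit distance over arbitrary token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def pld_sketch(phonemes1, phonemes2):
    """1 - (syllable edit distance / syllable count of the longer word)."""
    s1, s2 = syllabify(phonemes1), syllabify(phonemes2)
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))

elastic = ["IH0", "L", "AE1", "S", "T", "IH0", "K"]            # hand-written
fantastic = ["F", "AE0", "N", "T", "AE1", "S", "T", "IH0", "K"]
print(pld_sketch(elastic, fantastic))  # nonzero: the final syllable matches exactly
```

Treating each syllable as an atomic token is what keeps the comparison coarse but perceptually grouped.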
### When to Use
- Best for words with different lengths where syllable structure matters
- Captures that "elastic" and "fantastic" share the "-astic" syllable pattern
- More perceptually aligned than raw phoneme comparison
## LCS — Longest Common Subsequence
LCS finds the longest subsequence common to both phoneme sequences and normalizes by total length.
### How It Works
- Retrieve phoneme sequences from CMU dict (or fall back to raw characters)
- Compute LCS via dynamic programming
- Normalize: `(2 * lcs_length) / (len(seq1) + len(seq2))`
```python
from phonemenal import similarity

# Phoneme-based (default)
score = similarity.lcs("packaging", "packages")

# Character-based fallback (for non-dictionary words)
score = similarity.lcs("pytorch", "pytorche", use_phonemes=False)

# With details
score, details = similarity.lcs("packaging", "packages", raw=True)
```
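The DP and normalization can be sketched directly. The example below runs over raw characters, which mirrors the `use_phonemes=False` fallback; phoneme sequences work the same way:

```python
def lcs_length(seq1, seq2):
    """Standard dynamic-programming longest-common-subsequence length."""
    dp = [[0] * (len(seq2) + 1) for _ in range(len(seq1) + 1)]
    for i, x in enumerate(seq1, 1):
        for j, y in enumerate(seq2, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def lcs_sketch(seq1, seq2):
    """Normalize: (2 * lcs_length) / (len(seq1) + len(seq2))."""
    return 2 * lcs_length(seq1, seq2) / (len(seq1) + len(seq2))

print(lcs_sketch("pytorch", "pytorche"))  # 14/15: one inserted character barely hurts
```

Because a subsequence tolerates gaps, an insertion costs only the length-normalization penalty, which is why LCS handles "packaging" vs "packages" gracefully.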
### When to Use
- Good fallback when words aren't in CMU dict (with `use_phonemes=False`)
- Robust to insertions — handles the "packaging" vs "packages" case well
- Less sensitive to position than PPC-A
## PED — Phoneme Edit Distance
PED computes edit distance at the phoneme level after stripping stress markers from CMU pronunciations.
### How It Works
- Retrieve pronunciations from CMU dict
- Strip stress markers from vowels (`AH0` → `AH`)
- Compute Levenshtein distance between phoneme sequences
- Normalize to 0.0–1.0: `1 - (distance / max_phonemes)`
```python
from phonemenal import similarity

score = similarity.ped("cat", "bat")

# With details
score, details = similarity.ped("cat", "bat", raw=True)
```
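The same edit-distance machinery as PLD, applied to individual stress-stripped phonemes instead of syllables. A sketch with hand-written pronunciations in place of CMU lookups:

```python
def levenshtein(a, b):
    """Classic DP edit distance over token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def ped_sketch(phonemes1, phonemes2):
    """Strip stress digits (AH0 -> AH), then 1 - (distance / max length)."""
    p1 = [ph.rstrip("012") for ph in phonemes1]
    p2 = [ph.rstrip("012") for ph in phonemes2]
    return 1 - levenshtein(p1, p2) / max(len(p1), len(p2))

cat = ["K", "AE1", "T"]   # hand-written CMU-style pronunciations
bat = ["B", "AE1", "T"]
print(ped_sketch(cat, bat))  # 2/3: one substitution out of three phonemes
```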
### When to Use
- Best for short words and monosyllables
- Good at capturing one-phoneme substitutions like `"cat"` vs `"bat"`
- Useful alongside PLD rather than as a replacement for it
## Composite Score
The composite score is a weighted average of PPC-A, an adaptive edit channel, and LCS:
```python
from phonemenal import similarity

# Default: emphasize edit similarity (1.0, 2.0, 1.0)
score = similarity.composite("crowd", "crown")

# Custom weights: emphasize the edit channel even more
score = similarity.composite("crowd", "crown", weights=(0.5, 2.0, 0.5))

# Length-based edit selection: PED for monosyllables, PLD otherwise
score = similarity.composite("cat", "bat", edit_mode="length")

# Full report with all individual scores
report = similarity.compare("crowd", "crown")
```
### Choosing Weights
| Scenario | Suggested Weights (PPC, Edit, LCS) | Rationale |
|---|---|---|
| General purpose | `(1.0, 2.0, 1.0)` | Stronger edit sensitivity for near-homophones |
| Short names (3-5 chars) | `(0.5, 2.0, 0.5)` | De-emphasize PPC padding effects |
| Long compound names | `(1.0, 1.5, 1.0)` | Keep syllable structure relevant via PLD |
| Strict matching | `(1.0, 2.0, 1.0)` with high threshold | Default weights, raise threshold to 0.85+ |
`edit_mode` controls how the edit channel is chosen:

- `max` (default): use whichever of PLD or PED scores higher
- `length`: use PED for monosyllable-vs-monosyllable pairs, PLD otherwise
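The combination reduces to a weighted mean over three channels. A sketch under stated assumptions: `composite_sketch` and its parameters are illustrative, not the library's signature, and the channel scores fed in below are made up:

```python
def composite_sketch(ppc, pld, ped, lcs,
                     weights=(1.0, 2.0, 1.0),
                     edit_mode="max", both_monosyllables=False):
    """Weighted average of PPC, one edit channel, and LCS."""
    w_ppc, w_edit, w_lcs = weights
    if edit_mode == "length":
        edit = ped if both_monosyllables else pld   # PED only for short words
    else:                                           # "max": stronger edit channel wins
        edit = max(pld, ped)
    return (w_ppc * ppc + w_edit * edit + w_lcs * lcs) / (w_ppc + w_edit + w_lcs)

# Made-up channel scores for illustration: edit channel counts double by default
print(composite_sketch(0.5, 0.8, 0.6, 0.4))
```

Dividing by the weight sum keeps the result in 0.0–1.0 regardless of the weights chosen.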
## Fallback Encoder
When words aren't in the CMU Pronouncing Dictionary — which is common for package names like `numpy`, `pytorch`, or `fastapi` — phonemenal falls back to a Metaphone-inspired encoder.
### How It Works
- Strip separators (`-`, `_`, `.`) and lowercase
- Apply digraph replacements: `ph` → `f`, `ck` → `k`, `tion` → `shn`, etc.
- Normalize vowels: `a`, `e`, `i`, `o`, `u` → `A`; `y` → `Y`
- Collapse runs of identical characters
- Strip trailing silent-e (only if the original word ended with `e`)
```python
from phonemenal import fallback

# Same key = likely homophones
fallback.phonetic_key("numpy")   # → "nAmpY"
fallback.phonetic_key("numpie")  # → "nAmpY"

# Different keys — compare with similarity
k1 = fallback.phonetic_key("flask")
k2 = fallback.phonetic_key("flazk")
fallback.similarity(k1, k2)  # high score
```
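The five steps map to a few lines of string processing. A simplified sketch: the digraph table here is partial and the replacement order is a guess, so its keys won't always match phonemenal's:

```python
import re

def phonetic_key_sketch(name):
    """Metaphone-inspired key: separators, digraphs, vowels, runs, silent-e."""
    s = re.sub(r"[-_.]", "", name.lower())   # 1. strip separators, lowercase
    had_final_e = s.endswith("e")
    for src, dst in [("tion", "shn"), ("ph", "f"), ("ck", "k")]:
        s = s.replace(src, dst)              # 2. digraph replacements (partial table)
    s = re.sub(r"[aeiou]", "A", s)           # 3. normalize vowels to A ...
    s = s.replace("y", "Y")                  #    ... and y to Y
    s = re.sub(r"(.)\1+", r"\1", s)          # 4. collapse runs of identical characters
    if had_final_e and s.endswith("A"):
        s = s[:-1]                           # 5. drop the trailing silent-e
    return s

# Separator-insensitive: both spellings produce the same key
print(phonetic_key_sketch("py-torch") == phonetic_key_sketch("py_torch"))  # True
```

The silent-e check runs last so that the flag captures the original spelling, not an intermediate form.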
The scanning pipeline automatically switches to this encoder when one or both names are missing from the CMU pronunciations.
## Choosing the Right Approach
```text
Are both words in the CMU Pronouncing Dictionary?
├── Yes → Use similarity.composite() or similarity.compare()
├── No → Use fallback.phonetic_key() + fallback.similarity()
└── Mixed / Unsure → Use scanning.scan() (handles fallback automatically)
```
For batch processing or security scanning, the scanning module handles all of this for you — it tries composite scoring first, falls back to the phonetic encoder when needed, and supports both forward and reverse scanning.