# Algorithms
phonemenal provides four complementary phonetic similarity algorithms. Each captures a different aspect of how words sound alike, and all return normalized scores between 0.0 (completely different) and 1.0 (identical).
## Overview
| Algorithm | What It Measures | Strengths | Weaknesses |
|---|---|---|---|
| PPC-A | Positional phoneme overlap | Captures structural similarity; sensitive to phoneme position | Can overweight shared padding in short words |
| PLD | Syllable-level edit distance | Mirrors perceptual syllable grouping; stress-aware | Requires CMU dict; coarser than phoneme-level |
| PED | Phoneme-level edit distance | Strong on short words and one-phoneme substitutions | Less syllable-aware than PLD |
| LCS | Longest common subsequence | Robust to insertions; good fallback | Order-sensitive; ignores phoneme position |
| Composite | Weighted average of PPC, edit, and LCS | Balanced; configurable | Only as good as its components |
## PPC-A — Positional Phoneme Correlation (Absolute)
PPC-A measures how much of the positional phoneme structure two words share.
### How It Works
- Retrieve pronunciations from the CMU Pronouncing Dictionary
- Build positional combinations — for each phoneme at index `i`, create pairs with its neighbors: `(phoneme_i, phoneme_i+1)` and `(phoneme_i, phoneme_i-1)`, with boundary padding
- Compute the set intersection between the two words' combination sets
- Normalize by the size of the union
```python
from phonemenal import similarity

score = similarity.ppc("crowd", "crown")

# With details
score, details = similarity.ppc("crowd", "crown", raw=True)
# details includes pronunciations, combo sets, intersection info
```
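The combination-and-overlap steps can be sketched in plain Python. This is an illustrative simplification, not phonemenal's implementation; the phoneme lists are hand-written stand-ins for CMU dict lookups, and `"#"` stands for boundary padding:

```python
def positional_combos(phonemes):
    """Pair each phoneme with its right and left neighbors, padding the boundaries."""
    padded = ["#"] + list(phonemes) + ["#"]
    combos = set()
    for i in range(1, len(padded) - 1):
        combos.add((padded[i], padded[i + 1]))  # (phoneme_i, phoneme_i+1)
        combos.add((padded[i], padded[i - 1]))  # (phoneme_i, phoneme_i-1)
    return combos

def ppc_sketch(phonemes1, phonemes2):
    """Intersection over union of the two combination sets."""
    c1, c2 = positional_combos(phonemes1), positional_combos(phonemes2)
    return len(c1 & c2) / len(c1 | c2)

# Hand-written CMU-style pronunciations, stress stripped
crowd = ["K", "R", "AW", "D"]
crown = ["K", "R", "AW", "N"]
clown = ["K", "L", "AW", "N"]
print(ppc_sketch(crowd, crown))  # higher: same structure, only the final consonant differs
print(ppc_sketch(crowd, clown))  # lower: a mid-word change disturbs several neighbor pairs
```

Note how a single mid-word substitution ("crowd" vs "clown") invalidates every pair that touches it, which is what makes the score position-sensitive.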
### When to Use
- Best for words of similar length where positional overlap matters
- Good at distinguishing "crowd" vs "crown" (differ only in final consonant) from "crowd" vs "clown" (differ in structure)
## PLD — Phoneme Levenshtein Distance
PLD computes edit distance at the syllable level rather than individual phonemes, reflecting how humans perceive speech as syllable groups.
### How It Works
- Retrieve pronunciations from CMU dict
- Split into syllables using stress markers (vowels with 0/1/2 stress digits mark syllable nuclei)
- Compute Levenshtein distance between syllable sequences using RapidFuzz, where each syllable is an atomic token
- Normalize to 0.0–1.0: `1 - (distance / max_syllables)`
```python
from phonemenal import similarity

score = similarity.pld("elastic", "fantastic")

# With details
score, details = similarity.pld("elastic", "fantastic", raw=True)
# details includes syllable breakdowns for both words
```
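A minimal sketch of the pipeline, with a plain-Python Levenshtein standing in for RapidFuzz and hand-written pronunciations standing in for CMU dict lookups (the syllable grouping here is a deliberately simple heuristic, not the library's exact rule):

```python
def syllabify(phonemes):
    """Group phonemes into syllables; a stress digit (0/1/2) marks a nucleus."""
    sylls, current = [], []
    for ph in phonemes:
        current.append(ph)
        if ph[-1] in "012":       # vowel nucleus closes the current syllable
            sylls.append(tuple(current))
            current = []
    if current and sylls:         # trailing consonants join the final syllable
        sylls[-1] = sylls[-1] + tuple(current)
    elif current:
        sylls.append(tuple(current))
    return sylls

def levenshtein(a, b):
    """Classic DP edit distance over arbitrary token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def pld_sketch(phonemes1, phonemes2):
    """1 - (syllable edit distance / syllable count of the longer word)."""
    s1, s2 = syllabify(phonemes1), syllabify(phonemes2)
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))

elastic = ["IH0", "L", "AE1", "S", "T", "IH0", "K"]            # hand-written
fantastic = ["F", "AE0", "N", "T", "AE1", "S", "T", "IH0", "K"]
print(pld_sketch(elastic, fantastic))  # nonzero: the final syllable matches exactly
```

Treating each syllable as an atomic token is what keeps the comparison coarse but perceptually grouped.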
### When to Use
- Best for words with different lengths where syllable structure matters
- Captures that "elastic" and "fantastic" share the "-astic" syllable pattern
- More perceptually aligned than raw phoneme comparison
## LCS — Longest Common Subsequence
LCS finds the longest subsequence common to both phoneme sequences and normalizes by total length.
### How It Works
- Retrieve phoneme sequences from CMU dict (or fall back to raw characters)
- Compute LCS via dynamic programming
- Normalize: `(2 * lcs_length) / (len(seq1) + len(seq2))`
```python
from phonemenal import similarity

# Phoneme-based (default)
score = similarity.lcs("packaging", "packages")

# Character-based fallback (for non-dictionary words)
score = similarity.lcs("pytorch", "pytorche", use_phonemes=False)

# With details
score, details = similarity.lcs("packaging", "packages", raw=True)
```
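The DP and normalization can be sketched directly. The example below runs over raw characters, which mirrors the `use_phonemes=False` fallback; phoneme sequences work the same way:

```python
def lcs_length(seq1, seq2):
    """Standard dynamic-programming longest-common-subsequence length."""
    dp = [[0] * (len(seq2) + 1) for _ in range(len(seq1) + 1)]
    for i, x in enumerate(seq1, 1):
        for j, y in enumerate(seq2, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def lcs_sketch(seq1, seq2):
    """Normalize: (2 * lcs_length) / (len(seq1) + len(seq2))."""
    return 2 * lcs_length(seq1, seq2) / (len(seq1) + len(seq2))

print(lcs_sketch("pytorch", "pytorche"))  # 14/15: one inserted character barely hurts
```

Because a subsequence tolerates gaps, an insertion costs only the length-normalization penalty, which is why LCS handles "packaging" vs "packages" gracefully.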
### When to Use
- Good fallback when words aren't in CMU dict (with `use_phonemes=False`)
- Robust to insertions — handles the "packaging" vs "packages" case well
- Less sensitive to position than PPC-A
## PED — Phoneme Edit Distance
PED computes edit distance at the phoneme level after stripping stress markers from CMU pronunciations.
### How It Works
- Retrieve pronunciations from CMU dict
- Strip stress markers from vowels (`AH0` → `AH`)
- Compute Levenshtein distance between phoneme sequences
- Normalize to 0.0–1.0: `1 - (distance / max_phonemes)`
```python
from phonemenal import similarity

score = similarity.ped("cat", "bat")

# With details
score, details = similarity.ped("cat", "bat", raw=True)
```
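The same edit-distance machinery as PLD, applied to individual stress-stripped phonemes instead of syllables. A sketch with hand-written pronunciations in place of CMU lookups:

```python
def levenshtein(a, b):
    """Classic DP edit distance over token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def ped_sketch(phonemes1, phonemes2):
    """Strip stress digits (AH0 -> AH), then 1 - (distance / max length)."""
    p1 = [ph.rstrip("012") for ph in phonemes1]
    p2 = [ph.rstrip("012") for ph in phonemes2]
    return 1 - levenshtein(p1, p2) / max(len(p1), len(p2))

cat = ["K", "AE1", "T"]   # hand-written CMU-style pronunciations
bat = ["B", "AE1", "T"]
print(ped_sketch(cat, bat))  # 2/3: one substitution out of three phonemes
```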
### When to Use
- Best for short words and monosyllables
- Good at capturing one-phoneme substitutions like `"cat"` vs `"bat"`
- Useful alongside PLD rather than as a replacement for it
## Composite Score
The composite score is a weighted average of PPC-A, an adaptive edit channel, and LCS:
```python
from phonemenal import similarity

# Default: emphasize edit similarity (1.0, 2.0, 1.0)
score = similarity.composite("crowd", "crown")

# Custom weights: emphasize the edit channel even more
score = similarity.composite("crowd", "crown", weights=(0.5, 2.0, 0.5))

# Length-based edit selection: PED for monosyllables, PLD otherwise
score = similarity.composite("cat", "bat", edit_mode="length")

# Full report with all individual scores
report = similarity.compare("crowd", "crown")
```
### Choosing Weights
| Scenario | Suggested Weights (PPC, Edit, LCS) | Rationale |
|---|---|---|
| General purpose | `(1.0, 2.0, 1.0)` | Stronger edit sensitivity for near-homophones |
| Short names (3-5 chars) | `(0.5, 2.0, 0.5)` | De-emphasize PPC padding effects |
| Long compound names | `(1.0, 1.5, 1.0)` | Keep syllable structure relevant via PLD |
| Strict matching | `(1.0, 2.0, 1.0)` with high threshold | Default weights, raise threshold to 0.85+ |
`edit_mode` controls how the edit channel is chosen:

- `max` (default): use whichever of PLD or PED scores higher
- `length`: use PED for monosyllable-vs-monosyllable pairs, PLD otherwise
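The combination reduces to a weighted mean over three channels. A sketch under stated assumptions: `composite_sketch` and its parameters are illustrative, not the library's signature, and the channel scores fed in below are made up:

```python
def composite_sketch(ppc, pld, ped, lcs,
                     weights=(1.0, 2.0, 1.0),
                     edit_mode="max", both_monosyllables=False):
    """Weighted average of PPC, one edit channel, and LCS."""
    w_ppc, w_edit, w_lcs = weights
    if edit_mode == "length":
        edit = ped if both_monosyllables else pld   # PED only for short words
    else:                                           # "max": stronger edit channel wins
        edit = max(pld, ped)
    return (w_ppc * ppc + w_edit * edit + w_lcs * lcs) / (w_ppc + w_edit + w_lcs)

# Made-up channel scores for illustration: edit channel counts double by default
print(composite_sketch(0.5, 0.8, 0.6, 0.4))
```

Dividing by the weight sum keeps the result in 0.0–1.0 regardless of the weights chosen.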
## Fallback Encoder
When words aren't in the CMU Pronouncing Dictionary — which is common for package names like `numpy`, `pytorch`, or `fastapi` — phonemenal falls back to a Metaphone-inspired encoder.
### How It Works
- Strip separators (`-`, `_`, `.`) and lowercase
- Apply digraph replacements: `ph` → `f`, `ck` → `k`, `tion` → `shn`, etc.
- Normalize vowels: `a`, `e`, `i`, `o`, `u` → `A`; `y` → `Y`
- Collapse runs of identical characters
- Strip trailing silent-e (only if the original word ended with `e`)
```python
from phonemenal import fallback

# Same key = likely homophones
fallback.phonetic_key("numpy")   # → "nAmpY"
fallback.phonetic_key("numpie")  # → "nAmpY"

# Different keys — compare with similarity
k1 = fallback.phonetic_key("flask")
k2 = fallback.phonetic_key("flazk")
fallback.similarity(k1, k2)  # high score
```
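The five steps map to a few lines of string processing. A simplified sketch: the digraph table here is partial and the replacement order is a guess, so its keys won't always match phonemenal's:

```python
import re

def phonetic_key_sketch(name):
    """Metaphone-inspired key: separators, digraphs, vowels, runs, silent-e."""
    s = re.sub(r"[-_.]", "", name.lower())   # 1. strip separators, lowercase
    had_final_e = s.endswith("e")
    for src, dst in [("tion", "shn"), ("ph", "f"), ("ck", "k")]:
        s = s.replace(src, dst)              # 2. digraph replacements (partial table)
    s = re.sub(r"[aeiou]", "A", s)           # 3. normalize vowels to A ...
    s = s.replace("y", "Y")                  #    ... and y to Y
    s = re.sub(r"(.)\1+", r"\1", s)          # 4. collapse runs of identical characters
    if had_final_e and s.endswith("A"):
        s = s[:-1]                           # 5. drop the trailing silent-e
    return s

# Separator-insensitive: both spellings produce the same key
print(phonetic_key_sketch("py-torch") == phonetic_key_sketch("py_torch"))  # True
```

The silent-e check runs last so that the flag captures the original spelling, not an intermediate form.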
The scanning pipeline automatically switches to this encoder when one or both names are missing from the CMU pronunciations.
## Choosing the Right Approach
```text
Are both words in the CMU Pronouncing Dictionary?
├── Yes → Use similarity.composite() or similarity.compare()
├── No → Use fallback.phonetic_key() + fallback.similarity()
└── Mixed / Unsure → Use scanning.scan() (handles fallback automatically)
```
For batch processing or security scanning, the scanning module handles all of this for you — it tries composite scoring first, falls back to the phonetic encoder when needed, and supports both forward and reverse scanning.