# Algorithms
phonemenal provides three complementary phonetic similarity algorithms. Each captures a different aspect of how words sound alike, and all return normalized scores between 0.0 (completely different) and 1.0 (identical).
## Overview
| Algorithm | What It Measures | Strengths | Weaknesses |
|---|---|---|---|
| PPC-A | Positional phoneme overlap | Captures structural similarity; sensitive to phoneme position | Can overweight shared padding in short words |
| PLD | Syllable-level edit distance | Mirrors perceptual syllable grouping; stress-aware | Requires CMU dict; coarser than phoneme-level |
| LCS | Longest common subsequence | Robust to insertions; good fallback | Order-sensitive; ignores phoneme position |
| Composite | Weighted average of all three | Balanced; configurable | Only as good as its components |
## PPC-A — Positional Phoneme Correlation (Absolute)
PPC-A measures how much of the positional phoneme structure two words share.
### How It Works

- Retrieve pronunciations from the CMU Pronouncing Dictionary
- Build positional combinations: for each phoneme at index `i`, create pairs with its neighbors, `(phoneme_i, phoneme_i+1)` and `(phoneme_i, phoneme_i-1)`, with boundary padding
- Compute the set intersection between the two words' combination sets
- Normalize by the size of the union
```python
from phonemenal import similarity

score = similarity.ppc("crowd", "crown")

# With details
score, details = similarity.ppc("crowd", "crown", raw=True)
# details includes pronunciations, combo sets, intersection info
```
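The combination-building and overlap steps described above can be sketched in plain Python. This is illustrative only, not the library's implementation: the `"#"` boundary padding and the exact shape of each combination are assumptions.

```python
def positional_combos(phonemes):
    """Build positional neighbor pairs for a phoneme sequence.

    Pads the sequence at both boundaries with "#" and records, for each
    position, a pair with the right neighbor and a pair with the left
    neighbor, keyed by position so the measure stays positional.
    """
    padded = ["#"] + list(phonemes) + ["#"]
    combos = set()
    for i in range(1, len(padded) - 1):
        combos.add((i, padded[i], padded[i + 1]))  # pair with right neighbor
        combos.add((i, padded[i - 1], padded[i]))  # pair with left neighbor
    return combos


def ppc_a(phonemes1, phonemes2):
    """Intersection size over union size of the two combination sets."""
    c1, c2 = positional_combos(phonemes1), positional_combos(phonemes2)
    return len(c1 & c2) / len(c1 | c2)
```

For "crowd" (K R AW1 D) vs "crown" (K R AW1 N), only the final consonant differs, so this sketch shares 5 of 11 distinct combinations, scoring about 0.45.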
### When to Use
- Best for words of similar length where positional overlap matters
- Good at distinguishing "crowd" vs "crown" (differ only in final consonant) from "crowd" vs "clown" (differ in structure)
## PLD — Phoneme Levenshtein Distance
PLD computes edit distance at the syllable level rather than individual phonemes, reflecting how humans perceive speech as syllable groups.
### How It Works
- Retrieve pronunciations from CMU dict
- Split into syllables using stress markers (vowels with 0/1/2 stress digits mark syllable nuclei)
- Compute Levenshtein distance between syllable sequences using RapidFuzz, where each syllable is an atomic token
- Normalize to 0.0–1.0: `1 - (distance / max_syllables)`
```python
from phonemenal import similarity

score = similarity.pld("elastic", "fantastic")

# With details
score, details = similarity.pld("elastic", "fantastic", raw=True)
# details includes syllable breakdowns for both words
```
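The steps above can be sketched in plain Python, assuming CMU-style ARPABET phonemes with stress digits. This is an illustrative version: the syllable grouping is simplified (real syllabifiers assign onsets and codas more carefully), and the library uses RapidFuzz rather than this hand-rolled edit distance.

```python
def syllabify(phonemes):
    """Group phonemes into syllables: each vowel (ending in a 0/1/2
    stress digit) is a nucleus and opens a new syllable."""
    syllables, current = [], []
    for ph in phonemes:
        # A new nucleus closes the previous syllable, if it had one.
        if ph[-1].isdigit() and any(p[-1].isdigit() for p in current):
            syllables.append(current)
            current = []
        current.append(ph)
    if current:
        syllables.append(current)
    return syllables


def levenshtein(a, b):
    """Plain DP edit distance over atomic tokens (here: whole syllables)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def pld(phonemes1, phonemes2):
    s1, s2 = syllabify(phonemes1), syllabify(phonemes2)
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))
```

"elastic" (IH0 L AE1 S T IH0 K) and "fantastic" (F AE0 N T AE1 S T IH0 K) each syllabify to three syllables here, differing only in the first, so the sketch scores 1 - 1/3 ≈ 0.67.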
### When to Use
- Best for words with different lengths where syllable structure matters
- Captures that "elastic" and "fantastic" share the "-astic" syllable pattern
- More perceptually aligned than raw phoneme comparison
## LCS — Longest Common Subsequence
LCS finds the longest subsequence common to both phoneme sequences and normalizes by total length.
### How It Works
- Retrieve phoneme sequences from CMU dict (or fall back to raw characters)
- Compute LCS via dynamic programming
- Normalize: `(2 * lcs_length) / (len(seq1) + len(seq2))`
```python
from phonemenal import similarity

# Phoneme-based (default)
score = similarity.lcs("packaging", "packages")

# Character-based fallback (for non-dictionary words)
score = similarity.lcs("pytorch", "pytorche", use_phonemes=False)

# With details
score, details = similarity.lcs("packaging", "packages", raw=True)
```
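The dynamic program and normalization can be sketched as follows (an illustrative version of the steps above, working over any sequence, whether phonemes or characters):

```python
def lcs_length(a, b):
    """Classic O(len(a) * len(b)) dynamic program for LCS length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1  # extend the common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def lcs_score(a, b):
    """Normalize to 0.0-1.0: (2 * lcs_length) / (len(seq1) + len(seq2))."""
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

At the character level, "packaging" and "packages" share the six-character subsequence "packag", so the sketch scores 12/17 ≈ 0.71.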
### When to Use

- Good fallback when words aren't in CMU dict (with `use_phonemes=False`)
- Robust to insertions — handles the "packaging" vs "packages" case well
- Less sensitive to position than PPC-A
## Composite Score
The composite score is a weighted average of all three algorithms:
```python
from phonemenal import similarity

# Default: equal weights (1.0, 1.0, 1.0)
score = similarity.composite("crowd", "crown")

# Custom weights: emphasize PLD
score = similarity.composite("crowd", "crown", weights=(0.5, 2.0, 0.5))

# Full report with all individual scores
report = similarity.compare("crowd", "crown")
```
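The weighting itself is just a normalized weighted mean. A sketch of the formula (the library's `similarity.composite` computes the component scores itself; here they are passed in directly):

```python
def composite(ppc, pld, lcs, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three component scores in (PPC, PLD, LCS)
    order, normalized by the total weight."""
    w_ppc, w_pld, w_lcs = weights
    return (w_ppc * ppc + w_pld * pld + w_lcs * lcs) / (w_ppc + w_pld + w_lcs)
```

With equal weights this is the plain mean; with `(0.5, 2.0, 0.5)` the PLD score contributes four times as much as either of the others.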
### Choosing Weights
| Scenario | Suggested Weights (PPC, PLD, LCS) | Rationale |
|---|---|---|
| General purpose | (1.0, 1.0, 1.0) | Balanced |
| Short names (3-5 chars) | (0.5, 1.0, 1.5) | PPC can overweight padding; LCS more stable |
| Long compound names | (1.0, 1.5, 1.0) | Syllable structure matters more |
| Strict matching | (1.0, 1.0, 1.0) with high threshold | Default weights, raise threshold to 0.85+ |
## Fallback Encoder
When words aren't in the CMU Pronouncing Dictionary — which is common for package names like `numpy`, `pytorch`, or `fastapi` — phonemenal falls back to a Metaphone-inspired encoder.
### How It Works
- Strip separators (`-`, `_`, `.`), lowercase
- Apply digraph replacements: `ph`→`f`, `ck`→`k`, `tion`→`shn`, etc.
- Normalize vowels: `a`, `e`, `i`, `o`, `u` → `A`; `y` → `Y`
- Collapse runs of identical characters
- Strip trailing silent-e (only if original ended with `e`)
```python
from phonemenal import fallback

# Same key = likely homophones
fallback.phonetic_key("numpy")   # → "nAmpY"
fallback.phonetic_key("numpie")  # → "nAmpY"

# Different keys — compare with similarity
k1 = fallback.phonetic_key("flask")
k2 = fallback.phonetic_key("flazk")
fallback.similarity(k1, k2)  # high score
```
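The listed steps can be sketched in plain Python. This is an illustrative reimplementation, not the library's encoder: the digraph table below contains only the examples given, and the step ordering is an assumption, so its keys will not match the library's for every input (it does reproduce the `numpy` example above).

```python
import re

# Only the digraphs named in the docs; the real table is larger.
DIGRAPHS = [("tion", "shn"), ("ph", "f"), ("ck", "k")]


def phonetic_key(word: str) -> str:
    original = word.lower()
    # 1. Strip separators (-, _, .) and lowercase
    key = re.sub(r"[-_.]", "", original)
    # 2. Apply digraph replacements
    for src, dst in DIGRAPHS:
        key = key.replace(src, dst)
    # 3. Normalize vowels: a/e/i/o/u -> A, y -> Y
    key = re.sub(r"[aeiou]", "A", key).replace("y", "Y")
    # 4. Collapse runs of identical characters
    key = re.sub(r"(.)\1+", r"\1", key)
    # 5. Strip the trailing character if the original ended in silent-e
    if original.endswith("e"):
        key = key[:-1]
    return key
```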
The scanning pipeline automatically falls back to this encoder when the CMU-dict-based composite score returns 0.0.
## Choosing the Right Approach
```text
Are both words in the CMU Pronouncing Dictionary?
├── Yes → Use similarity.composite() or similarity.compare()
├── No → Use fallback.phonetic_key() + fallback.similarity()
└── Mixed / Unsure → Use scanning.scan() (handles fallback automatically)
```
For batch processing or security scanning, the scanning module handles all of this for you — it tries composite scoring first, falls back to the phonetic encoder when needed, and supports both forward and reverse scanning.