Algorithms

phonemenal provides three complementary phonetic similarity algorithms. Each captures a different aspect of how words sound alike, and all return normalized scores between 0.0 (completely different) and 1.0 (identical).

Overview

| Algorithm | What It Measures | Strengths | Weaknesses |
|-----------|------------------|-----------|------------|
| PPC-A | Positional phoneme overlap | Captures structural similarity; sensitive to phoneme position | Can overweight shared padding in short words |
| PLD | Syllable-level edit distance | Mirrors perceptual syllable grouping; stress-aware | Requires CMU dict; coarser than phoneme-level |
| LCS | Longest common subsequence | Robust to insertions; good fallback | Order-sensitive; ignores phoneme position |
| Composite | Weighted average of all three | Balanced; configurable | Only as good as its components |

PPC-A — Positional Phoneme Correlation (Absolute)

PPC-A measures how much of the positional phoneme structure two words share.

How It Works

  1. Retrieve pronunciations from the CMU Pronouncing Dictionary
  2. Build positional combinations — for each phoneme at index i, create pairs with its neighbors: (phoneme_i, phoneme_i+1), (phoneme_i, phoneme_i-1), with boundary padding
  3. Compute set intersection between the two words' combination sets
  4. Normalize by the size of the union
```python
from phonemenal import similarity

score = similarity.ppc("crowd", "crown")

# With details
score, details = similarity.ppc("crowd", "crown", raw=True)
# details includes pronunciations, combo sets, intersection info
```
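The combination-and-overlap steps above can be sketched in plain Python. This is an illustrative reconstruction, not phonemenal's actual implementation: the helper names, the `#` padding symbol, and the exact pair scheme are assumptions.

```python
def positional_combos(phonemes):
    """Build the set of positional phoneme pairs for one word (sketch)."""
    padded = ["#"] + list(phonemes) + ["#"]  # boundary padding
    combos = set()
    for i in range(1, len(padded) - 1):
        combos.add((i, padded[i], padded[i + 1]))  # (phoneme_i, phoneme_i+1)
        combos.add((i, padded[i], padded[i - 1]))  # (phoneme_i, phoneme_i-1)
    return combos

def ppc_a(phonemes1, phonemes2):
    """Intersection over union of the two words' combination sets."""
    c1, c2 = positional_combos(phonemes1), positional_combos(phonemes2)
    return len(c1 & c2) / len(c1 | c2)

# CMU pronunciations (looked up from the dictionary by the real library):
crowd = ["K", "R", "AW1", "D"]
crown = ["K", "R", "AW1", "N"]
clown = ["K", "L", "AW1", "N"]

ppc_a(crowd, crown)  # higher: only the final consonant differs
ppc_a(crowd, clown)  # lower: the onset structure differs too
```

Because each pair carries its index, a phoneme match only counts when it occurs at the same position in both words — which is what makes the score position-sensitive.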

When to Use

  • Best for words of similar length where positional overlap matters
  • Good at distinguishing "crowd" vs "crown" (differ only in final consonant) from "crowd" vs "clown" (differ in structure)

PLD — Phoneme Levenshtein Distance

PLD computes edit distance at the syllable level rather than individual phonemes, reflecting how humans perceive speech as syllable groups.

How It Works

  1. Retrieve pronunciations from CMU dict
  2. Split into syllables using stress markers (vowels with 0/1/2 stress digits mark syllable nuclei)
  3. Compute Levenshtein distance between syllable sequences using RapidFuzz, where each syllable is an atomic token
  4. Normalize to 0.0–1.0: 1 - (distance / max_syllables)
```python
from phonemenal import similarity

score = similarity.pld("elastic", "fantastic")

# With details
score, details = similarity.pld("elastic", "fantastic", raw=True)
# details includes syllable breakdowns for both words
```
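A minimal sketch of the four steps above, assuming a naive syllabification rule and substituting a plain-Python edit distance for RapidFuzz (which the library actually uses):

```python
def split_syllables(phonemes):
    """Naive syllabification: a stress digit (0/1/2) marks a nucleus and
    closes the current syllable; trailing consonants join the last one."""
    syllables, current = [], []
    for ph in phonemes:
        current.append(ph)
        if ph[-1] in "012":
            syllables.append(tuple(current))
            current = []
    if current:
        if syllables:
            syllables[-1] = syllables[-1] + tuple(current)
        else:
            syllables.append(tuple(current))
    return syllables

def levenshtein(a, b):
    """Edit distance where each syllable is one atomic token."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def pld(phonemes1, phonemes2):
    s1, s2 = split_syllables(phonemes1), split_syllables(phonemes2)
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))

elastic = ["IH0", "L", "AE1", "S", "T", "IH0", "K"]
fantastic = ["F", "AE0", "N", "T", "AE1", "S", "T", "IH0", "K"]
pld(elastic, fantastic)  # the final "-stic" syllable matches exactly
```

With this grouping both words split into three syllables, the last one identical, so the distance is 2 and the score is 1 - 2/3.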

When to Use

  • Best for words with different lengths where syllable structure matters
  • Captures that "elastic" and "fantastic" share the "-astic" syllable pattern
  • More perceptually aligned than raw phoneme comparison

LCS — Longest Common Subsequence

LCS finds the longest subsequence common to both phoneme sequences and normalizes by total length.

How It Works

  1. Retrieve phoneme sequences from CMU dict (or fall back to raw characters)
  2. Compute LCS via dynamic programming
  3. Normalize: (2 * lcs_length) / (len(seq1) + len(seq2))
```python
from phonemenal import similarity

# Phoneme-based (default)
score = similarity.lcs("packaging", "packages")

# Character-based fallback (for non-dictionary words)
score = similarity.lcs("pytorch", "pytorche", use_phonemes=False)

# With details
score, details = similarity.lcs("packaging", "packages", raw=True)
```
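The dynamic-programming step and the normalization can be sketched as follows; the function names are illustrative, and the same code works on phoneme lists or on raw character strings (the fallback mode):

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def lcs_similarity(a, b):
    # (2 * lcs_length) / (len(seq1) + len(seq2))
    return 2 * lcs_length(a, b) / (len(a) + len(b))

lcs_similarity("packaging", "packages")  # character-based for illustration
```

For "packaging" vs "packages" the LCS is "packag" (length 6), giving 12 / 17 — a high score despite the differing suffixes, which is the insertion-robustness the section describes.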

When to Use

  • Good fallback when words aren't in CMU dict (with use_phonemes=False)
  • Robust to insertions — handles the "packaging" vs "packages" case well
  • Less sensitive to position than PPC-A

Composite Score

The composite score is a weighted average of all three algorithms:

\[\text{composite} = \frac{w_1 \cdot \text{PPC} + w_2 \cdot \text{PLD} + w_3 \cdot \text{LCS}}{w_1 + w_2 + w_3}\]
```python
from phonemenal import similarity

# Default: equal weights (1.0, 1.0, 1.0)
score = similarity.composite("crowd", "crown")

# Custom weights: emphasize PLD
score = similarity.composite("crowd", "crown", weights=(0.5, 2.0, 0.5))

# Full report with all individual scores
report = similarity.compare("crowd", "crown")
```
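The formula maps directly onto a few lines. The component scores below are made-up numbers for illustration, not real outputs:

```python
def composite(ppc, pld, lcs, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three component scores."""
    w1, w2, w3 = weights
    return (w1 * ppc + w2 * pld + w3 * lcs) / (w1 + w2 + w3)

composite(0.6, 0.9, 0.3)                           # equal weights → 0.6
composite(0.6, 0.9, 0.3, weights=(0.5, 2.0, 0.5))  # emphasizing PLD → 0.75
```

Dividing by the weight sum keeps the result in 0.0–1.0 regardless of how the weights are scaled.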

Choosing Weights

| Scenario | Suggested Weights (PPC, PLD, LCS) | Rationale |
|----------|-----------------------------------|-----------|
| General purpose | (1.0, 1.0, 1.0) | Balanced |
| Short names (3–5 chars) | (0.5, 1.0, 1.5) | PPC can overweight padding; LCS is more stable |
| Long compound names | (1.0, 1.5, 1.0) | Syllable structure matters more |
| Strict matching | (1.0, 1.0, 1.0) with a high threshold | Keep default weights; raise the threshold to 0.85+ |

Fallback Encoder

When words aren't in the CMU Pronouncing Dictionary — which is common for package names like numpy, pytorch, or fastapi — phonemenal falls back to a Metaphone-inspired encoder.

How It Works

  1. Strip separators (-, _, .), lowercase
  2. Apply digraph replacements: ph→f, ck→k, tion→shn, etc.
  3. Normalize vowels: a,e,i,o,u → A, y → Y
  4. Collapse runs of identical characters
  5. Strip trailing silent-e (only if original ended with e)
```python
from phonemenal import fallback

# Same key = likely homophones
fallback.phonetic_key("numpy")   # → "nAmpY"
fallback.phonetic_key("numpie")  # → "nAmpY"

# Different keys — compare with similarity
k1 = fallback.phonetic_key("flask")
k2 = fallback.phonetic_key("flazk")
fallback.similarity(k1, k2)  # high score
```
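The five steps can be sketched like this. It is a deliberately simplified reconstruction: the digraph table is abbreviated, and the ordering details (checking the original trailing *e* before vowel normalization) are assumptions, so it will not reproduce the library's keys in every case.

```python
import re

def phonetic_key(name):
    """Simplified sketch of the fallback encoding steps (not the real encoder)."""
    s = re.sub(r"[-_.]", "", name.lower())             # 1. strip separators, lowercase
    for old, new in (("tion", "shn"), ("ph", "f"), ("ck", "k")):
        s = s.replace(old, new)                        # 2. digraphs (abbreviated table)
    ended_in_e = s.endswith("e")
    s = re.sub(r"[aeiou]", "A", s).replace("y", "Y")   # 3. normalize vowels
    s = re.sub(r"(.)\1+", r"\1", s)                    # 4. collapse identical runs
    if ended_in_e and s.endswith("A"):                 # 5. trailing silent-e
        s = s[:-1]
    return s

phonetic_key("numpy")     # → "nAmpY"
phonetic_key("pyttorch")  # doubled t collapses; same key as "pytorch"
```

The collapse step is what makes doubled-letter typosquats like "pyttorch" land on the same key as the real package name.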

The scanning pipeline automatically falls back to this encoder when the CMU-dict-based composite score returns 0.0.
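That dispatch can be sketched as below; `composite_fn` and `fallback_fn` are stand-in parameters for the real callables, not phonemenal API names, and the lambdas are dummies for illustration:

```python
def phonetic_similarity(word1, word2, composite_fn, fallback_fn):
    """Try CMU-dict composite scoring first; fall back on a 0.0 result."""
    score = composite_fn(word1, word2)
    if score == 0.0:  # 0.0 signals a missing pronunciation, per the pipeline
        score = fallback_fn(word1, word2)
    return score

phonetic_similarity("numpy", "numpie", lambda a, b: 0.0, lambda a, b: 0.9)  # → 0.9
phonetic_similarity("crowd", "crown", lambda a, b: 0.8, lambda a, b: 0.2)   # → 0.8
```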


Choosing the Right Approach

Are both words in the CMU Pronouncing Dictionary?
├── Yes → Use similarity.composite() or similarity.compare()
├── No  → Use fallback.phonetic_key() + fallback.similarity()
└── Mixed / Unsure → Use scanning.scan() (handles fallback automatically)

For batch processing or security scanning, the scanning module handles all of this for you — it tries composite scoring first, falls back to the phonetic encoder when needed, and supports both forward and reverse scanning.