similarity
Phonetic similarity scoring with four complementary algorithms.
All public functions return normalized scores between 0.0 (completely different) and 1.0 (identical). Pass raw=True to get a (score, details) tuple with intermediate computation data.
phonemenal.similarity
Four phonetic similarity algorithms, all normalized to 0.0–1.0.
- PPC-A (Positional Phoneme Correlation — Absolute): Measures overlap of positional phoneme patterns between two words. Based on building forward and reverse phoneme combinations with positional padding.
- PLD (Phoneme Levenshtein Distance): Syllable-level edit distance using rapidfuzz. Treats each syllable as an atomic unit so distance reflects how many whole syllables differ.
- PED (Phoneme Edit Distance): Phoneme-level edit distance using stress-stripped CMU pronunciations. Complements PLD for short or monosyllabic pairs where syllable-level scoring is too coarse.
- LCS (Longest Common Subsequence): Ratio-based scoring on phonetic keys or raw phoneme sequences.
Composite scoring combines PPC-A, an edit channel, and LCS with configurable weights and edit-mode selection.
ppc(word1: str, word2: str, *, raw: bool = False) -> float | tuple[float, dict]
Positional Phoneme Correlation — Absolute (PPC-A).
Builds positional phoneme combinations by traversing forward and reverse directions with padding, then measures set intersection.
Returns normalized score 0.0–1.0 (higher = more similar). If raw=True, returns (score, details_dict) with intermediate values.
Source code in phonemenal/similarity.py
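The exact combination scheme used by ppc is not spelled out on this page, so the following is only one illustrative reading of "positional combinations, forward and reverse, with padding", not the library's code. The names `positional_combos` and `overlap_score`, the `_` pad symbol, the use of adjacent pairs, and the division by the larger set size are all assumptions made for the sketch.

```python
from typing import Sequence, Set, Tuple

PAD = "_"  # hypothetical padding symbol marking the word boundary

Combo = Tuple[str, int, str, str]


def positional_combos(phonemes: Sequence[str]) -> Set[Combo]:
    """Collect (direction, position, phoneme, next_phoneme) tuples.

    Forward positions are counted from the start of the word and reverse
    positions from the end, so shared prefixes match on "fwd" entries and
    shared suffixes match on "rev" entries.
    """
    combos: Set[Combo] = set()
    directions = (("fwd", list(phonemes)), ("rev", list(reversed(phonemes))))
    for direction, seq in directions:
        padded = [PAD] + seq + [PAD]
        for i in range(len(padded) - 1):
            combos.add((direction, i, padded[i], padded[i + 1]))
    return combos


def overlap_score(p1: Sequence[str], p2: Sequence[str]) -> float:
    """Shared combinations divided by the size of the larger set (assumed)."""
    c1, c2 = positional_combos(p1), positional_combos(p2)
    return len(c1 & c2) / max(len(c1), len(c2))


# Stress-stripped phonemes written by hand for "cat" and "bat".
cat = ["K", "AE", "T"]
bat = ["B", "AE", "T"]
score = overlap_score(cat, bat)  # 4 shared combinations out of 8
```

In this reading, the two words share the pair ending the word in the forward pass and the pair starting the reversed word in the reverse pass, which is why a rhyming pair scores well above zero even though the onsets differ.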
pld(word1: str, word2: str, *, raw: bool = False) -> float | tuple[float, dict]
Phoneme Levenshtein Distance at syllable level.
Each syllable is treated as an atomic unit (tuple of phonemes). Distance is computed between syllable sequences, then normalized to 0.0–1.0 where 1.0 = identical and 0.0 = maximally different.
If raw=True, returns (score, details_dict).
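As a rough sketch of syllable-level scoring, the snippet below computes a Levenshtein distance over tuples of phonemes and converts it to a similarity. It uses a plain dynamic-programming routine rather than rapidfuzz to stay dependency-free, the syllable splits are written by hand, and dividing by the longer sequence length is an assumed normalization; `syllable_similarity` is a hypothetical name.

```python
from typing import Hashable, Sequence, Tuple

Syllable = Tuple[str, ...]


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance over any two sequences of hashable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def syllable_similarity(s1: Sequence[Syllable], s2: Sequence[Syllable]) -> float:
    """1.0 for identical syllable sequences, 0.0 when every syllable differs."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s1, s2) / longest


# Hand-syllabified CMU-style pronunciations (the splits are an assumption).
table = [("T", "EY1"), ("B", "AH0", "L")]
cable = [("K", "EY1"), ("B", "AH0", "L")]
score = syllable_similarity(table, cable)  # one of two syllables differs
```

Because a syllable either matches exactly or not at all, a single differing onset costs a whole unit here.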
ped(word1: str, word2: str, *, raw: bool = False) -> float | tuple[float, dict]
Phoneme edit distance on stress-stripped pronunciations.
This operates at the phoneme level rather than the syllable level. It is especially useful for short words and monosyllables where syllable-level PLD often collapses to either 1.0 or 0.0.
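The monosyllable case can be illustrated with a small sketch. It strips ARPAbet stress digits, runs a plain Levenshtein distance over individual phonemes, and compares the result with treating the whole word as one syllable. The helper names and the normalization by the longer sequence are assumptions, not the library's implementation.

```python
import re
from typing import Hashable, List, Sequence


def strip_stress(phonemes: Sequence[str]) -> List[str]:
    """Drop the trailing stress digit from vowels, e.g. "AE1" -> "AE"."""
    return [re.sub(r"\d", "", p) for p in phonemes]


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance over any two sequences of hashable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phoneme_similarity(p1: Sequence[str], p2: Sequence[str]) -> float:
    """Similarity over stress-stripped phonemes (assumed normalization)."""
    a, b = strip_stress(p1), strip_stress(p2)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


cat = ["K", "AE1", "T"]
bat = ["B", "AE1", "T"]

# Phoneme level: one substitution out of three phonemes.
phoneme_score = phoneme_similarity(cat, bat)

# Syllable level: each word is a single syllable, and the two syllables differ.
syllable_distance = edit_distance([tuple(cat)], [tuple(bat)])
syllable_score = 1.0 - syllable_distance / 1
```

The phoneme-level score lands at roughly two thirds, while the syllable-level score for the same pair drops to 0.0, which is the collapse described above.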
lcs(word1: str, word2: str, *, use_phonemes: bool = True, raw: bool = False) -> float | tuple[float, dict]
Longest Common Subsequence ratio.
When use_phonemes=True (default), compares phoneme sequences from CMU dict. When use_phonemes=False, compares raw character strings (useful as fallback for words not in the dictionary).
Returns 0.0–1.0 where 1.0 = identical sequences. If raw=True, returns (score, details_dict).
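To show the difference between the two modes, here is a minimal LCS ratio sketch applied once to phoneme lists and once to raw characters. The page does not state which ratio formula is used; this sketch assumes twice the LCS length divided by the combined length, and `lcs_length` and `lcs_ratio` are hypothetical names.

```python
from typing import Hashable, Sequence


def lcs_length(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """2 * LCS / (len(a) + len(b)), an assumed normalization to 0.0-1.0."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 2.0 * lcs_length(a, b) / total


# "night" and "knight" share the pronunciation N AY1 T.
phoneme_score = lcs_ratio(["N", "AY1", "T"], ["N", "AY1", "T"])

# Comparing spellings instead: "night" is a subsequence of "knight".
char_score = lcs_ratio("night", "knight")
```

The phoneme comparison treats the homophones as identical, while the character comparison is penalized by the silent "k".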
composite(word1: str, word2: str, *, weights: tuple[float, float, float] = (1.0, 2.0, 1.0), edit_mode: str = 'max', raw: bool = False) -> float | tuple[float, dict]
Weighted composite of PPC-A, edit distance, and LCS scores.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| word1 | str | First word to compare. | required |
| word2 | str | Second word to compare. | required |
| weights | tuple[float, float, float] | (ppc_weight, edit_weight, lcs_weight). Default weights emphasize edit similarity. | (1.0, 2.0, 1.0) |
| edit_mode | str | "max" (default) uses the stronger of PLD/PED. "length" uses PED for monosyllable-vs-monosyllable pairs and PLD otherwise. | 'max' |
| raw | bool | If True, return (composite_score, details_dict). | False |
Returns:
| Type | Description |
|---|---|
| float \| tuple[float, dict] | Composite similarity score between 0.0 and 1.0. |
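The way the weights and edit_mode interact can be sketched with precomputed component scores. The edit-channel selection follows the parameter description above; dividing by the sum of the weights is an assumption made so the result stays within 0.0 to 1.0, and both function names are hypothetical.

```python
from typing import Tuple


def pick_edit_score(
    pld: float,
    ped: float,
    syllables1: int,
    syllables2: int,
    edit_mode: str = "max",
) -> float:
    """Choose the edit-channel score from PLD and PED."""
    if edit_mode == "max":
        return max(pld, ped)
    if edit_mode == "length":
        # PED only when both words are monosyllables, PLD otherwise.
        both_monosyllabic = syllables1 == 1 and syllables2 == 1
        return ped if both_monosyllabic else pld
    raise ValueError(f"unknown edit_mode: {edit_mode!r}")


def weighted_composite(
    ppc: float,
    edit: float,
    lcs: float,
    weights: Tuple[float, float, float] = (1.0, 2.0, 1.0),
) -> float:
    """Weighted mean of the three channels (normalization is assumed)."""
    w_ppc, w_edit, w_lcs = weights
    total = w_ppc + w_edit + w_lcs
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_ppc * ppc + w_edit * edit + w_lcs * lcs) / total


# Made-up component scores for a pair of two-syllable words.
ppc_score, pld_score, ped_score, lcs_score = 0.5, 0.0, 0.75, 0.5

edit_max = pick_edit_score(pld_score, ped_score, 2, 2, edit_mode="max")
edit_len = pick_edit_score(pld_score, ped_score, 2, 2, edit_mode="length")

composite_max = weighted_composite(ppc_score, edit_max, lcs_score)
composite_len = weighted_composite(ppc_score, edit_len, lcs_score)
```

With the default weights the edit channel counts for half of the total, so switching from "max" to "length" moves the composite from 0.625 to 0.25 for these illustrative inputs.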
compare(word1: str, word2: str, *, weights: Optional[tuple[float, float, float]] = None, edit_mode: str = 'max') -> dict
Full comparison report between two words.
Returns a dict with all individual scores, composite score, and pronunciation details.