Algorithms

phonemenal provides three complementary phonetic similarity algorithms. Each captures a different aspect of how words sound alike, and all return normalized scores between 0.0 (completely different) and 1.0 (identical).

Overview

| Algorithm | What It Measures | Strengths | Weaknesses |
|-----------|------------------|-----------|------------|
| PPC-A | Positional phoneme overlap | Captures structural similarity; sensitive to phoneme position | Can overweight shared padding in short words |
| PLD | Syllable-level edit distance | Mirrors perceptual syllable grouping; stress-aware | Requires CMU dict; coarser than phoneme-level |
| LCS | Longest common subsequence | Robust to insertions; good fallback | Order-sensitive; ignores phoneme position |
| Composite | Weighted average of all three | Balanced; configurable | Only as good as its components |

PPC-A — Positional Phoneme Correlation (Absolute)

PPC-A measures how much of the positional phoneme structure two words share.

How It Works

  1. Retrieve pronunciations from the CMU Pronouncing Dictionary
  2. Build positional combinations — for each phoneme at index i, create pairs with its neighbors: (phoneme_i, phoneme_i+1), (phoneme_i, phoneme_i-1), with boundary padding
  3. Compute set intersection between the two words' combination sets
  4. Normalize by the size of the union
```python
from phonemenal import similarity

score = similarity.ppc("crowd", "crown")

# With details
score, details = similarity.ppc("crowd", "crown", raw=True)
# details includes pronunciations, combo sets, intersection info
```
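The combination-and-overlap steps above can be sketched in plain Python. This is an illustrative reconstruction, not phonemenal's actual implementation: the helper names, the `#` padding symbol, and the exact pair scheme are assumptions.

```python
def positional_combos(phonemes):
    """Build the set of positional phoneme pairs for one word (sketch)."""
    padded = ["#"] + list(phonemes) + ["#"]  # boundary padding
    combos = set()
    for i in range(1, len(padded) - 1):
        combos.add((i, padded[i], padded[i + 1]))  # (phoneme_i, phoneme_i+1)
        combos.add((i, padded[i], padded[i - 1]))  # (phoneme_i, phoneme_i-1)
    return combos

def ppc_a(phonemes1, phonemes2):
    """Intersection over union of the two words' combination sets."""
    c1, c2 = positional_combos(phonemes1), positional_combos(phonemes2)
    return len(c1 & c2) / len(c1 | c2)

# CMU pronunciations (looked up from the dictionary by the real library):
crowd = ["K", "R", "AW1", "D"]
crown = ["K", "R", "AW1", "N"]
clown = ["K", "L", "AW1", "N"]

ppc_a(crowd, crown)  # higher: only the final consonant differs
ppc_a(crowd, clown)  # lower: the onset structure differs too
```

Because each pair carries its index, a phoneme match only counts when it occurs at the same position in both words — which is what makes the score position-sensitive.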

When to Use

  • Best for words of similar length where positional overlap matters
  • Good at distinguishing "crowd" vs "crown" (differ only in final consonant) from "crowd" vs "clown" (differ in structure)

PLD — Phoneme Levenshtein Distance

PLD computes edit distance at the syllable level rather than individual phonemes, reflecting how humans perceive speech as syllable groups.

How It Works

  1. Retrieve pronunciations from CMU dict
  2. Split into syllables using stress markers (vowels with 0/1/2 stress digits mark syllable nuclei)
  3. Compute Levenshtein distance between syllable sequences using RapidFuzz, where each syllable is an atomic token
  4. Normalize to 0.0–1.0: 1 - (distance / max_syllables)
```python
from phonemenal import similarity

score = similarity.pld("elastic", "fantastic")

# With details
score, details = similarity.pld("elastic", "fantastic", raw=True)
# details includes syllable breakdowns for both words
```
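A minimal sketch of the four steps above, assuming a naive syllabification rule and substituting a plain-Python edit distance for RapidFuzz (which the library actually uses):

```python
def split_syllables(phonemes):
    """Naive syllabification: a stress digit (0/1/2) marks a nucleus and
    closes the current syllable; trailing consonants join the last one."""
    syllables, current = [], []
    for ph in phonemes:
        current.append(ph)
        if ph[-1] in "012":
            syllables.append(tuple(current))
            current = []
    if current:
        if syllables:
            syllables[-1] = syllables[-1] + tuple(current)
        else:
            syllables.append(tuple(current))
    return syllables

def levenshtein(a, b):
    """Edit distance where each syllable is one atomic token."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def pld(phonemes1, phonemes2):
    s1, s2 = split_syllables(phonemes1), split_syllables(phonemes2)
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))

elastic = ["IH0", "L", "AE1", "S", "T", "IH0", "K"]
fantastic = ["F", "AE0", "N", "T", "AE1", "S", "T", "IH0", "K"]
pld(elastic, fantastic)  # the final "-stic" syllable matches exactly
```

With this grouping both words split into three syllables, the last one identical, so the distance is 2 and the score is 1 - 2/3.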

When to Use

  • Best for words with different lengths where syllable structure matters
  • Captures that "elastic" and "fantastic" share the "-astic" syllable pattern
  • More perceptually aligned than raw phoneme comparison

LCS — Longest Common Subsequence

LCS finds the longest subsequence common to both phoneme sequences and normalizes by total length.

How It Works

  1. Retrieve phoneme sequences from CMU dict (or fall back to raw characters)
  2. Compute LCS via dynamic programming
  3. Normalize: (2 * lcs_length) / (len(seq1) + len(seq2))
```python
from phonemenal import similarity

# Phoneme-based (default)
score = similarity.lcs("packaging", "packages")

# Character-based fallback (for non-dictionary words)
score = similarity.lcs("pytorch", "pytorche", use_phonemes=False)

# With details
score, details = similarity.lcs("packaging", "packages", raw=True)
```
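The dynamic-programming step and the normalization can be sketched as follows; the function names are illustrative, and the same code works on phoneme lists or on raw character strings (the fallback mode):

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def lcs_similarity(a, b):
    # (2 * lcs_length) / (len(seq1) + len(seq2))
    return 2 * lcs_length(a, b) / (len(a) + len(b))

lcs_similarity("packaging", "packages")  # character-based for illustration
```

For "packaging" vs "packages" the LCS is "packag" (length 6), giving 12 / 17 — a high score despite the differing suffixes, which is the insertion-robustness the section describes.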

When to Use

  • Good fallback when words aren't in CMU dict (with use_phonemes=False)
  • Robust to insertions — handles the "packaging" vs "packages" case well
  • Less sensitive to position than PPC-A

Composite Score

The composite score is a weighted average of all three algorithms:

\[\text{composite} = \frac{w_1 \cdot \text{PPC} + w_2 \cdot \text{PLD} + w_3 \cdot \text{LCS}}{w_1 + w_2 + w_3}\]
```python
from phonemenal import similarity

# Default: equal weights (1.0, 1.0, 1.0)
score = similarity.composite("crowd", "crown")

# Custom weights: emphasize PLD
score = similarity.composite("crowd", "crown", weights=(0.5, 2.0, 0.5))

# Full report with all individual scores
report = similarity.compare("crowd", "crown")
```
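The formula maps directly onto a few lines. The component scores below are made-up numbers for illustration, not real outputs:

```python
def composite(ppc, pld, lcs, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three component scores."""
    w1, w2, w3 = weights
    return (w1 * ppc + w2 * pld + w3 * lcs) / (w1 + w2 + w3)

composite(0.6, 0.9, 0.3)                           # equal weights → 0.6
composite(0.6, 0.9, 0.3, weights=(0.5, 2.0, 0.5))  # emphasizing PLD → 0.75
```

Dividing by the weight sum keeps the result in 0.0–1.0 regardless of how the weights are scaled.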

Choosing Weights

| Scenario | Suggested Weights (PPC, PLD, LCS) | Rationale |
|----------|-----------------------------------|-----------|
| General purpose | (1.0, 1.0, 1.0) | Balanced |
| Short names (3–5 chars) | (0.5, 1.0, 1.5) | PPC can overweight padding; LCS is more stable |
| Long compound names | (1.0, 1.5, 1.0) | Syllable structure matters more |
| Strict matching | (1.0, 1.0, 1.0) with a high threshold | Keep default weights; raise the threshold to 0.85+ |

Fallback Encoder

When words aren't in the CMU Pronouncing Dictionary — which is common for package names like numpy, pytorch, or fastapi — phonemenal falls back to a Metaphone-inspired encoder.

How It Works

  1. Strip separators (-, _, .), lowercase
  2. Apply digraph replacements: ph→f, ck→k, tion→shn, etc.
  3. Normalize vowels: a,e,i,o,u → A, y → Y
  4. Collapse runs of identical characters
  5. Strip trailing silent-e (only if original ended with e)
```python
from phonemenal import fallback

# Same key = likely homophones
fallback.phonetic_key("numpy")   # → "nAmpY"
fallback.phonetic_key("numpie")  # → "nAmpY"

# Different keys — compare with similarity
k1 = fallback.phonetic_key("flask")
k2 = fallback.phonetic_key("flazk")
fallback.similarity(k1, k2)  # high score
```
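The five steps can be sketched like this. It is a deliberately simplified reconstruction: the digraph table is abbreviated, and the ordering details (checking the original trailing *e* before vowel normalization) are assumptions, so it will not reproduce the library's keys in every case.

```python
import re

def phonetic_key(name):
    """Simplified sketch of the fallback encoding steps (not the real encoder)."""
    s = re.sub(r"[-_.]", "", name.lower())             # 1. strip separators, lowercase
    for old, new in (("tion", "shn"), ("ph", "f"), ("ck", "k")):
        s = s.replace(old, new)                        # 2. digraphs (abbreviated table)
    ended_in_e = s.endswith("e")
    s = re.sub(r"[aeiou]", "A", s).replace("y", "Y")   # 3. normalize vowels
    s = re.sub(r"(.)\1+", r"\1", s)                    # 4. collapse identical runs
    if ended_in_e and s.endswith("A"):                 # 5. trailing silent-e
        s = s[:-1]
    return s

phonetic_key("numpy")     # → "nAmpY"
phonetic_key("pyttorch")  # doubled t collapses; same key as "pytorch"
```

The collapse step is what makes doubled-letter typosquats like "pyttorch" land on the same key as the real package name.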

The scanning pipeline automatically falls back to this encoder when the CMU-dict-based composite score returns 0.0.
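That dispatch can be sketched as below; `composite_fn` and `fallback_fn` are stand-in parameters for the real callables, not phonemenal API names, and the lambdas are dummies for illustration:

```python
def phonetic_similarity(word1, word2, composite_fn, fallback_fn):
    """Try CMU-dict composite scoring first; fall back on a 0.0 result."""
    score = composite_fn(word1, word2)
    if score == 0.0:  # 0.0 signals a missing pronunciation, per the pipeline
        score = fallback_fn(word1, word2)
    return score

phonetic_similarity("numpy", "numpie", lambda a, b: 0.0, lambda a, b: 0.9)  # → 0.9
phonetic_similarity("crowd", "crown", lambda a, b: 0.8, lambda a, b: 0.2)   # → 0.8
```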


Choosing the Right Approach

Are both words in the CMU Pronouncing Dictionary?
├── Yes → Use similarity.composite() or similarity.compare()
├── No  → Use fallback.phonetic_key() + fallback.similarity()
└── Mixed / Unsure → Use scanning.scan() (handles fallback automatically)

For batch processing or security scanning, the scanning module handles all of this for you — it tries composite scoring first, falls back to the phonetic encoder when needed, and supports both forward and reverse scanning.