

phonemenal

Phonetic similarity and homophone detection library for Python — near-homophones, sound-alike collisions, and variant generation.

phonemenal is a general-purpose phonetic similarity library that answers the question: "do these two words sound the same?" It is used for typosquatting detection, supply-chain security audits, linguistics research, brand-name collision checks, and any workflow that compares how words sound.


Why phonemenal?

Traditional string-distance tools (Levenshtein, Jaccard) measure how words are spelled, but often miss cases where the spelling differs while the sound is identical. phonemenal closes that gap with phoneme-level analysis grounded in the CMU Pronouncing Dictionary.
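
To make the gap concrete, here is a minimal sketch in plain Python that does not use phonemenal itself. The `PRONUNCIATIONS` table is a hand-written, hypothetical excerpt in the ARPAbet notation used by the CMU Pronouncing Dictionary (stress markers omitted), and `levenshtein` and `sounds_identical` are illustrative helper names, not part of the library's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance on characters (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical hand-written excerpt, ARPAbet style, stress markers omitted.
PRONUNCIATIONS = {
    "blue": ("B", "L", "UW"),
    "blew": ("B", "L", "UW"),
    "blow": ("B", "L", "OW"),
}


def sounds_identical(a: str, b: str) -> bool:
    """True only when both words are in the table and share one phoneme sequence."""
    pa = PRONUNCIATIONS.get(a)
    pb = PRONUNCIATIONS.get(b)
    return pa is not None and pa == pb


print(levenshtein("blue", "blew"), sounds_identical("blue", "blew"))
print(levenshtein("blew", "blow"), sounds_identical("blew", "blow"))
```

In this sketch "blue" and "blew" are two edits apart yet share a phoneme sequence, while "blew" and "blow" are only one edit apart and sound different, so spelling distance misleads in both directions.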

A key use case is phonetic typosquatting on package registries like PyPI, npm, and crates.io — where an attacker publishes numpie hoping developers will confuse it with numpy. But the same techniques apply anywhere you need to compare how words sound: brand-name screening, domain monitoring, search relevance, accessibility tooling, and more.

phonemenal provides:

  • Three complementary scoring algorithms grounded in computational phonology
  • CMU Pronouncing Dictionary integration for accurate phoneme-level comparison
  • A fast fallback encoder for brand names and neologisms not in any dictionary
  • Variant generation to proactively find attack candidates
  • Compound-word splitting that recombines homophones of each component
  • Optional LLM-powered deep analysis for ambiguous cases
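
As a rough illustration of the variant-generation idea in the list above, the following sketch applies one sound-preserving spelling substitution at a time. The `SUBSTITUTIONS` table and the `generate_variants` function are hypothetical and written for this example only; the patterns phonemenal actually uses may differ.

```python
# Hypothetical substitution pairs (spelling -> sound-alike spelling).
SUBSTITUTIONS = (
    ("ph", "f"),
    ("f", "ph"),
    ("s", "z"),
    ("c", "k"),
    ("y", "ie"),
    ("ie", "y"),
)


def generate_variants(name: str) -> set[str]:
    """Return spellings reachable by exactly one substitution at one position."""
    variants = set()
    for old, new in SUBSTITUTIONS:
        start = name.find(old)
        while start != -1:
            variants.add(name[:start] + new + name[start + len(old):])
            start = name.find(old, start + 1)
    variants.discard(name)  # never report the input itself
    return variants


print(sorted(generate_variants("flask")))  # ['flazk', 'phlask']
print(sorted(generate_variants("numpy")))  # ['numpie']
```

A single-substitution pass keeps the candidate set small; chaining substitutions would widen coverage at the cost of many more candidates to check.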

At a Glance

from phonemenal import similarity, homophones, variants, scanning

# Are these two names phonetically similar?
similarity.composite("numpy", "numpie")  # → 0.0 (not in CMU dict — use fallback)

# Find exact homophones
homophones.find("blue")  # → ["blew"]

# Generate attack candidates
variants.generate("flask")  # → {"phlask", "flazk", ...}

# Scan a batch of candidates against known packages
results = scanning.scan(
    candidates=["numpie", "requests2"],
    known_names=["numpy", "requests"],
)

Feature Highlights

| Feature            | Module                  | Description                            |
|--------------------|-------------------------|----------------------------------------|
| PPC-A scoring      | similarity.ppc          | Positional phoneme correlation         |
| PLD scoring        | similarity.pld          | Syllable-level edit distance           |
| LCS scoring        | similarity.lcs          | Longest common subsequence on phonemes |
| Composite          | similarity.composite    | Weighted average of all three          |
| Exact homophones   | homophones.find         | CMU dict inversion lookup              |
| Near-homophones    | homophones.find_similar | Threshold-based fuzzy search           |
| Variant generation | variants.generate       | Phonetic substitution patterns         |
| Compound splitting | splitting.split         | ML-based word segmentation             |
| Fallback encoder   | fallback.phonetic_key   | Works without CMU dict                 |
| Batch scanning     | scanning.scan           | Forward + reverse collision detection  |
| LLM analysis       | llm.analyze             | Deep analysis via Anthropic/OpenAI     |
| Full CLI           | phonemenal              | Rich terminal interface                |
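
For readers unfamiliar with the scoring ideas in the table, the sketch below shows one way a longest-common-subsequence score over phoneme sequences and a weighted average of several scores could be computed. It is not phonemenal's implementation: the normalization `2 * LCS / (len(a) + len(b))`, the example weights, and the function names `lcs_length`, `lcs_score`, and `weighted_average` are all assumptions made for illustration, and the phoneme tuples are hand-written ARPAbet with stress markers omitted.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalization: 1.0 for identical sequences, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def weighted_average(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Combine several similarity scores; the weights here are purely illustrative."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


flask = ("F", "L", "AE", "S", "K")
flash = ("F", "L", "AE", "SH")
print(lcs_length(flask, flash))           # 3 shared phonemes, in order
print(round(lcs_score(flask, flash), 3))  # 0.667
print(weighted_average([1.0, 0.5, 0.0], [1, 1, 2]))  # 0.375
```

Working on phoneme tuples rather than characters means a multi-letter sound such as "SH" counts as one unit, which is the main difference from running LCS on the spelling.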

Background and Research

phonemenal stems from previous research on homophonic collisions conducted by Reagan Short and Justin Ibarra. Their TROOPERS 2023 talk — Homophonic Collisions: Hold me closer Tony Danza — covers the problem space and the approach to phonetic similarity detection that inspired this library.
