phonemenal
Phonetic similarity and homophone detection library for Python — near-homophones, sound-alike collisions, and variant generation.
phonemenal is a general-purpose phonetic similarity library that answers the question: "do these two words sound the same?" It is used for typosquatting detection, supply-chain security audits, linguistics research, brand-name collision checks, and any workflow that compares how words sound.
Why phonemenal?
Traditional string-distance tools (Levenshtein, Jaccard) measure how words are spelled, but often miss cases where the spelling differs while the sound is identical. phonemenal closes that gap with phoneme-level analysis grounded in the CMU Pronouncing Dictionary.
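To make the gap concrete, the sketch below compares "blue" and "blew" twice: once on spelling and once on phonemes. It is a minimal illustration, not phonemenal's implementation. The `levenshtein` helper and the small `PHONEMES` table are assumptions written for this example, with ARPAbet transcriptions entered by hand rather than loaded from the CMU Pronouncing Dictionary.

```python
def levenshtein(a, b):
    """Classic edit distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


# Hand-entered ARPAbet transcriptions, for illustration only.
PHONEMES = {
    "blue": ["B", "L", "UW1"],
    "blew": ["B", "L", "UW1"],
}

spelling_distance = levenshtein("blue", "blew")
phoneme_distance = levenshtein(PHONEMES["blue"], PHONEMES["blew"])
print(spelling_distance, phoneme_distance)
```

The spelling comparison reports two edits, while the phoneme comparison reports none: the words are written differently but pronounced identically.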
A key use case is phonetic typosquatting on package registries like PyPI, npm, and crates.io — where an attacker publishes numpie hoping developers will confuse it with numpy. But the same techniques apply anywhere you need to compare how words sound: brand-name screening, domain monitoring, search relevance, accessibility tooling, and more.
phonemenal provides:
- Three complementary scoring algorithms grounded in computational phonology
- CMU Pronouncing Dictionary integration for accurate phoneme-level comparison
- A fast fallback encoder for brand names and neologisms not in any dictionary
- Variant generation to proactively find attack candidates
- Compound word splitting with homophone permutation recombination
- Optional LLM-powered deep analysis for ambiguous cases
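As a rough picture of what variant generation can look like, the sketch below applies a handful of sound-preserving spelling substitutions to a name. The `SUBSTITUTIONS` table and the `generate_variants` function are hypothetical and exist only for this example; phonemenal's actual pattern set and API may differ.

```python
# Hypothetical grapheme substitutions; not phonemenal's actual pattern table.
SUBSTITUTIONS = {
    "f": ["ph"],
    "ph": ["f"],
    "s": ["z"],
    "k": ["c", "q"],
    "y": ["ie"],
}


def generate_variants(name):
    """Return spellings that differ from `name` by exactly one substitution."""
    found = set()
    for source, replacements in SUBSTITUTIONS.items():
        start = name.find(source)
        while start != -1:
            for replacement in replacements:
                end = start + len(source)
                found.add(name[:start] + replacement + name[end:])
            start = name.find(source, start + 1)
    found.discard(name)  # never report the input itself
    return found


print(sorted(generate_variants("flask")))
print(sorted(generate_variants("numpy")))
```

Each candidate produced this way can then be checked against a registry to see whether someone has already published it.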
At a Glance
from phonemenal import similarity, homophones, variants, scanning
# Are these two names phonetically similar?
similarity.composite("numpy", "numpie") # → 0.0 (not in CMU dict — use fallback)
# Find exact homophones
homophones.find("blue") # → ["blew"]
# Generate attack candidates
variants.generate("flask") # → {"phlask", "flazk", ...}
# Scan a batch of candidates against known packages
results = scanning.scan(
    candidates=["numpie", "requests2"],
    known_names=["numpy", "requests"],
)
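The exact-homophone lookup in the snippet above can be pictured as an inverted pronunciation dictionary: group words by their phoneme sequence, then return the other words in the same group. The following is a minimal sketch under that assumption. `PRONUNCIATIONS`, `build_index`, and `find_homophones` are illustrative names, and the tiny hand-entered table stands in for the full CMU Pronouncing Dictionary.

```python
from collections import defaultdict

# Tiny hand-entered pronunciation table (ARPAbet); a stand-in for the CMU dict.
PRONUNCIATIONS = {
    "blue": ("B", "L", "UW1"),
    "blew": ("B", "L", "UW1"),
    "night": ("N", "AY1", "T"),
    "knight": ("N", "AY1", "T"),
    "flask": ("F", "L", "AE1", "S", "K"),
}


def build_index(pronunciations):
    """Invert word -> phonemes into phonemes -> set of words."""
    index = defaultdict(set)
    for word, phonemes in pronunciations.items():
        index[phonemes].add(word)
    return index


def find_homophones(word, pronunciations, index):
    """Return other words that share the exact phoneme sequence of `word`."""
    phonemes = pronunciations.get(word)
    if phonemes is None:
        return []  # unknown word: no pronunciation to look up
    return sorted(index[phonemes] - {word})


INDEX = build_index(PRONUNCIATIONS)
print(find_homophones("blue", PRONUNCIATIONS, INDEX))
print(find_homophones("flask", PRONUNCIATIONS, INDEX))
```

Building the index once keeps each lookup cheap, at the cost of holding the inverted table in memory.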
Feature Highlights
| Feature | Module | Description |
|---|---|---|
| PPC-A scoring | `similarity.ppc` | Positional phoneme correlation |
| PLD scoring | `similarity.pld` | Syllable-level edit distance |
| LCS scoring | `similarity.lcs` | Longest common subsequence on phonemes |
| Composite | `similarity.composite` | Weighted average of all three |
| Exact homophones | `homophones.find` | CMU dict inversion lookup |
| Near-homophones | `homophones.find_similar` | Threshold-based fuzzy search |
| Variant generation | `variants.generate` | Phonetic substitution patterns |
| Compound splitting | `splitting.split` | ML-based word segmentation |
| Fallback encoder | `fallback.phonetic_key` | Works without CMU dict |
| Batch scanning | `scanning.scan` | Forward + reverse collision detection |
| LLM analysis | `llm.analyze` | Deep analysis via Anthropic/OpenAI |
| Full CLI | `phonemenal` | Rich terminal interface |
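Two rows of the table lend themselves to a short illustration: an LCS score computed over phoneme sequences, and a composite formed as a weighted average. The sketch below is an assumption-laden outline, not the library's code. Normalizing by the longer sequence, the equal weights, and the placeholder `ppc` and `pld` values are all choices made for this example, and `lcs_length`, `lcs_score`, and `weighted_composite` are hypothetical names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(a, b):
    """LCS length divided by the longer length (an assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def weighted_composite(scores, weights):
    """Weighted average of named scores; the weights need not sum to 1."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total


# Hand-entered ARPAbet sequences for "blue" and "blow".
blue = ["B", "L", "UW1"]
blow = ["B", "L", "OW1"]

scores = {
    "ppc": 0.5,   # placeholder value, not a real PPC-A result
    "pld": 0.75,  # placeholder value, not a real PLD result
    "lcs": lcs_score(blue, blow),
}
weights = {"ppc": 1.0, "pld": 1.0, "lcs": 1.0}  # hypothetical equal weights
print(scores["lcs"], weighted_composite(scores, weights))
```

Combining several scores this way means a single metric that happens to score a pair highly cannot dominate the final result on its own.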
Background and Research
phonemenal stems from previous research on homophonic collisions conducted by Reagan Short and Justin Ibarra. Their TROOPERS 2023 talk — Homophonic Collisions: Hold me closer Tony Danza — covers the problem space and the approach to phonetic similarity detection that inspired this library.
Quick Links
- Installation — get up and running
- Quick Start — first steps with the library and CLI
- Algorithms — how the scoring works under the hood
- CLI Reference — complete command-line usage
- API Reference — full module documentation