phonemenal

Phonetic similarity and homophone detection library for Python — near-homophones, sound-alike collisions, and variant generation.

phonemenal is a general-purpose phonetic similarity library that answers the question: "do these two words sound the same?" It is used for typosquatting detection, supply-chain security audits, linguistics research, brand-name collision checks, and any workflow that compares how words sound.

Why phonemenal?

Traditional string-distance tools (Levenshtein, Jaccard) measure how words are spelled, but often miss cases where the spelling differs while the sound is identical. phonemenal closes that gap with phoneme-level analysis grounded in the CMU Pronouncing Dictionary.
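To make the gap concrete, the sketch below compares spelling distance with pronunciation for "blue" and "blew". The `levenshtein` helper and the two-entry `PRONUNCIATIONS` table are illustrative stand-ins written for this example (the phoneme strings follow CMU Pronouncing Dictionary notation); neither is part of phonemenal's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Illustrative entries in CMU dictionary notation (ARPAbet with stress digits).
PRONUNCIATIONS = {
    "blue": ["B", "L", "UW1"],
    "blew": ["B", "L", "UW1"],
}

spelling_distance = levenshtein("blue", "blew")
same_sound = PRONUNCIATIONS["blue"] == PRONUNCIATIONS["blew"]
print(spelling_distance, same_sound)  # prints: 2 True
```

Half of the four letters differ, so a spelling-based metric rates the pair as only moderately similar, while the phoneme sequences are identical.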

A key use case is detecting phonetic typosquatting on package registries like PyPI, npm, and crates.io, where an attacker publishes numpie hoping developers will mistake it for numpy. But the same techniques apply anywhere you need to compare how words sound: brand-name screening, domain monitoring, search relevance, accessibility tooling, and more.

phonemenal provides:

Three complementary scoring algorithms grounded in computational phonology
CMU Pronouncing Dictionary integration for accurate phoneme-level comparison
A fast fallback encoder for brand names and neologisms not in any dictionary
Variant generation to proactively find attack candidates
Compound word splitting with homophone permutation recombination
Optional LLM-powered deep analysis for ambiguous cases
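As a rough illustration of the variant-generation idea in the list above, the following sketch applies one sound-preserving respelling at a time. The `SUBSTITUTIONS` table and the `single_substitution_variants` function are hypothetical and exist only for this example; the pattern set that `variants.generate` actually uses may be quite different.

```python
# Hypothetical sound-preserving respellings, assumed for illustration only.
SUBSTITUTIONS = [
    ("f", "ph"),
    ("ph", "f"),
    ("s", "z"),
    ("c", "k"),
    ("k", "c"),
    ("y", "ie"),
]


def single_substitution_variants(name: str) -> set[str]:
    """Return respellings that apply one pattern at one position."""
    results = set()
    for src, dst in SUBSTITUTIONS:
        start = name.find(src)
        while start != -1:
            results.add(name[:start] + dst + name[start + len(src):])
            start = name.find(src, start + 1)
    results.discard(name)  # never report the input itself
    return results


print(sorted(single_substitution_variants("flask")))
# prints: ['flasc', 'flazk', 'phlask']
```

Applying a single substitution per variant keeps the candidate set small; chaining substitutions would find more candidates at the cost of a rapidly growing search space.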

At a Glance

from phonemenal import similarity, homophones, variants, scanning

# Are these two names phonetically similar?
similarity.composite("numpy", "numpie")  # → 0.0 (not in the CMU dict; use the fallback encoder)

# Find exact homophones
homophones.find("blue")  # → ["blew"]

# Generate attack candidates
variants.generate("flask")  # → {"phlask", "flazk", ...}

# Scan a batch of candidates against known packages
results = scanning.scan(
    candidates=["numpie", "requests2"],
    known_names=["numpy", "requests"],
)
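
The first call above scores 0.0 because of a missing CMU dictionary entry. A fallback encoder avoids that dependency by deriving a comparison key from spelling alone. The sketch below shows one minimal way this could work; the rewrite rules and the `crude_phonetic_key` name are assumptions made for this example and do not reproduce `fallback.phonetic_key`.

```python
import re

# Hypothetical rewrite rules, applied in order; not phonemenal's actual rules.
_DIGRAPHS = [("ph", "f"), ("ck", "k"), ("kn", "n"), ("wr", "r")]
_CONSONANT_CLASSES = str.maketrans({"c": "k", "q": "k", "z": "s"})


def crude_phonetic_key(name: str) -> str:
    """Map a name to a coarse key so that sound-alike spellings collide."""
    s = re.sub(r"[^a-z]", "", name.lower())  # keep letters only
    for src, dst in _DIGRAPHS:
        s = s.replace(src, dst)
    s = s.translate(_CONSONANT_CLASSES)      # merge similar consonants
    s = re.sub(r"[aeiouy]+", "A", s)         # collapse each vowel run
    s = re.sub(r"(.)\1+", r"\1", s)          # collapse doubled letters
    return s.upper()


for candidate, known in [("numpie", "numpy"), ("phlask", "flask")]:
    collides = crude_phonetic_key(candidate) == crude_phonetic_key(known)
    print(candidate, known, collides)
# prints:
# numpie numpy True
# phlask flask True
```

A key this coarse trades precision for recall: it catches respellings of names no dictionary knows, but it will also merge some words that sound different, so matches are best treated as candidates for closer review.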

Feature Highlights

Feature	Module	Description
PPC-A scoring	`similarity.ppc`	Positional phoneme correlation
PLD scoring	`similarity.pld`	Syllable-level edit distance
LCS scoring	`similarity.lcs`	Longest common subsequence on phonemes
Composite	`similarity.composite`	Weighted average of all three
Exact homophones	`homophones.find`	CMU dict inversion lookup
Near-homophones	`homophones.find_similar`	Threshold-based fuzzy search
Variant generation	`variants.generate`	Phonetic substitution patterns
Compound splitting	`splitting.split`	ML-based word segmentation
Fallback encoder	`fallback.phonetic_key`	Works without CMU dict
Batch scanning	`scanning.scan`	Forward + reverse collision detection
LLM analysis	`llm.analyze`	Deep analysis via Anthropic/OpenAI
Full CLI	`phonemenal`	Rich terminal interface
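
Two rows of the table, the dictionary inversion lookup and LCS scoring on phonemes, can be sketched in a few lines. Everything below is an assumption for illustration: the five-entry `PRONUNCIATIONS` excerpt stands in for the full CMU dictionary, the function names are invented for this example, and the `2 * LCS / (len(a) + len(b))` normalization is one common choice rather than a confirmed detail of `similarity.lcs`.

```python
from collections import defaultdict

# Illustrative entries in CMU dictionary notation; a real run would load the full dictionary.
PRONUNCIATIONS = {
    "blue": ("B", "L", "UW1"),
    "blew": ("B", "L", "UW1"),
    "right": ("R", "AY1", "T"),
    "write": ("R", "AY1", "T"),
    "ride": ("R", "AY1", "D"),
}


def build_inverse_index(pronunciations):
    """Invert word -> phonemes into phonemes -> set of words."""
    index = defaultdict(set)
    for word, phonemes in pronunciations.items():
        index[phonemes].add(word)
    return index


def exact_homophones(word, pronunciations, index):
    """Return other words that share this word's exact phoneme sequence."""
    phonemes = pronunciations.get(word)
    if phonemes is None:
        return []
    return sorted(index[phonemes] - {word})


def lcs_length(a, b):
    """Length of the longest common subsequence of two phoneme sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Scale LCS length into [0, 1]; this normalization is an assumption."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


index = build_inverse_index(PRONUNCIATIONS)
print(exact_homophones("blue", PRONUNCIATIONS, index))  # prints: ['blew']
print(round(lcs_similarity(PRONUNCIATIONS["right"], PRONUNCIATIONS["ride"]), 3))  # prints: 0.667
```

The inverse index makes exact homophone lookup a single dictionary access, while the LCS score gives partial credit to pairs such as "right" and "ride" that share most, but not all, of their phonemes.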

Background and Research

phonemenal builds on earlier research into homophonic collisions by Reagan Short and Justin Ibarra. Their TROOPERS 2023 talk, "Homophonic Collisions: Hold me closer Tony Danza", covers the problem space and the approach to phonetic similarity detection that inspired this library.

[Image: Coils of Communication Chaos]

Quick Links

Installation — get up and running
Quick Start — first steps with the library and CLI
Algorithms — how the scoring works under the hood
CLI Reference — complete command-line usage
API Reference — full module documentation