Scanning for Collisions
The scanning module is phonemenal's high-level collision detection pipeline. It combines the phonetic encoder, similarity scoring, and variant generation into a workflow designed for batch phonetic comparison. Common applications include package registry security scanning, brand-name audits, and any scenario where you need to check a set of names for sound-alike collisions.
Concepts
Forward Scan
Check a list of candidate names against a list of known names:
"Does numpie sound like any existing package?"
Reverse Scan
Generate variants of each candidate and check whether those variants exist somewhere (e.g., on PyPI):
"If someone publishes numpy, what sound-alike names might already exist?"
Collision Types
| Type | Meaning |
|---|---|
| exact_phonetic | Identical phonetic key (fallback encoder) |
| near_phonetic | Above-threshold similarity score |
Building a Phonetic Index
Before scanning, build an index for fast exact-key lookups:
from phonemenal.scanning import build_phonetic_index, scan
known = ["numpy", "requests", "flask", "click", "rich"]
index = build_phonetic_index(known)
# {"nAmpY": ["numpy"], "rAkwAsts": ["requests"], ...}
The scan() and scan_with_reverse() functions build this index internally, but you can pre-build it if you're scanning multiple batches against the same known set.
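To make the index idea concrete, here is a minimal sketch of the underlying data structure: a mapping from phonetic key to the names that share that key. The toy_key() function is a hypothetical stand-in invented for this example. It is not phonemenal's encoder and produces different keys; only the grouping pattern is the point.

```python
from collections import defaultdict


def toy_key(name: str) -> str:
    """Hypothetical stand-in encoder (NOT phonemenal's): lowercase, fold a
    couple of look-alike spellings, then drop vowels after the first letter."""
    s = name.lower().replace("ph", "f").replace("ck", "k")
    head, tail = s[:1], s[1:]
    return head + "".join(ch for ch in tail if ch not in "aeiouy")


def build_toy_index(names: list[str]) -> dict[str, list[str]]:
    """Group names by key so an exact-key lookup is a single dict access."""
    index: dict[str, list[str]] = defaultdict(list)
    for name in names:
        index[toy_key(name)].append(name)
    return dict(index)


toy_index = build_toy_index(["numpy", "requests", "flask", "click", "rich"])

# A candidate collides exactly when its key is already present in the index.
exact_hits = toy_index.get(toy_key("phlask"), [])
```

Because the index depends only on the known names, building it once and reusing it across batches avoids re-encoding the same known set for every scan.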
Forward Scanning
from phonemenal.scanning import scan
candidates = ["numpie", "requsets", "phlask"]
known = ["numpy", "requests", "flask"]
matches = scan(
    candidates=candidates,
    known_names=known,
    threshold=0.75,  # minimum similarity score
)
for m in matches:
    print(f"{m['candidate']} → {m['matched_name']} "
          f"({m['similarity']:.2f}, {m['collision_type']})")
Threshold Tuning
| Threshold | Behavior |
|---|---|
| 0.60 | Catches distant sound-alikes; more false positives |
| 0.75 | Default: balanced precision/recall |
| 0.85 | Strict; only very close matches |
| 1.00 | Exact phonetic key match only |
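The effect of the threshold can be illustrated without the library. The sketch below uses difflib.SequenceMatcher from the standard library as a stand-in scorer. That is an assumption made purely for illustration: it measures spelling similarity, not phonetic similarity, so its scores will differ from phonemenal's.

```python
from difflib import SequenceMatcher


def toy_similarity(a: str, b: str) -> float:
    """Stand-in scorer: plain string similarity in [0, 1], not phonetic."""
    return SequenceMatcher(None, a, b).ratio()


def matches_at(threshold: float, candidate: str, known: list[str]) -> list[str]:
    """Return the known names whose score meets or exceeds the threshold."""
    return [k for k in known if toy_similarity(candidate, k) >= threshold]


known = ["numpy", "numba", "sympy"]

loose = matches_at(0.60, "numpie", known)   # permissive: more names get through
strict = matches_at(0.85, "numpie", known)  # strict: only very close names
exact = matches_at(1.00, "numpy", known)    # only a perfect score passes
```

Raising the threshold can only shrink the result set, so a practical way to tune it is to start loose on a sample of names and tighten until the false positives become manageable.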
Reverse Scanning
Reverse scanning generates variants of each candidate, then checks if those variants exist in the real world:
from phonemenal.scanning import scan_with_reverse
def check_pypi(name: str) -> bool:
    """Check if a package exists on PyPI."""
    import httpx
    resp = httpx.head(f"https://pypi.org/project/{name}/")
    return resp.status_code == 200

matches = scan_with_reverse(
    candidates=["numpy"],
    known_names=["numpy"],
    exists_fn=check_pypi,        # called for each generated variant
    include_morphological=True,  # also try suffix swaps
    threshold=0.75,
)
The reverse scan:
- Generates phonetic variants via variants.generate()
- Optionally generates morphological variants via variants.generate_morphological()
- Calls exists_fn(variant) for each generated name
- Reports any that exist and also score above threshold against the candidate
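The steps above can be sketched as a small loop. Everything here is a hypothetical stand-in: toy_variants() uses a hand-written swap table rather than phonemenal's variant generators, the scorer is difflib.SequenceMatcher rather than a phonetic scorer, and the "registry" is an in-memory set so the example runs offline.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical spelling swaps used to produce sound-alike variants.
TOY_SWAPS = [("y", "ie"), ("f", "ph"), ("c", "k")]


def toy_variants(name: str) -> set[str]:
    """Stand-in variant generator: apply each swap in both directions."""
    out: set[str] = set()
    for a, b in TOY_SWAPS:
        for src, dst in ((a, b), (b, a)):
            if src in name:
                out.add(name.replace(src, dst))
    out.discard(name)  # a name is never a variant of itself
    return out


def toy_reverse_scan(candidate: str,
                     exists_fn: Callable[[str], bool],
                     threshold: float = 0.75) -> list[dict]:
    """Generate variants, keep those that exist, then apply the threshold."""
    hits = []
    for variant in sorted(toy_variants(candidate)):
        if not exists_fn(variant):
            continue  # variant is not published anywhere; nothing to report
        score = SequenceMatcher(None, candidate, variant).ratio()
        if score >= threshold:
            hits.append({"candidate": candidate,
                         "matched_name": variant,
                         "similarity": score})
    return hits


published = {"numpie", "flask"}  # pretend registry contents
# The spelling-based stand-in scores lower than a phonetic scorer would,
# so this demo passes a slightly lower threshold.
hits = toy_reverse_scan("numpy", published.__contains__, threshold=0.70)
```

Since exists_fn is called once per generated variant, a network-backed check such as the PyPI lookup above is likely to dominate the running time; caching its results is worth considering for large batches.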
Composite Scoring Mode
By default, scanning uses the fast fallback encoder. For higher accuracy, enable CMU-dict-backed composite scoring:
matches = scan(
    candidates=candidates,
    known_names=known,
    use_composite=True,
    composite_weights=(1.0, 2.0, 1.0),  # default: emphasize the edit channel
    edit_mode="max",  # or "length" for monosyllable-aware selection
)
In composite mode, scanning uses the same composite scorer as similarity.composite(): PPC-A, an edit channel, and LCS. By default that edit channel uses max(PLD, PED), which helps short near-homophones that PLD alone would undershoot.
The scanning APIs (check_collision, scan, and scan_with_reverse) also accept edit_mode. This only affects CMU-backed composite scoring; fallback-key scoring is unchanged.
Note
When one or both names are missing from CMU pronunciations, the scanner automatically falls back to the phonetic encoder. It does not treat a genuine composite score of 0.0 as a fallback case.
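The weighting and the fallback rule can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: the text above specifies the three channels, the default weights, and max(PLD, PED) for the edit channel, but combining them as a normalized weighted mean is an assumption, as is the use of None to signal a missing pronunciation. The input scores are made-up values.

```python
from typing import Optional


def toy_composite(ppc: float, pld: float, ped: float, lcs: float,
                  weights: tuple[float, float, float] = (1.0, 2.0, 1.0)) -> float:
    """Weighted mean of three channels (normalization is an assumption).

    The edit channel is max(PLD, PED), matching edit_mode="max".
    """
    edit = max(pld, ped)
    w_ppc, w_edit, w_lcs = weights
    total = w_ppc + w_edit + w_lcs
    return (w_ppc * ppc + w_edit * edit + w_lcs * lcs) / total


def score_with_fallback(composite: Optional[float], fallback: float) -> float:
    """Use the fallback score only when no composite score was computable."""
    # Test with `is None`, not truthiness, so a genuine 0.0 is preserved.
    return fallback if composite is None else composite


# Made-up channel scores, for illustration only.
blended = toy_composite(ppc=0.8, pld=0.5, ped=0.9, lcs=0.6)

kept_zero = score_with_fallback(0.0, fallback=0.7)       # real 0.0 survives
used_fallback = score_with_fallback(None, fallback=0.7)  # missing -> fallback
```

The distinction in score_with_fallback() mirrors the note above: a truthiness check such as `composite or fallback` would silently replace a legitimate score of 0.0.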
Formatting Results
from phonemenal.scanning import scan, format_matches
matches = scan(candidates, known)
print(format_matches(matches))
This prints a human-readable summary with severity indicators.
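If a different layout is needed, the match dictionaries can be formatted by hand. The sketch below relies only on the keys used earlier on this page (candidate, matched_name, similarity, collision_type). The severity labels and the rule that assigns them are hypothetical and do not describe what format_matches() prints; the sample matches are made-up values.

```python
def toy_format(matches: list[dict]) -> str:
    """Render matches as one line each, highest similarity first."""
    lines = []
    for m in sorted(matches, key=lambda m: m["similarity"], reverse=True):
        # Hypothetical severity rule: exact key matches rank highest.
        severity = "HIGH" if m["collision_type"] == "exact_phonetic" else "MEDIUM"
        lines.append(f"[{severity}] {m['candidate']} -> {m['matched_name']} "
                     f"({m['similarity']:.2f})")
    return "\n".join(lines)


# Made-up sample matches, shaped like the dictionaries shown above.
sample = [
    {"candidate": "requsets", "matched_name": "requests",
     "similarity": 0.82, "collision_type": "near_phonetic"},
    {"candidate": "numpie", "matched_name": "numpy",
     "similarity": 1.0, "collision_type": "exact_phonetic"},
]

report = toy_format(sample)
```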
Checking a Single Candidate
For one-off checks (e.g., in a pre-publish hook):
from phonemenal.scanning import build_phonetic_index, check_collision
known = ["numpy", "requests", "flask"]
index = build_phonetic_index(known)
matches = check_collision(
    candidate="numpie",
    known_names=known,
    phonetic_index=index,
    threshold=0.75,
)
Integration Patterns
PyPI Pre-Publish Hook
def check_before_publish(package_name: str, known_packages: list[str]) -> bool:
    """Return True if the name is safe to publish."""
    from phonemenal.scanning import scan
    matches = scan([package_name], known_packages, threshold=0.80)
    if matches:
        print(f"WARNING: '{package_name}' sounds like:")
        for m in matches:
            print(f"  - {m['matched_name']} (score: {m['similarity']:.2f})")
        return False
    return True
Batch Registry Audit
from phonemenal.scanning import scan

# Load all package names from your registry.
# load_package_names() and flag_for_review() are placeholders for your own code.
all_packages = load_package_names()

# Check each package against every other
for i, candidate in enumerate(all_packages):
    others = all_packages[:i] + all_packages[i+1:]
    matches = scan([candidate], others, threshold=0.85)
    if matches:
        flag_for_review(candidate, matches)
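The loop above scans every package against all the others, so if the similarity score is symmetric, each colliding pair is likely to be reported twice (once from each side). A minimal sketch of an alternative is to visit each unordered pair exactly once with itertools.combinations. The scorer is again difflib.SequenceMatcher, a spelling-based stand-in chosen so the example runs without the library; toy_audit() is a hypothetical helper, not part of phonemenal.

```python
from difflib import SequenceMatcher
from itertools import combinations


def toy_audit(names: list[str],
              threshold: float = 0.85) -> list[tuple[str, str, float]]:
    """Score each unordered pair of distinct names once; keep close pairs."""
    flagged = []
    # sorted(set(...)) removes duplicates and makes the output order stable.
    for a, b in combinations(sorted(set(names)), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            flagged.append((a, b, score))
    return flagged


pairs = toy_audit(["requests", "requestz", "flask", "click"])
```

This still performs a quadratic number of comparisons, so for a large registry it may be worth narrowing the candidates first, for example by grouping names on their phonetic key as the index does.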