Scanning for Collisions¶
The scanning module is phonemenal's high-level collision detection pipeline. It combines the phonetic encoder, similarity scoring, and variant generation into a workflow designed for batch phonetic comparison. Common applications include package registry security scanning, brand-name audits, and any scenario where you need to check a set of names for sound-alike collisions.
Concepts¶
Forward Scan¶
Check a list of candidate names against a list of known names:
"Does numpie sound like any existing package?"
Reverse Scan¶
Generate variants of each candidate and check whether those variants exist somewhere (e.g., on PyPI):
"If someone publishes numpy, what sound-alike names might already exist?"
Collision Types¶
| Type | Meaning |
|---|---|
| exact_phonetic | Identical phonetic key (fallback encoder) |
| near_phonetic | Above-threshold similarity score |
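The difference between the two types can be sketched with stand-ins. In the snippet below, toy_key and the difflib-based score are hypothetical placeholders, not phonemenal's encoder or scorer; only the two type labels come from the table above.

```python
from difflib import SequenceMatcher
from typing import Optional


def toy_key(name: str) -> str:
    """Hypothetical phonetic key: lowercase, 'ph' -> 'f', vowel runs -> 'A'."""
    out = []
    for ch in name.lower().replace("ph", "f"):
        if ch in "aeiouy":
            # Collapse consecutive vowels (and 'y') into a single 'A'.
            if not out or out[-1] != "A":
                out.append("A")
        elif ch.isalnum():
            out.append(ch)
    return "".join(out)


def classify(candidate: str, known: str, threshold: float = 0.75) -> Optional[str]:
    """Return a collision type label, or None when the names do not collide."""
    key_c, key_k = toy_key(candidate), toy_key(known)
    if key_c == key_k:
        return "exact_phonetic"  # identical keys
    # Stand-in similarity: difflib ratio over the two toy keys.
    if SequenceMatcher(None, key_c, key_k).ratio() >= threshold:
        return "near_phonetic"  # keys differ but score clears the threshold
    return None


for cand, known in [("numpie", "numpy"), ("requsets", "requests"), ("click", "flask")]:
    print(cand, known, classify(cand, known))
```

The exact check is a plain key comparison, which is why it can be served from an index; the near check needs a score for each pair.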
Building a Phonetic Index¶
Before scanning, build an index for fast exact-key lookups:
from phonemenal.scanning import build_phonetic_index, scan
known = ["numpy", "requests", "flask", "click", "rich"]
index = build_phonetic_index(known)
# {"nAmpY": ["numpy"], "rAkwAsts": ["requests"], ...}
The scan() and scan_with_reverse() functions build this index internally, but you can pre-build it if you're scanning multiple batches against the same known set.
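To see why pre-building pays off, here is a minimal sketch of a key-to-names index that is built once and reused across batches. It assumes nothing about phonemenal's internals: toy_key, build_index, and exact_hits are hypothetical names, and the key is a crude vowel-stripping stand-in for a real phonetic encoder.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def toy_key(name: str) -> str:
    """Hypothetical key: lowercase with vowels and 'y' removed."""
    return "".join(ch for ch in name.lower() if ch not in "aeiouy")


def build_index(names: Iterable[str], key_fn: Callable[[str], str]) -> Dict[str, List[str]]:
    """Map each key to every known name that produces it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        index[key_fn(name)].append(name)
    return dict(index)


def exact_hits(candidate: str, index: Dict[str, List[str]],
               key_fn: Callable[[str], str]) -> List[str]:
    """Single dictionary lookup instead of comparing against every known name."""
    return index.get(key_fn(candidate), [])


known = ["numpy", "requests", "flask", "click", "rich"]
index = build_index(known, toy_key)  # built once

# Two batches share the same index.
for batch in (["numpie", "flasc"], ["rych", "pandas"]):
    for cand in batch:
        print(cand, "->", exact_hits(cand, index, toy_key))
```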
Forward Scanning¶
from phonemenal.scanning import scan
candidates = ["numpie", "requsets", "phlask"]
known = ["numpy", "requests", "flask"]
matches = scan(
    candidates=candidates,
    known_names=known,
    threshold=0.75,  # minimum similarity score
)
for m in matches:
    print(f"{m['candidate']} → {m['matched_name']} "
          f"({m['similarity']:.2f}, {m['collision_type']})")
Threshold Tuning¶
| Threshold | Behavior |
|---|---|
| 0.60 | Catches distant sound-alikes; more false positives |
| 0.75 | Default: balanced precision/recall |
| 0.85 | Strict; only very close matches |
| 1.00 | Exact phonetic key match only |
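The effect of the threshold can be illustrated without the library. The sketch below scores every candidate/known pair once with a stand-in similarity (difflib's ratio on raw spellings, which is an assumption and not phonemenal's score) and then counts how many pairs survive each cutoff from the table.

```python
from difflib import SequenceMatcher


def toy_similarity(a: str, b: str) -> float:
    """Stand-in score on spellings; a phonetic scorer would rank pairs differently."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


candidates = ["numpie", "requsets", "phlask", "numpy"]
known = ["numpy", "requests", "flask"]

# Score each pair once, then filter the same list at several thresholds.
scored = [(c, k, toy_similarity(c, k)) for c in candidates for k in known]


def survivors(threshold: float):
    """Pairs whose score is at or above the threshold."""
    return [(c, k) for c, k, s in scored if s >= threshold]


for t in (0.60, 0.75, 0.85, 1.00):
    print(f"{t:.2f}: {len(survivors(t))} pair(s) kept")
```

Raising the threshold can only shrink the result set, so a practical way to tune it is to score once and inspect what drops out at each step.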
Reverse Scanning¶
Reverse scanning generates variants of each candidate, then checks if those variants exist in the real world:
from phonemenal.scanning import scan_with_reverse
def check_pypi(name: str) -> bool:
    """Check if a package exists on PyPI."""
    import httpx
    resp = httpx.head(f"https://pypi.org/project/{name}/")
    return resp.status_code == 200

matches = scan_with_reverse(
    candidates=["numpy"],
    known_names=["numpy"],
    exists_fn=check_pypi,  # called for each generated variant
    include_morphological=True,  # also try suffix swaps
    threshold=0.75,
)
The reverse scan:
- Generates phonetic variants via variants.generate()
- Optionally generates morphological variants via variants.generate_morphological()
- Calls exists_fn(variant) for each generated name
- Reports any that exist and also score above threshold against the candidate
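The steps above can be sketched offline. Everything in this snippet is a stand-in: toy_variants uses a small hypothetical substitution table rather than variants.generate(), the score is difflib's ratio, the "variant" result key is invented for illustration, and exists_fn is backed by an in-memory set instead of a registry call.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Set

# Hypothetical sound-alike spelling swaps, for illustration only.
SWAPS = [("y", "ie"), ("f", "ph"), ("c", "k")]


def toy_variants(name: str) -> Set[str]:
    """Apply each swap in both directions wherever the source spelling occurs."""
    variants = set()
    for a, b in SWAPS:
        for src, dst in ((a, b), (b, a)):
            if src in name:
                variants.add(name.replace(src, dst))
    variants.discard(name)
    return variants


def reverse_scan(candidate: str, exists_fn: Callable[[str], bool],
                 threshold: float = 0.75) -> List[Dict[str, object]]:
    hits = []
    for variant in sorted(toy_variants(candidate)):   # generate variants
        if not exists_fn(variant):                    # keep only names that exist
            continue
        score = SequenceMatcher(None, candidate, variant).ratio()
        if score >= threshold:                        # and that also clear the threshold
            hits.append({"candidate": candidate, "variant": variant, "similarity": score})
    return hits


published = {"flasc", "phlask"}  # offline stand-in for a registry
hits = reverse_scan("flask", lambda name: name in published)
for h in hits:
    print(h)
```

Both variants exist in the toy registry, but only one clears the threshold under this stand-in score, which shows that existence and similarity are separate filters.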
Composite Scoring Mode¶
By default, scanning uses the fast fallback encoder. For higher accuracy, enable CMU-dict-backed composite scoring:
matches = scan(
    candidates=candidates,
    known_names=known,
    use_composite=True,
    composite_weights=(1.0, 1.5, 1.0),  # emphasize PLD
)
Note
When composite scoring returns 0.0 (words not in the CMU dict), the scanner automatically falls back to the phonetic encoder, so out-of-dictionary names still receive a score.
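The fallback rule in the note can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the three component scores are made-up numbers, combining them as a weighted mean is a guess at what the weights do, and toy_fallback uses difflib in place of the phonetic encoder.

```python
from difflib import SequenceMatcher
from typing import Tuple

# Made-up component scores for one pair, standing in for a dictionary-backed lookup.
TOY_COMPONENTS = {("flask", "flasque"): (0.9, 0.8, 1.0)}
WEIGHTS = (1.0, 1.5, 1.0)  # same shape as composite_weights above


def weighted_mean(scores, weights) -> float:
    """Assumed combination rule: weight each component, then normalize."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def toy_composite(a: str, b: str) -> float:
    """Return 0.0 when the pair has no dictionary-backed components."""
    parts = TOY_COMPONENTS.get((a, b))
    return weighted_mean(parts, WEIGHTS) if parts else 0.0


def toy_fallback(a: str, b: str) -> float:
    """Stand-in for the fast encoder-based score."""
    return SequenceMatcher(None, a, b).ratio()


def score_with_fallback(a: str, b: str) -> Tuple[float, str]:
    """Prefer the composite score; use the fallback only when it is 0.0."""
    score = toy_composite(a, b)
    if score == 0.0:
        return toy_fallback(a, b), "fallback"
    return score, "composite"


print(score_with_fallback("flask", "flasque"))
print(score_with_fallback("numpie", "numpy"))
```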
Formatting Results¶
from phonemenal.scanning import scan, format_matches
matches = scan(candidates, known)
print(format_matches(matches))
This prints a human-readable summary with severity indicators.
Checking a Single Candidate¶
For one-off checks (e.g., in a pre-publish hook):
from phonemenal.scanning import build_phonetic_index, check_collision
known = ["numpy", "requests", "flask"]
index = build_phonetic_index(known)
matches = check_collision(
    candidate="numpie",
    known_names=known,
    phonetic_index=index,
    threshold=0.75,
)
Integration Patterns¶
PyPI Pre-Publish Hook¶
def check_before_publish(package_name: str, known_packages: list[str]) -> bool:
    """Return True if the name is safe to publish."""
    from phonemenal.scanning import scan
    matches = scan([package_name], known_packages, threshold=0.80)
    if matches:
        print(f"WARNING: '{package_name}' sounds like:")
        for m in matches:
            print(f"  - {m['matched_name']} (score: {m['similarity']:.2f})")
        return False
    return True
Batch Registry Audit¶
from phonemenal.scanning import scan
# Load all package names from your registry
all_packages = load_package_names()
# Check each package against every other
for i, candidate in enumerate(all_packages):
    others = all_packages[:i] + all_packages[i+1:]
    matches = scan([candidate], others, threshold=0.85)
    if matches:
        flag_for_review(candidate, matches)
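The loop above compares every pair twice (A against B, then B against A). If the similarity score is symmetric, which is an assumption here, each unordered pair only needs to be scored once. The sketch below shows that pattern with itertools.combinations and a difflib stand-in for the score; audit_pairs is a hypothetical helper, not part of phonemenal.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def toy_similarity(a: str, b: str) -> float:
    """Stand-in score on spellings, assumed symmetric for this sketch."""
    return SequenceMatcher(None, a, b).ratio()


def audit_pairs(names: List[str], threshold: float = 0.85) -> List[Tuple[str, str, float]]:
    """Score each unordered pair once and keep those at or above the threshold."""
    flagged = []
    for a, b in combinations(names, 2):
        score = toy_similarity(a, b)
        if score >= threshold:
            flagged.append((a, b, score))
    return flagged


names = ["requests", "requestz", "flask", "rich"]
flagged = audit_pairs(names)
pair_count = len(list(combinations(names, 2)))
print(f"{pair_count} comparisons instead of {len(names) * (len(names) - 1)}")
for a, b, score in flagged:
    print(f"{a} ~ {b} ({score:.3f})")
```

This halves the number of comparisons but is still quadratic in the registry size, so for large registries an exact-key index lookup remains the cheaper first pass.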