Scanning for Collisions¶

The scanning module is phonemenal's high-level collision detection pipeline. It combines the phonetic encoder, similarity scoring, and variant generation into a workflow designed for batch phonetic comparison. Common applications include package registry security scanning, brand-name audits, and any scenario where you need to check a set of names for sound-alike collisions.

Concepts¶

Forward Scan¶

Check a list of candidate names against a list of known names:

"Does numpie sound like any existing package?"

Reverse Scan¶

Generate variants of each candidate and check whether those variants already exist in an external source (e.g., on PyPI):

"If someone publishes numpy, what sound-alike names might already exist?"

Collision Types¶

Type	Meaning
`exact_phonetic`	Identical phonetic key (fallback encoder)
`near_phonetic`	Above-threshold similarity score
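As a rough illustration of how the two types differ, the sketch below classifies a pair of names with a toy key function and `difflib.SequenceMatcher` as a stand-in similarity. `toy_key` and `classify` are hypothetical helpers written for this example; they are not phonemenal's encoder or scorer, and the inclusive `>=` comparison is an assumption.

```python
from difflib import SequenceMatcher
from typing import Optional


def toy_key(name: str) -> str:
    """Toy stand-in for a phonetic key (NOT phonemenal's encoder)."""
    text = name.lower().replace("ph", "f")
    # Keep the first letter, drop vowels from the rest.
    kept = text[:1] + "".join(ch for ch in text[1:] if ch not in "aeiouy")
    # Collapse runs of the same letter, e.g. "ss" -> "s".
    out = []
    for ch in kept:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def classify(candidate: str, known: str, threshold: float = 0.75) -> Optional[str]:
    """Return a collision type name, or None when the pair does not collide."""
    key_a, key_b = toy_key(candidate), toy_key(known)
    if key_a == key_b:
        return "exact_phonetic"  # identical keys
    # Assumption: a score equal to the threshold counts as a match.
    if SequenceMatcher(None, key_a, key_b).ratio() >= threshold:
        return "near_phonetic"
    return None


print(classify("phlask", "flask"))  # keys are both "flsk"
print(classify("flasks", "flask"))  # keys differ by one trailing letter
print(classify("django", "flask"))  # keys share no letters
```

The point of the split is cost: an exact key match needs only an equality check, while a near match needs a pairwise similarity computation.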

Building a Phonetic Index¶

Before scanning, build an index for fast exact-key lookups:

from phonemenal.scanning import build_phonetic_index

known = ["numpy", "requests", "flask", "click", "rich"]
index = build_phonetic_index(known)
# {"nAmpY": ["numpy"], "rAkwAsts": ["requests"], ...}

The scan() and scan_with_reverse() functions build this index internally, but you can pre-build it if you're scanning multiple batches against the same known set.
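Conceptually, such an index is a mapping from phonetic key to the names that share that key, which is why exact-key lookups are fast. The sketch below shows the idea with a hypothetical `toy_key` function and a hypothetical `build_toy_index` helper; the real keys produced by phonemenal will look different.

```python
from collections import defaultdict


def toy_key(name: str) -> str:
    """Toy stand-in for the phonetic encoder (NOT phonemenal's)."""
    text = name.lower().replace("ph", "f")
    # Keep the first letter, drop vowels from the rest.
    return text[:1] + "".join(ch for ch in text[1:] if ch not in "aeiouy")


def build_toy_index(names):
    """Group names by key so names sharing a key land in one bucket."""
    index = defaultdict(list)
    for name in names:
        index[toy_key(name)].append(name)
    return dict(index)


index = build_toy_index(["flask", "phlask", "click"])

# An exact-key lookup is a single dict access, whatever the index size.
hits = index.get(toy_key("flusk"), [])
print(index)
print(hits)
```

Building the mapping once and reusing it avoids re-encoding every known name for each batch of candidates.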

Forward Scanning¶

from phonemenal.scanning import scan

candidates = ["numpie", "requsets", "phlask"]
known = ["numpy", "requests", "flask"]

matches = scan(
    candidates=candidates,
    known_names=known,
    threshold=0.75,          # minimum similarity score
)

for m in matches:
    print(f"{m['candidate']} → {m['matched_name']} "
          f"({m['similarity']:.2f}, {m['collision_type']})")
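A candidate can match more than one known name. The sketch below keeps only the highest-scoring match per candidate, assuming each match is a dict with the keys used in the loop above. The helper name `best_match_per_candidate` and the sample scores are invented for illustration.

```python
def best_match_per_candidate(matches):
    """Return {candidate: match} keeping the highest-similarity match."""
    best = {}
    for m in matches:
        current = best.get(m["candidate"])
        if current is None or m["similarity"] > current["similarity"]:
            best[m["candidate"]] = m
    return best


# Illustrative data only; real scores come from scan().
sample = [
    {"candidate": "numpie", "matched_name": "numba", "similarity": 0.78,
     "collision_type": "near_phonetic"},
    {"candidate": "numpie", "matched_name": "numpy", "similarity": 1.0,
     "collision_type": "exact_phonetic"},
    {"candidate": "phlask", "matched_name": "flask", "similarity": 1.0,
     "collision_type": "exact_phonetic"},
]

best = best_match_per_candidate(sample)
for name, m in best.items():
    print(f"{name}: {m['matched_name']} ({m['similarity']:.2f})")
```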

Threshold Tuning¶

Threshold	Behavior
`0.60`	Catches distant sound-alikes; more false positives
`0.75`	Default — balanced precision/recall
`0.85`	Strict; only very close matches
`1.00`	Exact phonetic key match only
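One way to choose a threshold is to scan once at a permissive value and then count how many matches would survive at stricter settings. The sketch below does that over a list of match dicts; the helper `count_by_threshold` and the sample scores are hypothetical, and the inclusive `>=` comparison is an assumption suggested by the `1.00` row.

```python
def count_by_threshold(matches, thresholds):
    """Count matches whose similarity is at or above each threshold."""
    return {
        t: sum(1 for m in matches if m["similarity"] >= t)
        for t in thresholds
    }


# Illustrative scores only; real values come from scan(..., threshold=0.60).
sample = [
    {"similarity": 1.0},
    {"similarity": 0.88},
    {"similarity": 0.79},
    {"similarity": 0.62},
]

counts = count_by_threshold(sample, [0.60, 0.75, 0.85, 1.00])
for threshold, n in counts.items():
    print(f"{threshold:.2f}: {n} match(es)")
```

A sharp drop between two adjacent thresholds is a useful place to review matches by hand before settling on a value.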

Reverse Scanning¶

Reverse scanning generates variants of each candidate, then checks whether those variants already exist in an external source such as a package registry:

from phonemenal.scanning import scan_with_reverse

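import httpx

def check_pypi(name: str) -> bool:
    """Return True if a project page for `name` exists on PyPI."""
    url = f"https://pypi.org/project/{name}/"
    try:
        # httpx does not follow redirects by default; PyPI may redirect
        # to the normalized project name.
        resp = httpx.head(url, follow_redirects=True, timeout=10.0)
    except httpx.HTTPError:
        return False  # network failure is treated as "not found" here
    return resp.status_code == 200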

matches = scan_with_reverse(
    candidates=["numpy"],
    known_names=["numpy"],
    exists_fn=check_pypi,         # called for each generated variant
    include_morphological=True,   # also try suffix swaps
    threshold=0.75,
)

The reverse scan:

1. Generates phonetic variants via variants.generate()
2. Optionally generates morphological variants via variants.generate_morphological()
3. Calls exists_fn(variant) for each generated name
4. Reports any that exist and also score above threshold against the candidate
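The steps above can be sketched end to end with toy stand-ins. In this example `toy_variants`, `toy_similarity`, and `toy_reverse_scan` are hypothetical helpers, the similarity is plain character overlap rather than a phonetic score, the morphological step is omitted, and an in-memory set plays the role of the registry.

```python
from difflib import SequenceMatcher


def toy_variants(name):
    """Hypothetical variant generator: a few sound-alike spelling swaps."""
    swaps = [("y", "ie"), ("f", "ph"), ("c", "k"), ("k", "c")]
    out = set()
    for old, new in swaps:
        if old in name:
            out.add(name.replace(old, new))
    out.discard(name)
    return sorted(out)


def toy_similarity(a, b):
    """Character-level similarity, standing in for a phonetic score."""
    return SequenceMatcher(None, a, b).ratio()


def toy_reverse_scan(candidate, exists_fn, threshold=0.75):
    found = []
    for variant in toy_variants(candidate):       # step 1: generate variants
        if not exists_fn(variant):                # step 3: does it exist?
            continue
        score = toy_similarity(candidate, variant)
        if score >= threshold:                    # step 4: keep close matches
            found.append((variant, round(score, 2)))
    return found


published = {"flasc", "django"}  # pretend registry contents
result = toy_reverse_scan("flask", published.__contains__)
print(toy_variants("flask"))
print(result)
```

Because `exists_fn` is just a callable, the same flow works with a network lookup, a local name list, or a cached wrapper around either.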

Composite Scoring Mode¶

By default, scanning uses the fast fallback encoder. For higher accuracy, enable CMU-dict-backed composite scoring:

matches = scan(
    candidates=candidates,
    known_names=known,
    use_composite=True,
    composite_weights=(1.0, 1.5, 1.0),  # emphasize PLD
)

Note

When composite scoring returns 0.0 (for example, because a word is not in the CMU dict), the scanner automatically falls back to the phonetic encoder, so out-of-dictionary names are still scored.
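The fallback rule can be pictured with a small sketch. How phonemenal actually combines `composite_weights` is not described here, so the normalized weighted mean below is purely an assumption, and `weighted_mean`, `score_pair`, and the component scores are hypothetical.

```python
def weighted_mean(scores, weights):
    """Normalized weighted mean (assumed combination rule, for illustration)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def score_pair(components, weights, fallback_score):
    """Use the composite score unless it is 0.0, then use the fallback."""
    if components is None:  # e.g. a word missing from the dictionary
        composite = 0.0
    else:
        composite = weighted_mean(components, weights)
    return composite if composite > 0.0 else fallback_score


weights = (1.0, 1.5, 1.0)

# Middle component is weaker, so a heavier middle weight lowers the result.
in_dict = score_pair((0.9, 0.6, 0.9), weights, fallback_score=0.66)
out_of_dict = score_pair(None, weights, fallback_score=0.66)

print(f"{in_dict:.3f}")
print(f"{out_of_dict:.3f}")
```

With equal weights the first pair would average 0.8; raising the middle weight pulls the result toward that component's lower score.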

Formatting Results¶

from phonemenal.scanning import scan, format_matches

# candidates and known are the lists defined under Forward Scanning
matches = scan(candidates, known)
print(format_matches(matches))

This prints a human-readable summary with severity indicators.
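If you need a custom report format, the match dicts can be formatted directly. The sketch below is not `format_matches()` and does not reproduce its output; `toy_format`, the severity bands, and the sample scores are all hypothetical.

```python
def severity(similarity):
    """Map a score to a label; the cut-offs are arbitrary example values."""
    if similarity >= 0.95:
        return "HIGH"
    if similarity >= 0.85:
        return "MEDIUM"
    return "LOW"


def toy_format(matches):
    """Render one line per match, highest similarity first."""
    ordered = sorted(matches, key=lambda m: m["similarity"], reverse=True)
    return "\n".join(
        f"[{severity(m['similarity'])}] {m['candidate']} -> "
        f"{m['matched_name']} ({m['similarity']:.2f})"
        for m in ordered
    )


# Illustrative data only; real matches come from scan().
sample = [
    {"candidate": "requsets", "matched_name": "requests", "similarity": 0.80},
    {"candidate": "numpie", "matched_name": "numpy", "similarity": 1.0},
]

report = toy_format(sample)
print(report)
```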

Checking a Single Candidate¶

For one-off checks (e.g., in a pre-publish hook):

from phonemenal.scanning import build_phonetic_index, check_collision

known = ["numpy", "requests", "flask"]
index = build_phonetic_index(known)

matches = check_collision(
    candidate="numpie",
    known_names=known,
    phonetic_index=index,
    threshold=0.75,
)

Integration Patterns¶

PyPI Pre-Publish Hook¶

def check_before_publish(package_name: str, known_packages: list[str]) -> bool:
    """Return True if the name is safe to publish."""
    from phonemenal.scanning import scan

    matches = scan([package_name], known_packages, threshold=0.80)
    if matches:
        print(f"WARNING: '{package_name}' sounds like:")
        for m in matches:
            print(f"  - {m['matched_name']} (score: {m['similarity']:.2f})")
        return False
    return True
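A publish hook usually signals failure through its process exit code. The sketch below maps a boolean check to an exit code; `hook_exit_code` and `stub_is_safe` are hypothetical, and the stub stands in for `check_before_publish` so the example runs without phonemenal installed.

```python
from typing import Callable


def hook_exit_code(package_name: str, is_safe: Callable[[str], bool]) -> int:
    """Return 0 when the name is safe to publish, 1 to block the publish."""
    return 0 if is_safe(package_name) else 1


def stub_is_safe(name: str) -> bool:
    """Stand-in check: pretend only 'numpie' collides with something."""
    return name != "numpie"


blocked = hook_exit_code("numpie", stub_is_safe)
allowed = hook_exit_code("mytool", stub_is_safe)
print(blocked, allowed)

# In a real hook script you might end with something like:
#   sys.exit(hook_exit_code(name, lambda n: check_before_publish(n, known)))
```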

Batch Registry Audit¶

from phonemenal.scanning import scan

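# Load all package names from your registry.
# load_package_names() and flag_for_review() are placeholders for your own code.
all_packages = load_package_names()

# Check each package against every other. The work grows quadratically
# with the number of packages, so large registries can take a long time.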
for i, candidate in enumerate(all_packages):
    others = all_packages[:i] + all_packages[i+1:]
    matches = scan([candidate], others, threshold=0.85)
    if matches:
        flag_for_review(candidate, matches)
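Because the loop above compares every package with every other, a colliding pair will normally be reported twice, once from each side (assuming the similarity is symmetric). A minimal sketch for collapsing those duplicates follows; `unique_pairs` and the sample scores are hypothetical.

```python
def unique_pairs(matches):
    """Collapse (a, b) and (b, a) into one entry, keeping the best score."""
    seen = {}
    for m in matches:
        # Sort the two names so both directions produce the same key.
        pair = tuple(sorted((m["candidate"], m["matched_name"])))
        if pair not in seen or m["similarity"] > seen[pair]:
            seen[pair] = m["similarity"]
    return seen


# Illustrative data only; real matches come from scan().
sample = [
    {"candidate": "flask", "matched_name": "phlask", "similarity": 1.0},
    {"candidate": "phlask", "matched_name": "flask", "similarity": 1.0},
    {"candidate": "numpie", "matched_name": "numpy", "similarity": 0.9},
]

pairs = unique_pairs(sample)
for (a, b), score in pairs.items():
    print(f"{a} <-> {b}: {score:.2f}")
```

Collecting all matches first and deduplicating before review halves the number of items a person has to look at.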