Scanning for Collisions

The scanning module is phonemenal's high-level collision detection pipeline. It combines the phonetic encoder, similarity scoring, and variant generation into a workflow designed for batch phonetic comparison. Common applications include package registry security scanning, brand-name audits, and any scenario where you need to check a set of names for sound-alike collisions.

Concepts

Forward Scan

Check a list of candidate names against a list of known names:

"Does numpie sound like any existing package?"

Reverse Scan

Generate variants of each candidate and check whether those variants exist somewhere (e.g., on PyPI):

"If someone publishes numpy, what sound-alike names might already exist?"

Collision Types

Type              Meaning
exact_phonetic    Identical phonetic key (fallback encoder)
near_phonetic     Above-threshold similarity score
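
To make the two collision types concrete, here is a minimal sketch of how a pair of names could be classified. The `toy_key` function and the `difflib`-based score are stand-ins invented for this illustration; they are not phonemenal's encoder or similarity measure, and `classify` is a hypothetical helper rather than part of the library.

```python
from difflib import SequenceMatcher
from typing import Optional


def toy_key(name: str) -> str:
    """Hypothetical stand-in for a phonetic key: lowercase, fold a few
    sound-alike spellings, then collapse repeated letters."""
    key = name.lower()
    for src, dst in (("ph", "f"), ("ie", "y"), ("ck", "k")):
        key = key.replace(src, dst)
    collapsed = []
    for ch in key:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


def classify(candidate: str, known: str, threshold: float = 0.75) -> Optional[str]:
    """Return 'exact_phonetic', 'near_phonetic', or None for a name pair."""
    key_a, key_b = toy_key(candidate), toy_key(known)
    if key_a == key_b:
        return "exact_phonetic"
    # Stand-in similarity: character-level match ratio between the two keys.
    if SequenceMatcher(None, key_a, key_b).ratio() >= threshold:
        return "near_phonetic"
    return None


print(classify("numpie", "numpy"))       # keys are identical
print(classify("requsets", "requests"))  # keys differ but are close
print(classify("django", "flask"))       # unrelated names
```

The exact-key check runs first because it is the cheaper test; the similarity score is only needed when the keys differ.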

Building a Phonetic Index

Before scanning, build an index for fast exact-key lookups:

from phonemenal.scanning import build_phonetic_index

known = ["numpy", "requests", "flask", "click", "rich"]
index = build_phonetic_index(known)
# Maps each phonetic key to the names that share it, e.g.
# {"nAmpY": ["numpy"], "rAkwAsts": ["requests"], ...}

The scan() and scan_with_reverse() functions build this index internally, but you can pre-build it if you're scanning multiple batches against the same known set.
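
The idea behind pre-building can be sketched without the library: group names by key once, then reuse the mapping for every batch. In this sketch `toy_key` and `build_index` are hypothetical stand-ins, not phonemenal's encoder or its `build_phonetic_index`.

```python
from collections import defaultdict


def toy_key(name: str) -> str:
    """Hypothetical stand-in for the library's phonetic encoder."""
    return name.lower().replace("ph", "f").replace("ie", "y")


def build_index(names: list[str]) -> dict[str, list[str]]:
    """Group names by key so an exact-key lookup is a single dict access."""
    index = defaultdict(list)
    for name in names:
        index[toy_key(name)].append(name)
    return dict(index)


index = build_index(["numpy", "flask", "phlask"])

# Two batches share the same index; it is built only once.
for batch in (["numpie"], ["flasc", "phlask"]):
    for candidate in batch:
        hits = index.get(toy_key(candidate), [])
        print(candidate, "->", hits)
```

Names that encode to the same key land in the same list, which is why one lookup can return several known names.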


Forward Scanning

from phonemenal.scanning import scan

candidates = ["numpie", "requsets", "phlask"]
known = ["numpy", "requests", "flask"]

matches = scan(
    candidates=candidates,
    known_names=known,
    threshold=0.75,          # minimum similarity score
)

for m in matches:
    print(f"{m['candidate']} -> {m['matched_name']} "
          f"({m['similarity']:.2f}, {m['collision_type']})")

Threshold Tuning

Threshold    Behavior
0.60         Catches distant sound-alikes; more false positives
0.75         Default; balanced precision/recall
0.85         Strict; only very close matches
1.00         Exact phonetic key match only
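
The effect of the threshold can be illustrated with a small self-contained sketch. The `score` function below uses `difflib` on the raw spellings as a stand-in, so its numbers will not match phonemenal's phonetic scores; `matches_at` is a hypothetical helper. The point is only that raising the threshold can never add matches.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not the library's phonetic score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matches_at(candidates, known, threshold):
    """Return (candidate, known, score) triples at or above the threshold."""
    return [
        (c, k, s)
        for c in candidates
        for k in known
        if (s := score(c, k)) >= threshold
    ]


candidates = ["numpie", "requsets", "flask"]
known = ["numpy", "requests", "flask"]

for threshold in (0.60, 0.75, 0.85, 1.00):
    found = matches_at(candidates, known, threshold)
    print(f"{threshold:.2f}: {len(found)} match(es)")
```

Running a sweep like this over a sample of your own names is a quick way to pick a threshold before committing to one in a pipeline.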

Reverse Scanning

Reverse scanning generates variants of each candidate, then checks whether those variants already exist (for example, as published packages):

from phonemenal.scanning import scan_with_reverse

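def check_pypi(name: str) -> bool:
    """Return True if a project page for `name` exists on PyPI."""
    import httpx
    try:
        # Follow redirects and bound the wait so one slow lookup cannot hang the scan.
        resp = httpx.head(f"https://pypi.org/project/{name}/",
                          follow_redirects=True, timeout=10.0)
    except httpx.HTTPError:
        return False  # treat network failures as "not found"
    return resp.status_code == 200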

matches = scan_with_reverse(
    candidates=["numpy"],
    known_names=["numpy"],
    exists_fn=check_pypi,         # called for each generated variant
    include_morphological=True,   # also try suffix swaps
    threshold=0.75,
)

The reverse scan:

  1. Generates phonetic variants via variants.generate()
  2. Optionally generates morphological variants via variants.generate_morphological()
  3. Calls exists_fn(variant) for each generated name
  4. Reports any that exist and also score above threshold against the candidate
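
The four steps above can be sketched end to end without network access. Everything here is a stand-in: `toy_variants` is a hypothetical generator (not `variants.generate()`), `toy_score` uses `difflib` rather than a phonetic score, and a set plays the role of `exists_fn`. The result keys mirror the ones used in the forward-scan example.

```python
from difflib import SequenceMatcher
from typing import Callable


def toy_variants(name: str) -> set[str]:
    """Hypothetical variant generator: swap a few sound-alike spellings."""
    swaps = (("y", "ie"), ("f", "ph"), ("k", "c"))
    variants = {name.replace(old, new) for old, new in swaps if old in name}
    variants.discard(name)
    return variants


def toy_score(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not a phonetic score."""
    return SequenceMatcher(None, a, b).ratio()


def reverse_scan(
    candidates: list[str],
    exists_fn: Callable[[str], bool],
    threshold: float = 0.75,
) -> list[dict]:
    results = []
    for candidate in candidates:
        # Steps 1-2: generate variants (sorted for deterministic output).
        for variant in sorted(toy_variants(candidate)):
            # Step 3: keep only variants that already exist.
            if not exists_fn(variant):
                continue
            # Step 4: report those that also clear the threshold.
            similarity = toy_score(candidate, variant)
            if similarity >= threshold:
                results.append({
                    "candidate": candidate,
                    "matched_name": variant,
                    "similarity": similarity,
                })
    return results


published = {"numpie", "flasc"}  # pretend registry contents
found = reverse_scan(["numpy", "flask"], exists_fn=published.__contains__)
print([m["matched_name"] for m in found])
```

With this stand-in score, `numpie` exists in the pretend registry but scores below 0.75 against `numpy`, so only `flasc` is reported; passing `threshold=0.70` reports both. Because `exists_fn` is called once per variant, a real implementation that hits the network may benefit from caching its results.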

Composite Scoring Mode

By default, scanning uses the fast fallback encoder. For higher accuracy, enable CMU-dict-backed composite scoring:

matches = scan(
    candidates=candidates,
    known_names=known,
    use_composite=True,
    composite_weights=(1.0, 1.5, 1.0),  # emphasize PLD
)

Note

When composite scoring returns 0.0 (for example, when a word is not in the CMU dictionary), the scanner automatically falls back to the phonetic encoder, so out-of-dictionary names are still scored.
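
The fallback rule can be sketched as follows. This assumes the three weights are combined as a weighted mean of three component scores, which the text above does not state, and both function names are hypothetical; treat it as an illustration of the control flow, not of phonemenal's internals.

```python
def weighted_mean(scores, weights):
    """Combine component scores using per-component weights (assumed scheme)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def score_with_fallback(components, weights, fallback_score):
    """Prefer the composite score; use the fallback when it is 0.0.

    `components` is None when a word has no dictionary entry, in which
    case the composite score is taken to be 0.0.
    """
    composite = weighted_mean(components, weights) if components else 0.0
    return composite if composite > 0.0 else fallback_score


weights = (1.0, 1.5, 1.0)  # middle component weighted most heavily

# Both words found in the dictionary: the composite score is used.
print(score_with_fallback((0.8, 0.6, 0.8), weights, fallback_score=0.5))

# A word is missing from the dictionary: the fallback score is used.
print(score_with_fallback(None, weights, fallback_score=0.5))
```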


Formatting Results

from phonemenal.scanning import scan, format_matches

matches = scan(candidates, known)
print(format_matches(matches))

This prints a human-readable summary with severity indicators.


Checking a Single Candidate

For one-off checks (e.g., in a pre-publish hook):

from phonemenal.scanning import build_phonetic_index, check_collision

known = ["numpy", "requests", "flask"]
index = build_phonetic_index(known)

matches = check_collision(
    candidate="numpie",
    known_names=known,
    phonetic_index=index,
    threshold=0.75,
)

Integration Patterns

PyPI Pre-Publish Hook

def check_before_publish(package_name: str, known_packages: list[str]) -> bool:
    """Return True if the name is safe to publish."""
    from phonemenal.scanning import scan

    matches = scan([package_name], known_packages, threshold=0.80)
    if matches:
        print(f"WARNING: '{package_name}' sounds like:")
        for m in matches:
            print(f"  - {m['matched_name']} (score: {m['similarity']:.2f})")
        return False
    return True

Batch Registry Audit

from phonemenal.scanning import scan

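# Load all package names from your registry.
# load_package_names() and flag_for_review() are placeholders you supply.
all_packages = load_package_names()

# Check each package against every other (cost grows quadratically with registry size)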
for i, candidate in enumerate(all_packages):
    others = all_packages[:i] + all_packages[i+1:]
    matches = scan([candidate], others, threshold=0.85)
    if matches:
        flag_for_review(candidate, matches)
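
If the similarity measure is symmetric, each unordered pair only needs to be scored once, which roughly halves the work of the loop above. The sketch below shows that idea with `itertools.combinations`; `toy_score` is a `difflib` stand-in rather than a phonetic score, and `audit_pairs` is a hypothetical helper, not part of phonemenal.

```python
from difflib import SequenceMatcher
from itertools import combinations


def toy_score(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not a phonetic score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def audit_pairs(names, threshold=0.85):
    """Score each unordered pair of distinct names exactly once."""
    flagged = []
    for a, b in combinations(sorted(set(names)), 2):
        similarity = toy_score(a, b)
        if similarity >= threshold:
            flagged.append((a, b, similarity))
    return flagged


pairs = audit_pairs(["requests", "requets", "flask", "numpy"])
for a, b, similarity in pairs:
    print(f"{a} ~ {b} ({similarity:.2f})")
```

For large registries, grouping names by phonetic key first and comparing only within or between nearby groups would cut the number of pairs further.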