scanning
High-level collision detection pipeline combining the fallback encoder, similarity scoring, and variant generation.
When use_composite=True, the scanning APIs support composite tuning via composite_weights and edit_mode.
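To illustrate what composite_weights expresses, the sketch below combines three per-channel scores as a normalized weighted mean. The combination rule and the function name weighted_composite are assumptions made for illustration: this page does not state the formula similarity.composite() uses, and edit_mode is not modeled here.

```python
def weighted_composite(ppc: float, edit: float, lcs: float,
                       weights: tuple[float, float, float] = (1.0, 2.0, 1.0)) -> float:
    """Hypothetical combination rule: normalized weighted mean of three scores.

    This is NOT taken from the library; it only shows how a (ppc, edit, lcs)
    weight tuple can shift the balance between channels.
    """
    w_ppc, w_edit, w_lcs = weights
    total = w_ppc + w_edit + w_lcs
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_ppc * ppc + w_edit * edit + w_lcs * lcs) / total


# With the default weights the edit channel counts twice as much as the others.
default_score = weighted_composite(0.9, 0.6, 0.8)                 # 2.9 / 4
equal_score = weighted_composite(0.9, 0.6, 0.8, (1.0, 1.0, 1.0))  # 2.3 / 3
```

Under this assumed rule, raising the middle weight pulls the result toward the edit score, which is why the default tuple yields a lower value than equal weights in this example.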
phonemenal.scanning
Scan candidate names against a set of known names for phonetic collisions.
This is the high-level scanning pipeline that combines the fallback phonetic encoder, similarity scoring, and variant generation into a complete workflow.
Two scan modes:
- Forward: check candidates against known names (fast, daemon pipeline)
- Reverse: also generate variants of candidates and check if they exist in a provided lookup function (e.g. a registry check such as a PyPI HEAD request)
build_phonetic_index(names: list[str]) -> dict[str, list[str]]
Build a mapping from phonetic key → list of names.
Used for fast exact-key lookups before falling back to pairwise scoring.
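The idea of a key-to-names index can be sketched as follows. This is a minimal illustration, not the library's implementation: toy_key() is a hypothetical stand-in for the fallback phonetic encoder, and build_index_sketch() is a hypothetical name.

```python
from collections import defaultdict


def toy_key(name: str) -> str:
    """Hypothetical, deliberately crude phonetic key (not the library's encoder).

    Normalizes a few spellings, keeps the first letter, then drops later
    vowels, non-letters, and immediately repeated letters.
    """
    s = name.lower().replace("ph", "f").replace("ck", "k").replace("c", "k")
    key = s[:1]
    for ch in s[1:]:
        if ch.isalpha() and ch not in "aeiouy" and ch != key[-1:]:
            key += ch
    return key


def build_index_sketch(names: list[str]) -> dict[str, list[str]]:
    """Group names by phonetic key, preserving input order within each group."""
    index: dict[str, list[str]] = defaultdict(list)
    for name in names:
        index[toy_key(name)].append(name)
    return dict(index)


index = build_index_sketch(["flask", "phlask", "numpy", "requests"])
# "flask" and "phlask" share the key "flsk", so they land in the same bucket.
```

Names that share a bucket can be reported as exact phonetic collisions with a single dictionary lookup, leaving pairwise scoring for the remaining names.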
Source code in phonemenal/scanning.py
check_collision(candidate: str, known_names: list[str], phonetic_index: dict[str, list[str]], *, threshold: float = DEFAULT_THRESHOLD, use_composite: bool = False, composite_weights: tuple[float, float, float] = (1.0, 2.0, 1.0), edit_mode: str = 'max') -> list[dict]
Check a candidate name for phonetic collisions against known names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| candidate | str | Name to check. | required |
| known_names | list[str] | List of known/legitimate names. | required |
| phonetic_index | dict[str, list[str]] | Pre-built index from build_phonetic_index(). | required |
| threshold | float | Minimum similarity score to flag (0.0–1.0). | DEFAULT_THRESHOLD |
| use_composite | bool | If True, use CMU-dict-backed composite scoring (PPC + Edit + LCS) instead of fallback key similarity. Slower but more accurate. | False |
| composite_weights | tuple[float, float, float] | Weights for composite scoring (ppc, edit, lcs). | (1.0, 2.0, 1.0) |
| edit_mode | str | Composite edit-channel selector passed to similarity.composite(). | 'max' |
Returns a list of match dicts sorted by similarity descending. Each dict contains:
- candidate: input name
- matched_name: the known name
- similarity: 0.0–1.0 score
- candidate_key: phonetic key of candidate
- matched_key: phonetic key of match
- collision_type: "exact_phonetic" | "near_phonetic"
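The two-stage lookup and the shape of the returned dicts can be sketched as follows. Everything here is an assumption made for illustration: toy_key() stands in for the fallback encoder, difflib.SequenceMatcher stands in for the library's similarity scoring, exact key matches are given a similarity of 1.0, and 0.8 is a placeholder threshold rather than the value of DEFAULT_THRESHOLD.

```python
from difflib import SequenceMatcher


def toy_key(name: str) -> str:
    """Hypothetical crude phonetic key (not the library's encoder)."""
    s = name.lower().replace("ph", "f").replace("ck", "k").replace("c", "k")
    key = s[:1]
    for ch in s[1:]:
        if ch.isalpha() and ch not in "aeiouy" and ch != key[-1:]:
            key += ch
    return key


def check_collision_sketch(candidate, known_names, phonetic_index, threshold=0.8):
    """Stage 1: exact key lookup in the index. Stage 2: pairwise key scoring."""
    cand_key = toy_key(candidate)
    matches, seen = [], set()

    def record(name, key, score, kind):
        matches.append({
            "candidate": candidate, "matched_name": name, "similarity": score,
            "candidate_key": cand_key, "matched_key": key, "collision_type": kind,
        })
        seen.add(name)

    # Stage 1: names that share the candidate's key (assumed to score 1.0 here).
    for name in phonetic_index.get(cand_key, []):
        if name != candidate:
            record(name, cand_key, 1.0, "exact_phonetic")

    # Stage 2: score the remaining names key-against-key.
    for name in known_names:
        if name == candidate or name in seen:
            continue
        key = toy_key(name)
        score = SequenceMatcher(None, cand_key, key).ratio()
        if score >= threshold:
            record(name, key, score, "near_phonetic")

    return sorted(matches, key=lambda m: m["similarity"], reverse=True)


known = ["flask", "numpy", "requests"]
index = {}
for n in known:
    index.setdefault(toy_key(n), []).append(n)

exact = check_collision_sketch("phlask", known, index)
near = check_collision_sketch("flasks", known, index)
```

In this sketch "phlask" hits "flask" through the index alone, while "flasks" has a different key and is only caught by the pairwise stage.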
Source code in phonemenal/scanning.py
scan(candidates: list[str], known_names: list[str], *, threshold: float = DEFAULT_THRESHOLD, use_composite: bool = False, composite_weights: tuple[float, float, float] = (1.0, 2.0, 1.0), edit_mode: str = 'max') -> list[dict]
Scan candidate names for phonetic collisions with known names.
Forward scan only — checks each candidate against the known set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| candidates | list[str] | Names to check. | required |
| known_names | list[str] | Known/legitimate names to compare against. | required |
| threshold | float | Minimum similarity score to flag. | DEFAULT_THRESHOLD |
| use_composite | bool | Use CMU-dict-backed composite scoring. | False |
| composite_weights | tuple[float, float, float] | Weights for composite scoring. | (1.0, 2.0, 1.0) |
| edit_mode | str | Composite edit-channel selector passed to similarity.composite(). | 'max' |
Returns a list of all matches across all candidates.
Source code in phonemenal/scanning.py
scan_with_reverse(candidates: list[str], known_names: list[str], *, exists_fn: Optional[Callable[[str], bool]] = None, threshold: float = DEFAULT_THRESHOLD, use_composite: bool = False, composite_weights: tuple[float, float, float] = (1.0, 2.0, 1.0), edit_mode: str = 'max', include_morphological: bool = True) -> list[dict]
Scan candidates with forward AND reverse checking.
- Forward: candidate vs known names (same as scan()).
- Reverse: generate variants of each candidate, check if they exist via exists_fn, and score any that do against the candidate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| candidates | list[str] | Names to check. | required |
| known_names | list[str] | Known/legitimate names. | required |
| exists_fn | Optional[Callable[[str], bool]] | Callable that returns True if a name exists in an external system (e.g. a PyPI HEAD request). If None, reverse scanning is skipped. | None |
| threshold | float | Minimum similarity score. | DEFAULT_THRESHOLD |
| use_composite | bool | Use CMU-dict composite scoring. | False |
| composite_weights | tuple[float, float, float] | Weights for composite scoring. | (1.0, 2.0, 1.0) |
| edit_mode | str | Composite edit-channel selector passed to similarity.composite(). | 'max' |
| include_morphological | bool | Include morphological variants in reverse scan. | True |
Returns all matches (forward + reverse).
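The exists_fn contract (a callable taking a name and returning a bool) and the reverse step can be sketched without network access. The helpers make_exists_fn(), toy_variants(), and reverse_hits() are hypothetical names; the variant rules are a toy substitute for the library's variant generation, and an in-memory set stands in for a real registry check such as an HTTP HEAD request.

```python
from typing import Callable, Iterable


def make_exists_fn(registry: Iterable[str]) -> Callable[[str], bool]:
    """Build an exists_fn-style callable backed by an in-memory set.

    A production callable might query a package registry instead; the only
    requirement shown on this page is the signature (str) -> bool.
    """
    known = {name.lower() for name in registry}

    def exists(name: str) -> bool:
        return name.lower() in known

    return exists


def toy_variants(name: str) -> set[str]:
    """Hypothetical variant generator: a few sound-alike spelling swaps."""
    swaps = [("ph", "f"), ("f", "ph"), ("c", "k"), ("k", "c")]
    variants = {name.replace(old, new) for old, new in swaps if old in name}
    variants.discard(name)
    return variants


def reverse_hits(candidate: str, exists_fn: Callable[[str], bool]) -> list[str]:
    """Return the variants of candidate that exists_fn reports as existing."""
    return sorted(v for v in toy_variants(candidate) if exists_fn(v))


exists_fn = make_exists_fn(["flask", "numpy"])
hits = reverse_hits("phlask", exists_fn)
```

Here the variant "flask" is found in the stand-in registry even if it were absent from known_names, which is the case reverse scanning is meant to cover. A network-backed exists_fn would benefit from caching and timeouts, since it may be called once per generated variant.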
Source code in phonemenal/scanning.py
format_matches(matches: list[dict]) -> str
Format collision results for display.
Returns a human-readable string summarizing all matches.
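The exact layout produced by format_matches() is not shown on this page. As a rough idea of what a formatter over the documented match dict keys could look like, here is a sketch with a hypothetical name and a made-up output layout.

```python
def format_matches_sketch(matches: list[dict]) -> str:
    """Hypothetical formatter; the layout is illustrative, not the library's."""
    if not matches:
        return "No collisions found."
    lines = [f"{len(matches)} collision(s) found:"]
    # Show the strongest matches first.
    for m in sorted(matches, key=lambda m: m["similarity"], reverse=True):
        lines.append(
            f"  {m['candidate']} ~ {m['matched_name']} "
            f"({m['collision_type']}, similarity {m['similarity']:.2f})"
        )
    return "\n".join(lines)


sample = [{
    "candidate": "phlask", "matched_name": "flask", "similarity": 1.0,
    "candidate_key": "flsk", "matched_key": "flsk",
    "collision_type": "exact_phonetic",
}]
report = format_matches_sketch(sample)
```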