splitting

Compound word splitting and homophone permutation recombination.

phonemenal.splitting

Given a concatenated word like "bluevoyage", split it into components ["blue", "voyage"], find homophones for each, and recombine all permutations ("blewvoyage", "bleuvoyage", etc.).

split(word: str) -> list[str]

Split a compound/concatenated word into component words.

Uses wordninja's probabilistic word segmentation, which is based on English Wikipedia unigram frequencies.

Parameters:

Name  Type  Description                        Default
word  str   Input string (e.g. "bluevoyage").  required

Returns a list of lowercase component words (e.g. ["blue", "voyage"]). Returns [word.lower()] if no meaningful split is found.

Source code in phonemenal/splitting.py
import wordninja


def split(word: str) -> list[str]:
    """Split a compound/concatenated word into component words.

    Uses wordninja's probabilistic word segmentation, which is based on
    English Wikipedia unigram frequencies.

    Args:
        word: Input string (e.g. "bluevoyage").

    Returns list of lowercase component words (e.g. ["blue", "voyage"]).
    Returns [word.lower()] if no meaningful split is found.
    """
    parts = wordninja.split(word.lower())
    if len(parts) <= 1:
        return [word.lower()]
    return parts
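
To make the idea of frequency-based segmentation concrete, here is a minimal sketch of a dynamic-programming splitter that prefers common words. It is not wordninja's code: the vocabulary, the cost formula, and the names `VOCAB`, `COST`, and `segment` are assumptions chosen for illustration.

```python
import math

# Hypothetical toy vocabulary, ordered from most to least frequent.
# A real segmenter uses a far larger frequency-ranked word list.
VOCAB = ["the", "a", "blue", "age", "voyage", "boy", "toy"]

# Zipf-style cost: the rarer the word (higher rank), the higher its cost.
COST = {w: math.log((rank + 1) * math.log(len(VOCAB))) for rank, w in enumerate(VOCAB)}
MAX_LEN = max(len(w) for w in VOCAB)
UNKNOWN_COST = 1e9  # effectively forbids pieces outside the vocabulary


def segment(text: str) -> list[str]:
    """Split text into the lowest-cost sequence of vocabulary words."""
    text = text.lower()
    # best[i] = (cost of the best split of text[:i], length of its last piece)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for k in range(1, min(i, MAX_LEN) + 1):
            piece = text[i - k:i]
            candidates.append((best[i - k][0] + COST.get(piece, UNKNOWN_COST), k))
        best.append(min(candidates))
    # Walk backwards through the table to recover the chosen pieces.
    parts = []
    i = len(text)
    while i > 0:
        k = best[i][1]
        parts.append(text[i - k:i])
        i -= k
    return parts[::-1]


print(segment("bluevoyage"))
```

With this toy vocabulary, "bluevoyage" splits into "blue" and "voyage" because that pair is the only split made entirely of known words. A string with no known pieces, such as "xyz", comes back as a single element, much like the `[word.lower()]` fallback in `split`.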

component_homophones(word: str) -> dict[str, list[str]]

Find exact homophones for each component of a compound word.

Returns a dict mapping each component to the sorted list of its homophones. Components without homophones map to [component] (just themselves).

Source code in phonemenal/splitting.py
def component_homophones(word: str) -> dict[str, list[str]]:
    """Find exact homophones for each component of a compound word.

    Returns dict mapping each component → list of its homophones.
    Components without homophones map to [component] (just themselves).
    """
    parts = split(word)
    result: dict[str, list[str]] = {}

    for part in parts:
        pronunciations = get_phonemes(part)
        homophones: set[str] = set()
        for pron in pronunciations:
            matches = find_words_by_pronunciation(pron)
            homophones.update(matches)

        if homophones:
            result[part] = sorted(homophones)
        else:
            result[part] = [part]

    return result
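
The lookup pattern above (collect every pronunciation of a component, then gather all words that share any of them) can be sketched with a small inverted index. The pronunciation table, the phoneme symbols, and the names `PRONUNCIATIONS`, `BY_SOUND`, and `homophones_of` are hypothetical stand-ins. How `get_phonemes` and `find_words_by_pronunciation` are actually implemented is not shown in this page.

```python
from collections import defaultdict

# Hypothetical toy pronunciation table using ARPAbet-like symbols.
# A real lookup would draw on a full pronouncing dictionary.
PRONUNCIATIONS: dict[str, list[tuple[str, ...]]] = {
    "blue": [("B", "L", "UW")],
    "blew": [("B", "L", "UW")],
    "bleu": [("B", "L", "UW")],
    "voyage": [("V", "OY", "IH", "JH")],
    "read": [("R", "IY", "D"), ("R", "EH", "D")],
    "reed": [("R", "IY", "D")],
    "red": [("R", "EH", "D")],
}

# Inverted index: pronunciation -> every spelling pronounced that way.
BY_SOUND: dict[tuple[str, ...], set[str]] = defaultdict(set)
for spelling, prons in PRONUNCIATIONS.items():
    for pron in prons:
        BY_SOUND[pron].add(spelling)


def homophones_of(part: str) -> list[str]:
    """Return the sorted spellings that share any pronunciation with part."""
    found: set[str] = set()
    for pron in PRONUNCIATIONS.get(part, []):
        found |= BY_SOUND[pron]
    # Words missing from the table fall back to themselves.
    return sorted(found) if found else [part]


print(homophones_of("blue"))
print(homophones_of("read"))
```

In this sketch a word with two pronunciations, such as "read", picks up homophones from both, which is why the results are accumulated in a set before sorting.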

homophone_permutations(word: str, *, include_original: bool = True, max_permutations: int = 100) -> list[str]

Generate all homophone recombinations of a compound word.

Splits the word, finds homophones for each component, and joins one choice per component to produce every combination, up to max_permutations results.

Parameters:

Name              Type  Description                                       Default
word              str   Input compound word.                              required
include_original  bool  Whether to include the original word in results.  True
max_permutations  int   Cap on results to avoid combinatorial explosion.  100

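Returns a list of recombined strings. If the word does not split into more than one component, returns [word.lower()], or [] when include_original is False.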

Example

"bluevoyage" → ["bluevoyage", "blewvoyage", "bleuvoyage", ...]

Source code in phonemenal/splitting.py
from itertools import product


def homophone_permutations(
    word: str,
    *,
    include_original: bool = True,
    max_permutations: int = 100,
) -> list[str]:
    """Generate all homophone recombinations of a compound word.

    Splits the word, finds homophones for each component, and produces
    all combinations by joining them.

    Args:
        word: Input compound word.
        include_original: Whether to include the original word in results.
        max_permutations: Cap on results to avoid combinatorial explosion.

    Returns list of recombined strings.

    Example:
        "bluevoyage" → ["bluevoyage", "blewvoyage", "bleuvoyage", ...]
    """
    parts = split(word)
    if len(parts) <= 1:
        return [word.lower()] if include_original else []

    comp_homophones = component_homophones(word)
    homophone_lists = [comp_homophones[part] for part in parts]

    # product() yields combinations lazily, so the max_permutations check
    # in the loop below bounds the work without counting them up front.

    results: list[str] = []
    original = word.lower()

    for combo in product(*homophone_lists):
        joined = "".join(combo)
        if not include_original and joined == original:
            continue
        results.append(joined)
        if len(results) >= max_permutations:
            break

    return results
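
The capped recombination can also be sketched with `itertools.islice`, which stops a lazy Cartesian product after a fixed number of results. The homophone lists and the helper name `recombine` are hypothetical. The sketch mirrors the behaviour described above and is not the library's code.

```python
from itertools import islice, product
from math import prod

# Hypothetical homophone lists for the components "blue" and "voyage".
homophone_lists = [["bleu", "blew", "blue"], ["voyage"]]

# Size of the full combination space: the product of the list lengths.
total = prod(len(options) for options in homophone_lists)


def recombine(
    lists: list[list[str]],
    original: str,
    *,
    include_original: bool = True,
    limit: int = 100,
) -> list[str]:
    """Join one choice per component, lazily, stopping after limit results."""
    joined = ("".join(combo) for combo in product(*lists))
    if not include_original:
        joined = (s for s in joined if s != original)
    return list(islice(joined, limit))


print(total)
print(recombine(homophone_lists, "bluevoyage"))
print(recombine(homophone_lists, "bluevoyage", include_original=False))
```

Because results follow the order of the input lists, a small limit can cut off the original spelling even when `include_original` is True: in this sketch `limit=1` yields only "bleuvoyage".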