<!-- SPDX-License-Identifier: CC0-1.0 -->

# Handle slugify rule

> Locked behavior for converting a display name into a URL-safe
> `actors.handle`. M2.5 posture; future revisions go through ALIP.

## TL;DR

```
slugify(displayName) -> handle (ASCII, [a-z0-9_-]{2..39})
```

Five scripts with a published transliteration standard get meaningful
Latin transliterations: Cyrillic (ISO 9), Greek (ELOT 743), Hebrew
(ISO 259), Arabic (ISO 233), Thai (ISO 11940-2, RTGS-leaning variant).
Ideographic scripts (CJK), emoji, and other scripts without a
transliteration table here fall back to
`agent-{deterministic-6-hex-hash}`. Pure-punctuation input falls back
to the literal `agent`.

`reserveHandle` adds entropy on database-uniqueness collision; that
path is unchanged.

## Why ISO standards

The operator's directive (PR 63 / r6 U-01): "ISO standards where
they exist (Cyrillic ISO 9, Greek ELOT 743, Hebrew ISO 259, Arabic
ISO 233, Thai ISO 11940). Fall back to agent-{entropy} for
ideographic scripts (CJK), emoji, and scripts without ISO
transliteration."

ISO-based mapping avoids the Anglocentric trap (e.g., 日本語 →
"nihongo" is one of several romanizations, none official). For
scripts that have a published international standard, we use the
standard. For scripts without one (CJK ideographs, Korean Hangul
at this scope, emoji), we don't pretend.

## The five tables

The full per-character tables live in
`src/lib/handles/transliterate.ts`. They are inspected during code
review; each entry maps a lowercase source character to its Latin
transliteration per the cited standard.
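
As a rough illustration of that shape, the sketch below uses a hypothetical seven-entry excerpt (`CYRILLIC_SAMPLE`) and a hypothetical helper (`transliterateSample`); neither name comes from `transliterate.ts`, and the real tables cover every letter of each script. It assumes input is lowercased before lookup and that characters without an entry pass through unchanged.

```typescript
// Hypothetical excerpt for illustration only; the real tables live in
// src/lib/handles/transliterate.ts and are far larger.
const CYRILLIC_SAMPLE: ReadonlyMap<string, string> = new Map<string, string>([
  ["б", "b"],
  ["е", "e"],
  ["о", "o"],
  ["г", "g"],
  ["р", "r"],
  ["а", "a"],
  ["д", "d"],
]);

// Look up each lowercase character in the table. Characters with no entry
// pass through unchanged so that later pipeline steps can handle them.
function transliterateSample(
  input: string,
  table: ReadonlyMap<string, string>,
): string {
  let out = "";
  for (const ch of input.toLowerCase()) {
    out += table.get(ch) ?? ch;
  }
  return out;
}

console.log(transliterateSample("Београд", CYRILLIC_SAMPLE)); // "beograd"
```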

### ISO 9 (Cyrillic, 1995)

Strict 1:1 letter mapping. Covers Russian + Ukrainian + Belarusian
+ Bulgarian + Macedonian + Serbian. The transliteration step itself
is reversible per character; reversibility is lost only downstream,
when diacritics are stripped.

Examples:

| Source | Transliteration | Final handle |
|---|---|---|
| `Преимущественно` | `preimuŝestvenno` | `preimusestvenno` |
| `Київ` | `kiïv` | `kiiv` |
| `Београд` | `beograd` | `beograd` |

### ELOT 743 (Greek, 1982)

Letter-by-letter forward transliteration table. Accents on vowels
(ά, έ, ή, ί, ό, ύ, ώ) collapse to the bare vowel via NFKD +
diacritic strip downstream.

Examples:

| Source | Transliteration | Final handle |
|---|---|---|
| `Αθήνα` | `athína` | `athina` |
| `Καλημέρα` | `kaliméra` | `kalimera` |
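
The NFKD + diacritic strip step can be sketched as follows. The helper name `stripDiacritics` is hypothetical, and the sketch assumes "diacritic strip" means removing Unicode combining marks (`\p{M}`) after decomposition; the actual implementation in `src/lib/handles.ts` may differ in detail.

```typescript
// Decompose precomposed letters (e.g. "í" becomes "i" + U+0301), then
// remove every combining mark. Assumption: this mirrors the downstream
// step the tables rely on; it is not copied from the real code.
function stripDiacritics(s: string): string {
  return s.normalize("NFKD").replace(/\p{M}/gu, "");
}

console.log(stripDiacritics("athína")); // "athina"
console.log(stripDiacritics("preimuŝestvenno")); // "preimusestvenno"
```

The same step accounts for the "Final handle" column in the other tables wherever the transliteration carries a precomposed diacritic (for example `ḥ` → `h`).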

### ISO 259 (Hebrew, 1994; consonant-only variant)

22 consonants mapped to Latin with point diacritics where the
standard prescribes (e.g., ḥet → ḥ). Vowel pointing (niqqud) is
combining-mark Unicode and gets stripped by the downstream
non-ASCII filter rather than the transliteration step.

Examples:

| Source | Transliteration | Final handle |
|---|---|---|
| `שלום` | `šlwm` | `slwm` |

### ISO 233 (Arabic, 1993; simplified variant)

28 consonants + hamza-bearing alif variants. Vowel marks
(fatḥa, kasra, ḍamma, shadda, sukūn) are combining marks and get
stripped downstream.

Examples:

| Source | Transliteration | Final handle |
|---|---|---|
| `عربي-agent` | `ʿrby-agent` | `rby-agent` |
| `مرحبا` | `mrḥbā` | `mrhba` |
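
The first row shows a character (`ʿ`, a modifier letter rather than a combining mark) that survives diacritic stripping and is removed only by the non-ASCII filter. A minimal sketch of such a filter, assuming it simply drops every code unit outside the ASCII range; `dropNonAscii` is a hypothetical name, and the real cleanup step may do more (for example, handling whitespace or repeated separators):

```typescript
// Remove everything outside U+0000..U+007F. Without the `u` flag the
// regex works on UTF-16 code units, so both halves of a surrogate pair
// (emoji, rare CJK) are removed as well.
function dropNonAscii(s: string): string {
  return s.replace(/[^\x00-\x7F]/g, "");
}

console.log(dropNonAscii("ʿrby-agent")); // "rby-agent"
```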

### ISO 11940-2 (Thai, 2007; popular/RTGS-leaning variant)

The "library transliteration" variant of ISO 11940-2 is over-specified
for handle slugs; we use the RTGS-leaning mapping in Annex B. The
table includes consonants, standalone vowels, and the combining
vowels (U+0E34..U+0E39). Tone marks (U+0E48..U+0E4B) are combining
and get stripped downstream.

Examples:

| Source | Transliteration | Final handle |
|---|---|---|
| `กรุงเทพ` | `krungethph` | `krungethph` |

## Fallback for unfit scripts

When the cleanup step produces an empty string but the original
input had letters, numbers, or symbols (`\p{L}|\p{N}|\p{S}`):

```
slugify(input) = `agent-${first 6 hex chars of sha256(input)}`
```

This is **deterministic** — the same input produces the same
first-attempt handle. Uniqueness across different users with the
same display name is handled by `reserveHandle`'s collision-retry
path, which adds random entropy on database-uniqueness violation.

Examples:

| Source | Handle |
|---|---|
| `日本語エージェント🤖` | `agent-{hash}` |
| `中文用户名` | `agent-{hash}` |
| `안녕하세요` | `agent-{hash}` |
| `🚀🛸👽` | `agent-{hash}` |
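
A minimal sketch of the hash fallback, using Node's built-in `node:crypto`. The function name `fallbackHandle` is hypothetical, and the sketch assumes the input is hashed as UTF-8 without prior normalization or trimming; the spec above does not state either way.

```typescript
import { createHash } from "node:crypto";

// Build `agent-` plus the first 6 hex characters of sha256(input).
// Assumption: the raw display name is hashed as UTF-8 bytes, unmodified.
function fallbackHandle(input: string): string {
  const digest = createHash("sha256").update(input, "utf8").digest("hex");
  return `agent-${digest.slice(0, 6)}`;
}

// The same input always yields the same first-attempt handle.
console.log(fallbackHandle("中文用户名") === fallbackHandle("中文用户名")); // true
```

Six hex characters give 16^6 (about 16.7 million) possible suffixes, so collisions between distinct display names are possible; per the text above, `reserveHandle` is what resolves them.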

## Degenerate input

Empty / whitespace-only / pure-punctuation input has no meaningful
content to encode. These fall back to the literal `agent` (no hash
suffix); `reserveHandle` resolves collisions via random entropy on
INSERT.

Examples:

| Source | Handle |
|---|---|
| `""` | `agent` |
| `"   "` | `agent` |
| `"!!!"` | `agent` |
| `"..."` | `agent` |
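
The split between the literal `agent` and the hashed fallback can be sketched as a single predicate over the original input, using the character classes named in the fallback section. `fallbackKind` and `HAS_CONTENT` are hypothetical names; the sketch only applies once the cleanup step has already produced an empty string.

```typescript
// True when the input contains at least one letter, number, or symbol.
const HAS_CONTENT = /[\p{L}\p{N}\p{S}]/u;

type FallbackKind = "hashed" | "literal";

// Decide which fallback applies to an input whose cleaned form is empty.
function fallbackKind(original: string): FallbackKind {
  return HAS_CONTENT.test(original) ? "hashed" : "literal";
}

console.log(fallbackKind("!!!")); // "literal"
console.log(fallbackKind("🚀🛸👽")); // "hashed"
```

If the implementation applies the pattern exactly as written, symbol characters such as `+` (Unicode category Sm) count as content, so an input like `+++` would take the hashed path rather than the literal one.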

## Out of scope at M2.5

- **Hangul (Korean)**: no ISO transliteration in our table at M2.5.
  Falls back to `agent-{hash}`. Year-1 candidate for adding McCune-
  Reischauer (or Revised Romanization) per a future ALIP.
- **CJK ideographs**: no ISO transliteration at M2.5. Falls back to
  `agent-{hash}`. Year-1 candidate for IDN handles + URL-encoded
  paths (parallel work, not just slugify).
- **Mixed-script inputs**: each character is transliterated against
  the script-specific table (or passed through if no table matches).
  There is no special handling for confusable-character attacks
  (homograph spoofing); the only mitigation is `reserveHandle`'s
  collision-retry path, plus a follow-up audit if scale demands it.
- **Reverse lookup**: the slugify transformation is one-way. Given a
  handle, you cannot recover the original display name. This is by
  design — display names are user-mutable; handles are stable
  identifiers.

## Year-1 candidates

- IDN-aware URL routing so `/u/日本語エージェント` can work (i.e.,
  store the unicode handle, encode at path layer). Requires schema
  + URL routing changes; deferred.
- Hangul transliteration via Revised Romanization of Korean (RR,
  2000). Adds a 6th table.
- Optional vowel handling in Hebrew + Arabic (currently stripped
  via combining-mark filter; could be transliterated as superscript
  diacritics → ASCII).

## Verification

Pinning behavior:

- `src/lib/handles/transliterate.ts` — the 5 tables.
- `src/lib/handles.ts` — the `slugify` function orchestrating the
  pipeline.
- `tests/handle-slugify-unicode.test.ts` — 21 fast tests covering
  each ISO script + the 4 auditor probes + degenerate inputs +
  invariants (deterministic / length / character-set).
- `tests/handles.test.ts` — the original 7 fast tests, unchanged.

Any modification to the tables MUST update this spec doc + add
fixtures to the test file. CI fails if any of the above are out
of sync (verified by PR 65's claims-registry framework).
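
The invariants named above (deterministic / length / character-set) could be checked with a helper along these lines. This is a sketch, not the contents of `tests/handle-slugify-unicode.test.ts`: `findInvariantViolations` and `HANDLE_SHAPE` are hypothetical names, and the function under test is passed in as a parameter so the sketch does not assume how `slugify` is exported.

```typescript
type Slugify = (displayName: string) => string;

// Handle shape from the TL;DR: ASCII, [a-z0-9_-], 2 to 39 characters.
const HANDLE_SHAPE = /^[a-z0-9_-]{2,39}$/;

// Return a human-readable message for every violated invariant.
function findInvariantViolations(
  slugify: Slugify,
  samples: readonly string[],
): string[] {
  const violations: string[] = [];
  for (const sample of samples) {
    const first = slugify(sample);
    const second = slugify(sample);
    if (first !== second) {
      violations.push(`non-deterministic: ${JSON.stringify(sample)}`);
    }
    if (!HANDLE_SHAPE.test(first)) {
      violations.push(
        `bad shape: ${JSON.stringify(sample)} -> ${JSON.stringify(first)}`,
      );
    }
  }
  return violations;
}
```

A test would call it with the real `slugify` and the fixture inputs from this document, then assert that the returned array is empty.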
