If you’ve ever built a login system, you’ve probably dealt with homoglyph attacks: someone registers аdmin with a Cyrillic “а” (U+0430) instead of Latin “a” (U+0061). The characters are visually identical, and if your system accepts Unicode identifiers, you have an impersonation vector.
The Unicode Consortium maintains an official defence against this: confusables.txt, part of Unicode Technical Standard #39 (Security Mechanisms). It’s a flat file mapping ~6,565 characters to their visual equivalents. Cyrillic а → a, Greek ο → o, Cherokee Ꭺ → A, and thousands more.
It’s worth noting that confusables.txt is designed for detection, not normalization. TR39 itself says skeleton mappings are “not suitable for display to users” and “should definitely not be used as a normalization of identifiers.” The correct use is to check whether a submitted identifier contains characters that visually mimic Latin letters, and if so, reject it — not to silently remap those characters and let it through.
Here’s the wrinkle. If your application also runs NFKC normalization (which it should: ENS, GitHub, and Unicode IDNA all require it), 31 entries in confusables.txt map a character to a different target than NFKC does. If you’re building a confusable map for use after NFKC normalization, those entries are unreachable, because NFKC has already transformed the character before your confusable check sees it.
What NFKC normalization does
NFKC (Normalization Form Compatibility Composition) is Unicode’s way of collapsing “compatibility variants” to their canonical form. Fullwidth letters → ASCII, superscripts → normal digits, ligatures → component letters, mathematical styled characters → plain characters:
Ｈello → Hello (fullwidth → ASCII)
ﬁnance → finance (fi ligature → fi)
𝐇ello → Hello (mathematical bold → plain)
This is the right first step for slug/handle validation. You want Ｈello (with a fullwidth H) to become hello after NFKC and lowercasing, not to be rejected as containing confusable characters. NFKC handles hundreds of these compatibility forms automatically.
NFKC and confusables serve different purposes. NFKC is for normalization: producing a canonical form for storage and comparison. Confusables detection is for security: flagging characters that could fool a human reader. They answer different questions about the same input, and in a well-designed system they’re applied separately rather than chained together to produce a single output.
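A quick way to see the split is to run a few strings through `String.prototype.normalize`. The sketch below writes the inputs as escape sequences so the code points are unambiguous. The helper name `toSlugForm` is made up for this example and is not taken from any library.

```typescript
// Hypothetical helper: NFKC, then lowercase, as a slug pipeline might do.
function toSlugForm(input: string): string {
  return input.normalize("NFKC").toLowerCase();
}

// Compatibility variants collapse to plain ASCII.
const fullwidthForm = toSlugForm("\uFF28ello");   // fullwidth H + "ello"
const ligatureForm = toSlugForm("\uFB01nance");   // fi ligature + "nance"
const mathBoldForm = toSlugForm("\u{1D407}ello"); // mathematical bold H + "ello"

// Cyrillic "а" (U+0430) is not a compatibility variant, so NFKC leaves it alone.
const cyrillicAdmin = toSlugForm("\u0430dmin");

console.log(fullwidthForm, ligatureForm, mathBoldForm); // hello finance hello
console.log(cyrillicAdmin === "admin");                 // false
```

The last line is the reason a separate confusable check is still needed: normalization alone never turns the Cyrillic lookalike into Latin "a".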
The conflict
Here’s what nobody seems to talk about: confusables.txt and NFKC sometimes map the same character to different Latin letters.
The classic example is the Long S (ſ, U+017F). This is the archaic letterform you see in 18th-century printing, where “Congress” was printed as “Congreſs.”
- confusables.txt maps ſ → f (visually, ſ does look like f)
- NFKC normalization maps ſ → s (linguistically, ſ is s)
Both are defensible mappings, but they answer different questions. TR39 asks “what does this look like?” NFKC asks “what does this mean?”
Why does this matter? If you normalize with NFKC first (converting ſ to s), then check the confusable map, the ſ→f entry never fires. NFKC already handled the character. Without NFKC, the confusable entry is correct as visual detection: ſ genuinely looks like f, and flagging it is the right call for security. But if you’re building a filtered confusable map for use downstream of NFKC (as namespace-guard does), these entries are dead code and should be removed to keep the map clean.
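The ordering effect is easy to reproduce. In this sketch, `confusableExcerpt` is a one-entry stand-in for a real confusable map and `confusableHits` is a hypothetical helper. Neither name comes from namespace-guard.

```typescript
// One-entry excerpt of a confusable map, keyed by source character.
// Only the long-s row is included; a real map has far more entries.
const confusableExcerpt = new Map<string, string>([["\u017F", "f"]]);

// Returns the confusable targets hit by the characters of the input.
function confusableHits(input: string): string[] {
  return Array.from(input)
    .filter((ch) => confusableExcerpt.has(ch))
    .map((ch) => confusableExcerpt.get(ch) as string);
}

const rawWord = "Congre\u017Fs";                // "Congreſs"
const normalizedWord = rawWord.normalize("NFKC");

console.log(normalizedWord);                    // Congress
console.log(confusableHits(rawWord));           // one hit: "f"
console.log(confusableHits(normalizedWord));    // no hits: the entry never fires
```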
The full list: 31 entries
This isn’t a single edge case. I found 31 characters where confusables.txt and NFKC disagree:
The Long S
| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|---|---|---|---|---|
| ſ | Latin Small Letter Long S | U+017F | f | s |
TR39 sees the visual resemblance to f. But linguistically (and in NFKC), ſ is an archaic form of s. The NFKC mapping is unambiguously correct for any application that cares about meaning rather than just shape.
Capital I → l (16 variants)
confusables.txt maps capital I (and all its styled variants) to lowercase l. This is the classic Il1 ambiguity: in many sans-serif fonts, uppercase I, lowercase l, and digit 1 are nearly indistinguishable.
NFKC normalizes styled variants back to plain I (U+0049), a different character from the confusable target l (U+006C):
| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|---|---|---|---|---|
| ℐ | Script Capital I | U+2110 | l | I |
| ℑ | Black-Letter Capital I (Fraktur) | U+2111 | l | I |
| Ⅰ | Roman Numeral One | U+2160 | l | I |
| Ｉ | Fullwidth Latin Capital I | U+FF29 | l | I |
| 𝐈 | Mathematical Bold Capital I | U+1D408 | l | I |
| 𝐼 | Mathematical Italic Capital I | U+1D43C | l | I |
| 𝑰 | Mathematical Bold Italic Capital I | U+1D470 | l | I |
| 𝓘 | Mathematical Bold Script Capital I | U+1D4D8 | l | I |
| 𝕀 | Mathematical Double-Struck Capital I | U+1D540 | l | I |
| 𝕴 | Mathematical Bold Fraktur Capital I | U+1D574 | l | I |
| 𝖨 | Mathematical Sans-Serif Capital I | U+1D5A8 | l | I |
| 𝗜 | Mathematical Sans-Serif Bold Capital I | U+1D5DC | l | I |
| 𝘐 | Mathematical Sans-Serif Italic Capital I | U+1D610 | l | I |
| 𝙄 | Mathematical Sans-Serif Bold Italic Capital I | U+1D644 | l | I |
| 𝙸 | Mathematical Monospace Capital I | U+1D678 | l | I |
| | Outlined Latin Capital Letter I | U+1CCDE | l | I |
TR39 says all of these look like “l”. It’s right: they often do, especially in sans-serif fonts. NFKC normalizes them all to plain I (U+0049). If your system runs NFKC before confusable detection, the confusable entry for these characters is unreachable. NFKC has already transformed them to plain I, which won’t match the original source character in your confusable map.
Digit 0 → O (7 variants)
Same pattern with digit zero. confusables.txt maps styled zeros to the letter O (visually similar), but NFKC collapses them to the digit “0”:
| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|---|---|---|---|---|
| 𝟎 | Mathematical Bold Digit Zero | U+1D7CE | O | 0 |
| 𝟘 | Mathematical Double-Struck Digit Zero | U+1D7D8 | O | 0 |
| 𝟢 | Mathematical Sans-Serif Digit Zero | U+1D7E2 | O | 0 |
| 𝟬 | Mathematical Sans-Serif Bold Digit Zero | U+1D7EC | O | 0 |
| 𝟶 | Mathematical Monospace Digit Zero | U+1D7F6 | O | 0 |
| 🯰 | Segmented Digit Zero | U+1FBF0 | O | 0 |
| | Outlined Digit Zero | U+1CCF0 | O | 0 |
NFKC correctly preserves the digit identity. Note that ASCII 0 (U+0030) itself has a confusable entry mapping to O, so the visual confusion between zero and O is caught regardless of whether NFKC runs first.
Digit 1 → l (7 variants)
And the same again with digit one, where confusables.txt sees “l” but NFKC correctly maps to “1”:
| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|---|---|---|---|---|
| 𝟏 | Mathematical Bold Digit One | U+1D7CF | l | 1 |
| 𝟙 | Mathematical Double-Struck Digit One | U+1D7D9 | l | 1 |
| 𝟣 | Mathematical Sans-Serif Digit One | U+1D7E3 | l | 1 |
| 𝟭 | Mathematical Sans-Serif Bold Digit One | U+1D7ED | l | 1 |
| 𝟷 | Mathematical Monospace Digit One | U+1D7F7 | l | 1 |
| 🯱 | Segmented Digit One | U+1FBF1 | l | 1 |
| | Outlined Digit One | U+1CCF1 | l | 1 |
Why this happens
This isn’t a bug in either standard. TR39 and NFKC have different purposes, and they were designed independently:
confusables.txt answers: “What does this character visually resemble?” It’s designed for the skeleton algorithm, which compares two strings for visual similarity. Mathematical Bold I (𝐈) looks like lowercase l in most fonts. That’s a legitimate visual observation.
NFKC normalization answers: “What is the canonical form of this character?” Mathematical Bold I is semantically the letter I rendered in a bold mathematical style. NFKC strips the styling, yielding plain I.
These are orthogonal concerns. Confusability is about what humans see. NFKC is about what machines should store. Neither mapping is wrong; they answer different questions. But if you use both (which you should), it’s worth knowing where they diverge, especially if you’re building a filtered confusable map for use after NFKC.
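The tables above can be spot-checked against your own runtime. This is a minimal sketch: `classify` and the `samples` list are hypothetical, the comparison is case-sensitive by assumption, and the two contrasting rows at the end are illustrative inputs rather than rows from the tables.

```typescript
type Classification = "conflict" | "redundant" | "kept";

// Compare a TR39 target with what NFKC produces for the same code point.
function classify(sourceCp: number, tr39Target: string): Classification {
  const nfkc = String.fromCodePoint(sourceCp).normalize("NFKC");
  // NFKC did not resolve to a single ASCII letter or digit: the entry stays useful.
  if (!/^[A-Za-z0-9]$/.test(nfkc)) return "kept";
  return nfkc === tr39Target ? "redundant" : "conflict";
}

const samples: Array<[number, string]> = [
  [0x017f, "f"],  // long s
  [0x2160, "l"],  // roman numeral one
  [0x1d7ce, "O"], // mathematical bold digit zero
  [0x1d7cf, "l"], // mathematical bold digit one
  [0x1d41a, "a"], // mathematical bold small a: NFKC agrees with the target
  [0x0430, "a"],  // Cyrillic small a: NFKC leaves it unchanged
];

for (const [cp, target] of samples) {
  console.log(`U+${cp.toString(16).toUpperCase()} -> ${classify(cp, target)}`);
}
```

The first four rows come out as conflicts, the mathematical bold "a" as redundant, and the Cyrillic "а" as kept. Code points added in Unicode 16.0, such as the outlined letters and digits, are left out because older runtimes may not have normalization data for them.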
The practical impact
If you build a confusable detection system and also run NFKC normalization, you need to know about these 31 entries:
If you run NFKC first, then check confusables: The 31 entries are unreachable. NFKC has already transformed the character before your confusable check sees it. They’re dead code in your detection map, not a security hole, but worth filtering out to keep the map clean.
If you check confusables without NFKC: These entries produce correct visual detection results. That’s what confusables.txt is designed for. ſ does look like f, styled zeros do look like O, and styled ones do look like l. The confusable map is doing its job. For zeros and ones specifically, ASCII 0 and 1 themselves have confusable entries mapping to O and l, so the visual confusion is caught regardless of whether NFKC runs first.
If you use confusables for remapping (don’t do this): The problems compound. teſt becomes teft instead of test. account10 with a mathematical 1 and 0 becomes accountlO. As TR39 states, confusable mappings should not be used as normalization.
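To make the remapping failure concrete, here is a small sketch that misuses three of the divergent entries as a substitution table. `remapExcerpt` and `naiveRemap` are hypothetical names, and the function exists only to show the wrong output next to the NFKC result.

```typescript
// Three of the divergent entries, misused here as a remapping table.
const remapExcerpt = new Map<string, string>([
  ["\u017F", "f"],    // long s -> f
  ["\u{1D7CF}", "l"], // mathematical bold digit one -> l
  ["\u{1D7CE}", "O"], // mathematical bold digit zero -> O
]);

// Replace every mapped character with its confusable target (the anti-pattern).
function naiveRemap(input: string): string {
  return Array.from(input, (ch) => remapExcerpt.get(ch) ?? ch).join("");
}

const longSInput = "te\u017Ft";
const styledDigitsInput = "account\u{1D7CF}\u{1D7CE}";

console.log(naiveRemap(longSInput), longSInput.normalize("NFKC"));
// teft test
console.log(naiveRemap(styledDigitsInput), styledDigitsInput.normalize("NFKC"));
// accountlO account10
```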
What to do about it
The approach depends on how you use confusables:
If you use confusables for detection and rejection (recommended)
Filter your confusable map to exclude any character that NFKC already handles. This keeps your map clean and ensures every entry represents a character your system will actually encounter:
// rawEntries: [sourceCodePoint, latinTarget] pairs parsed from confusables.txt
// (the name is assumed here; the loop is implied by the `continue` statements)
for (const [sourceCp, confusableTarget] of rawEntries) {
  const sourceChar = String.fromCodePoint(sourceCp);
  const nfkcResult = sourceChar.normalize("NFKC").toLowerCase();
  // NFKC already maps to a single Latin letter/digit - skip this entry
  // (either same target = redundant, or different target = conflict)
  if (/^[a-z0-9]$/.test(nfkcResult)) continue;
  // NFKC produces a longer valid slug fragment - skip (already handled)
  if (/^[a-z0-9-]+$/.test(nfkcResult)) continue;
  // NFKC doesn't resolve to ASCII - keep this confusable entry
  entries.push({ source: sourceCp, target: confusableTarget });
}
This takes you from ~6,565 raw TR39 entries to 605 that are meaningful after NFKC (613 once the 8 supplemental entries are added). Every remaining entry is a character that survives NFKC unchanged and visually mimics a Latin letter or digit.
In namespace-guard, this is how it works in practice: NFKC is applied first during normalization when storing and comparing slugs. The confusable map then runs on the normalized input as a completely separate validation step — a blocklist. If any character in the normalized slug matches the map, the slug is rejected. No remapping, no skeleton, no merged output. Just: “does this string contain a character that looks like a Latin letter but isn’t one? If yes, reject.”
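The two-step shape described above might look roughly like this. It is a sketch under stated assumptions, not namespace-guard's actual API: `blockedChars`, `SlugCheck`, and `checkSlug` are hypothetical names, and the three-entry set stands in for the full filtered map.

```typescript
// Hypothetical three-entry blocklist; a real one would hold the filtered TR39 set.
const blockedChars = new Set<string>([
  "\u0430", // Cyrillic small a
  "\u03BF", // Greek small omicron
  "\u0455", // Cyrillic small dze, which resembles Latin s
]);

interface SlugCheck {
  ok: boolean;
  slug: string;       // the normalized form that was checked
  offending?: string; // first blocked character, if any
}

function checkSlug(input: string): SlugCheck {
  // Step 1: normalization for storage and comparison.
  const slug = input.normalize("NFKC").toLowerCase();
  // Step 2: a separate validation pass that rejects and never remaps.
  for (const ch of Array.from(slug)) {
    if (blockedChars.has(ch)) return { ok: false, slug, offending: ch };
  }
  return { ok: true, slug };
}

console.log(checkSlug("\uFF28ello").ok);  // true: fullwidth H was normalized away
console.log(checkSlug("\u0430dmin").ok);  // false: Cyrillic a survives NFKC
```

One design note: because the check runs on the lowercased form, blocklist entries need to be characters that survive lowercasing, otherwise they can never match.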
If you run confusables without NFKC
The full confusables.txt map works as designed. These 31 entries encode correct visual judgments: ſ does look like f, styled zeros do look like O, styled ones do look like l. No filtering needed.
Making it reproducible
Rather than hand-curating a confusable map (which becomes stale when Unicode ships new versions), I wrote a generator script that:
- Downloads confusables.txt from unicode.org
- Extracts all single-character → Latin letter/digit mappings
- Filters out NFKC-redundant and NFKC-conflicting entries
- Adds supplemental mappings for known gaps (e.g., Latin small capitals that confusables.txt misses)
- Outputs a TypeScript object literal, grouped by Unicode block
The script prints stats to stderr so you can verify the filtering:
Filtered to 605 entries from TR39
Skipped 31 NFKC-conflict entries (NFKC maps to different Latin char)
Skipped 766 NFKC-handled entries (NFKC produces valid slug fragment)
Added 8 supplemental entries (Latin small capitals)
Total: 613 entries
When a new Unicode version ships, re-run the script and you get an updated map automatically filtered against the current runtime’s NFKC implementation. The exact counts depend on two things: the version of confusables.txt you download, and your runtime’s Unicode data tables (what String.prototype.normalize uses). The numbers in this post are from the current Unicode 16.0 data.
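For readers who want to build something similar, the extraction step from the list above could be sketched as follows. This is not the code in scripts/generate-confusables.ts: `RawEntry` and `parseConfusablesLine` are hypothetical names, and the sketch assumes the semicolon-separated layout of confusables.txt data rows (source, target sequence, type, then a `#` comment).

```typescript
interface RawEntry {
  source: number; // source code point
  target: string; // single ASCII letter or digit
}

// Parse one line shaped like "<source> ; <target...> ; <type> # comment".
// Returns null for comments, blank lines, and anything that is not a single
// code point mapping to a single ASCII letter or digit.
function parseConfusablesLine(line: string): RawEntry | null {
  const data = line.split("#")[0].trim();
  if (data === "") return null;
  const fields = data.split(";").map((f) => f.trim());
  if (fields.length < 2) return null;
  const sourceCps = fields[0].split(/\s+/);
  const targetCps = fields[1].split(/\s+/);
  if (sourceCps.length !== 1 || targetCps.length !== 1) return null;
  const source = parseInt(sourceCps[0], 16);
  const targetCp = parseInt(targetCps[0], 16);
  if (Number.isNaN(source) || Number.isNaN(targetCp)) return null;
  const target = String.fromCodePoint(targetCp);
  return /^[A-Za-z0-9]$/.test(target) ? { source, target } : null;
}

// A line shaped like the long-s row.
const sampleLine = "017F ;\t0066 ;\tMA\t# LATIN SMALL LETTER LONG S -> LATIN SMALL LETTER F";
console.log(parseConfusablesLine(sampleLine)); // source 0x17F, target "f"
```

Entries returned by a parser like this would then go through the NFKC filter before anything is written out.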
The broader lesson
Unicode is not one monolithic standard. It’s a collection of semi-independent specifications maintained by different working groups. UAX #15 (normalization) and UTS #39 (security) were designed for different use cases and don’t explicitly account for each other.
The 31 divergent entries aren’t a bug in either standard. confusables.txt mappings are visual judgments. NFKC mappings are semantic equivalences. Both are correct in their own context. If you build a confusable map for use after NFKC, knowing where they diverge lets you filter your map down to entries that will actually fire.
The NFKC-aware confusable map (613 entries, ~2.5 KB gzipped) ships as part of namespace-guard, a zero-dependency TypeScript library for slug/handle validation. The generator script is at scripts/generate-confusables.ts.
Thanks to ficiek, v4ss42, nemec, LousyBeggar, carrottread, medforddad, Herb_Derb, and DontBuyAwards on r/programming for feedback that shaped this revision.