To understand why every machine since Gutenberg has wrestled this script and mostly lost, you need one structural fact: Arabic is cursive always. There is no print-versus-handwriting distinction, no block letters. The letters connect in stone inscriptions, in manuscripts, in metal, on screens. Each letter therefore changes shape depending on its neighbours (an isolated form, an initial, a medial, a final), and six letters refuse to connect forward at all, which breaks words into joined clusters and gives the script its rhythm. The shapes are not costumes over some underlying "real" letter. The positional variation is the letter.
And the alphabet is bigger than Arabic the language. Persian extends it with four letters Arabic does not have (پ pe, چ che, ژ zhe, گ gaf) and uses two of the existing letters in subtly different forms (ی for the final yāʾ, ک for kaf). Urdu adds an aspirated do-chashmī he (ھ), a retroflex set (ٹ ڈ ڑ), and a hanging ye barree (ے), and writes most of its everyday text in Nastaʿlīq, which a Naskh-shaped font will produce as a phonetically correct but visually unrecognisable approximation. Sindhi has more again. Pashto, Kurdish, Uyghur, Kashmiri, and Punjabi each take the alphabet, add what their phonology requires, and ship. Any font that calls itself "Arabic" without consulting the Persian and Urdu communities will produce, for hundreds of millions of readers in Iran and South Asia, text that is technically rendered but functionally wrong: the kaf has the wrong terminal, the heh fuses where it shouldn't, the digits are from the wrong belt. The Noto Sans Arabic family ships separate sub-fonts to cover these (NotoNaskhArabic, NotoNastaliqUrdu, NotoSansArabicUI), and OS font fallback chains usually get it right. Usually.
| stored codepoint | isolated | initial | medial | final |
|---|---|---|---|---|
| U+0639 ʿAYN | ع | عـ | ـعـ | ـع |
| U+0647 HEH | ه | هـ | ـهـ | ـه |
One codepoint, four shapes, chosen at render time by the shaping engine. The medial heh and the isolated heh are, to an untrained eye, different letters; I have watched students of Arabic meet the medial heh in week three and file a complaint with management. A Latin font that ships 26 lowercase shapes needs no opinion about any of this. An Arabic font is wrong unless it has opinions about all of it.
The arrangement we eventually settled on, after decades of wrong answers, is this: the encoding stores the abstract letter, and the font supplies the shapes. Unicode gives you one codepoint for ʿayn; the font carries the four positional glyphs; a shaping engine applies the OpenType features (isol, init, medi, fina, plus rlig for the ligatures the script requires, plus mark and mkmk for stacking the vowel signs) at render time. An Arabic font is a small program. The text you store is its input, not its output. The word is performed fresh every time you look at it, like music from a score.
The cleanest way to feel this is to assemble a word one letter at a time and watch every prior letter renegotiate its shape as the next one arrives:
click letters to add them to the word, in the order they appear in the codepoint stream:
RENDERED BY THE SHAPING ENGINE
—
Try م, then ح, then م, then د: build the name Muḥammad. The first م drops into its initial form the moment you add the ح, the ح goes medial when the next م arrives, and so on through to the د, one of the six non-joining letters, which interrupts the flow and forces what would have been the third م into a final form. Four codepoints in storage, one continuous stroke on screen. None of it happens without a shaping engine; a PDF generator that lacks one will render the same four codepoints as four disconnected isolated forms.
The wrong answers are still in the standard, fossilised, and they make excellent souvenirs. Before shaping engines existed, the 8-bit code pages of the DOS and early Windows era encoded the shapes themselves: a separate character for initial ʿayn, medial ʿayn, and so on. Unicode, which promised round-trip compatibility with anything else, had to swallow those sets whole, and they live on at U+FB50 through U+FEFF under the name Arabic Presentation Forms: several hundred codepoints that no new document should ever contain and that PDF text extractors merrily emit to this day, which is one of the reasons searching an Arabic PDF so often fails in silence. The haystack is encoded as shapes and your needle is encoded as letters. My favourite resident of the block, and one of my favourite characters in all of Unicode, is U+FDFD, ﷽ : four-word invocation, bismillāh ar-raḥmān ar-raḥīm, as a single codepoint. A monument from the era when rendering was baked into the encoding because nobody trusted the renderer to do anything, preserved forever, like a fly in amber that recites.
This bites because the two encodings render identically and compare differently. The customer search bug I mentioned at the top of this article was, specifically, this:
And if you want to know what the world looks like when software skips all of this, the shaping engine, the bidi algorithm, the whole apparatus, you do not have to imagine it, because an enormous amount of software still skips all of it:
مرحبا بالعالم، هذا نص عربي
The text says "hello, world, this is Arabic text." Tick the box for the version every Arabic reader has met in the wild: on a shop sign, a boarding pass, a watermark, an old film title. Every letter drops into its isolated form and the line is laid out left to right, backwards. This is what a program produces when it draws characters one at a time and never consults a shaping engine: old Photoshop did it, matplotlib still does it out of the box, many of the PDF generators on npm do it, receipt printers do it with conviction. The standard Python workaround, arabic_reshaper plus python-bidi, fixes it by pre-baking the shaped forms into the string using that fossil block from the paragraph above.