High fidelity font synthesis for CJK languages

Original link: https://github.com/kaonashi-tyc/zi2zi-JiT

## zi2zi-JiT: Chinese Font Style Transfer

zi2zi-JiT is a diffusion Transformer model built on the JiT architecture for synthesizing Chinese fonts. It transfers the style of a reference glyph onto a source character, enabling font style modification. The model uses a content encoder (from FontDiffuser) to capture character structure, a style encoder to extract stylistic features, and a multi-source in-context mixing approach for conditioning.

Two variants, JiT-B/16 and JiT-L/16, were trained on a large dataset of more than 300k character images spanning over 400 fonts (mostly simplified and traditional Chinese, with a small amount of Japanese). Evaluation metrics (FID, SSIM, LPIPS, L1) show strong performance.

The project provides tools for dataset creation, fine-tuning (with LoRA on a single GPU, about 4 GB of VRAM), and character generation. Fine-tuning a single font can be completed within an hour. Pretrained checkpoints are available; attribution is required when distributing a product containing more than 200 characters derived from the project. The code is MIT-licensed, with an additional clause covering font outputs.

## High-Quality CJK Font Synthesis with zi2zi-JiT

Developer kaonashi-tyc has been making progress on **zi2zi-JiT**, a project aimed at creating practical, production-grade CJK (Chinese, Japanese, Korean) fonts through font synthesis. Dissatisfied with existing font generation techniques, the author built on the original zi2zi project, adopting a Transformer architecture for better results.

So far, two complete Chinese fonts (each containing the 6,763 characters of GB2312) have been generated from ancient Chinese texts and calligraphy, and both are **free for commercial use**. They are available on GitHub: [https://github.com/kaonashi-tyc/Zi-QuanHengDuLiang](https://github.com/kaonashi-tyc/Zi-QuanHengDuLiang) and [https://github.com/kaonashi-tyc/Zi-XuanZongTi](https://github.com/kaonashi-tyc/Zi-XuanZongTi).

Challenges remain, especially in reproducing ancient script forms such as seal script, for which training data is limited. The author welcomes feedback to further refine the project and its capabilities. The project applies style transfer techniques and is aptly named "zi2zi", which means "character to character" in Chinese.

Original article

zi2zi-JiT logo

zi2zi-JiT: Font Synthesis with Pixel Space Diffusion Transformers


zi2zi-JiT is a conditional variant of JiT (Just image Transformer) designed for Chinese font style transfer. Given a source character and a style reference, it synthesizes the character in the target font style.

The architecture extends the base JiT model with three components:

  • Content Encoder — a CNN that captures the structural layout of the input character, adapted from FontDiffuser.
  • Style Encoder — a CNN that extracts stylistic features from a reference glyph in the target font.
  • Multi-Source In-Context Mixing — instead of conditioning on a single category token as in the original JiT, font, style, and content embeddings are concatenated into a unified conditioning sequence.
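
The mixing step above can be sketched as follows. The widths and token counts here (`d`, `font_emb`, `style_tokens`, `content_tokens`) are illustrative assumptions, not the repository's actual dimensions or API:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768                                        # model width (illustrative)

# Hypothetical conditioning sources, one row per token:
font_emb = rng.standard_normal((1, d))         # learned font-ID embedding
style_tokens = rng.standard_normal((16, d))    # style-encoder output
content_tokens = rng.standard_normal((16, d))  # content-encoder output

# Multi-source in-context mixing: rather than a single class token,
# all conditioning embeddings are concatenated into one sequence the
# transformer attends to alongside the noisy image patches.
cond_seq = np.concatenate([font_emb, style_tokens, content_tokens], axis=0)
print(cond_seq.shape)  # (33, 768)
```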

Two model variants are available — JiT-B/16 and JiT-L/16 — both trained for 2,000 epochs on a corpus of over 400 fonts (70% simplified Chinese, 20% traditional Chinese, 10% Japanese), totalling more than 300k character images. For each font, the number of characters used for training is capped at 800.

Generated glyphs are evaluated against ground-truth references following the protocol in FontDiffuser. All metrics are computed over 2,400 pairs.

| Model    | FID ↓ | SSIM ↑ | LPIPS ↓ | L1 ↓   |
|----------|-------|--------|---------|--------|
| JiT-B/16 | 53.81 | 0.6753 | 0.2024  | 0.1071 |
| JiT-L/16 | 56.01 | 0.6794 | 0.1967  | 0.1043 |

Set up the environment:

```shell
conda env create -f environment.yaml
conda activate zi2zi-jit
pip install -e .
```

Pretrained checkpoints are available on Google Drive:

Download Models

Download the desired checkpoint and place it under models/:

```shell
mkdir -p models
# zi2zi-JiT-B-16.pth  (Base variant)
# zi2zi-JiT-L-16.pth  (Large variant)
```

Generate a paired dataset from a source font and a directory of target fonts:

```shell
python scripts/generate_font_dataset.py \
    --source-font data/思源宋体light.otf \
    --font-dir   data/sample_single_font \
    --output-dir data/sample_dataset
```

This produces the following structure:

```
data/sample_dataset/
├── train/
│   ├── 001_FontA/
│   │   ├── 00000_U+XXXX.jpg
│   │   ├── 00001_U+XXXX.jpg
│   │   ├── ...
│   │   └── metadata.json
│   ├── 002_FontB/
│   │   └── ...
│   └── ...
├── test/
│   ├── 001_FontA/
│   │   └── ...
│   └── ...
└── test.npz
```

Each .jpg is a 1024x256 composite of four 256x256 panels, left to right: source | target | ref_grid_1 | ref_grid_2.
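
Given that layout, a composite can be split back into its panels with Pillow. This is a small illustrative helper, not part of the repository:

```python
from PIL import Image

def split_composite(img: Image.Image) -> list[Image.Image]:
    """Split a 1024x256 composite into its four 256x256 panels:
    source | target | ref_grid_1 | ref_grid_2 (left to right)."""
    assert img.size == (1024, 256), f"unexpected size {img.size}"
    return [img.crop((i * 256, 0, (i + 1) * 256, 256)) for i in range(4)]

# Example with a synthetic blank composite:
composite = Image.new("L", (1024, 256), color=255)
source, target, ref1, ref2 = split_composite(composite)
print(source.size)  # (256, 256)
```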

From rendered glyph images

Alternatively, build a dataset from a directory of rendered character images. Each file should be a 256x256 PNG named by its character:

```
data/sample_glyphs/
├── 万.png
├── 上.png
├── 中.png
├── 人.png
├── 大.png
└── ...
```
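
Before running the dataset script, it can help to sanity-check the glyph directory against the conventions above (256x256 PNGs, each named by a single character). This validator is a sketch, not part of the repository:

```python
from pathlib import Path
from PIL import Image

def validate_glyph_dir(glyph_dir, size=(256, 256)):
    """Return a list of problems: files whose name is not a single
    character, or whose image dimensions differ from `size`."""
    problems = []
    for p in sorted(Path(glyph_dir).glob("*.png")):
        if len(p.stem) != 1:
            problems.append(f"{p.name}: name must be a single character")
            continue
        with Image.open(p) as img:
            if img.size != size:
                problems.append(f"{p.name}: size {img.size}, expected {size}")
    return problems

# Example: write one conforming placeholder glyph and validate.
d = Path("data/sample_glyphs_check")
d.mkdir(parents=True, exist_ok=True)
Image.new("L", (256, 256), 255).save(d / "中.png")
print(validate_glyph_dir(d))  # [] when everything conforms
```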

```shell
python scripts/generate_glyph_dataset.py \
    --source-font data/思源宋体light.otf \
    --glyph-dir   data/sample_glyphs \
    --output-dir  data/sample_glyph_dataset \
    --train-count 200
```

Fine-tune a pretrained model on a single GPU with LoRA. Fine-tuning a single font typically takes less than one hour on a single H100. The example below uses JiT-B/16 with batch size 16, which requires roughly 4 GB of VRAM:

```shell
python lora_single_gpu_finetune_jit.py \
    --data_path       data/sample_dataset/train/ \
    --test_npz_path   data/sample_dataset/test.npz \
    --output_dir      run/lora_ft_sample_single/ \
    --base_checkpoint models/zi2zi-JiT-B-16.pth \
    --model           JiT-B/16 \
    --num_fonts       1000 \
    --num_chars       20000 \
    --max_chars_per_font 200 \
    --img_size        256 \
    --lora_r          32 \
    --lora_alpha      32 \
    --lora_targets    "qkv,proj,w12,w3" \
    --epochs          200 \
    --batch_size      16 \
    --blr             8e-4 \
    --warmup_epochs   1 \
    --save_last_freq  10 \
    --proj_dropout    0.1 \
    --P_mean          -0.8 \
    --P_std           0.8 \
    --noise_scale     1.0 \
    --cfg             2.6 \
    --online_eval \
    --eval_step_folders \
    --eval_freq       10 \
    --gen_bsz         16 \
    --num_images      400 \
    --seed            42
```

Key parameters:

| Parameter | Note |
|-----------|------|
| --num_fonts, --num_chars | Tied to the pretrained model's embedding size. Do not change unless pretraining from scratch. |
| --max_chars_per_font | Caps the number of characters used from each font. |
| --lora_r, --lora_alpha | LoRA capacity. Higher values give more capacity at the cost of memory. |
| --batch_size | 16 uses ~4 GB VRAM. |
| --cfg | Conditioning strength. Use 2.6 for JiT-B/16, 2.4 for JiT-L/16. |
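
The effect of --lora_r and --lora_alpha can be illustrated with the standard LoRA update (a generic sketch, not this repository's implementation): the adapted weight is W + (alpha/r)·B·A, where only A and B are trained, so both added capacity and memory scale with the rank r:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in = d_out = 768           # illustrative layer width
r, alpha = 32, 32            # matches --lora_r 32 --lora_alpha 32 above

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trained down-projection
B = rng.standard_normal((d_out, r))        # trained up-projection (zero-
                                           # initialized in real LoRA, so
                                           # training starts exactly at W)

# The adapter adds a rank-r update scaled by alpha / r:
delta = (alpha / r) * (B @ A)
W_adapted = W + delta

# Only A and B are trained: r * (d_in + d_out) parameters for this layer
# instead of the d_in * d_out needed by a full fine-tune.
print(np.linalg.matrix_rank(delta), A.size + B.size, W.size)
```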

Generate characters from a fine-tuned checkpoint:

```shell
python generate_chars.py \
    --checkpoint run/lora_ft_sample_single/checkpoint-last.pth \
    --test_npz   data/sample_dataset/test.npz \
    --output_dir run/generated_chars/
```

Notes for generate_chars.py:

  • Supported samplers are euler, heun, and ab2.
  • If --num_sampling_steps is not set, the script uses method-specific defaults: euler -> 20, heun -> 50, ab2 -> 20.
  • If neither --sampling_method nor --num_sampling_steps is overridden, the script keeps the checkpoint's saved inference settings.
  • Current recommended fast setting: --sampling_method ab2 --cfg 2.6 and let the default 20 steps apply.
  • heun-50 is kept as a conservative legacy/reference baseline. In the current 50-sample MPS benchmark, ab2-20 and euler-20 were both faster and scored better than heun-50 on SSIM, LPIPS, and L1.
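
The samplers above differ in how they integrate the probability-flow ODE: euler is first-order, heun is second-order with an extra model evaluation per step, and ab2 is a two-step multistep method. As a toy illustration of the first-order case (not the repository's sampler), here is a fixed-step Euler integrator on a velocity field with a known solution:

```python
import numpy as np

def euler_sample(x0, velocity, steps=20, t0=1.0, t1=0.0):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with fixed Euler steps."""
    x, t = np.asarray(x0, dtype=float), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        x = x + velocity(x, t) * dt
        t += dt
    return x

# Toy field dx/dt = x has exact solution x(t) = x(1) * exp(t - 1), so
# integrating from t=1 down to t=0 should shrink x by a factor of e.
approx = euler_sample([1.0], lambda x, t: x, steps=20)
exact = np.exp(-1.0)
print(approx, exact)  # the Euler error shrinks as steps increase
```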

Example fast generation command:

```shell
python generate_chars.py \
    --checkpoint run/lora_ft_sample_single/checkpoint-last.pth \
    --test_npz   data/sample_dataset/test.npz \
    --output_dir run/generated_chars_ab2/ \
    --sampling_method ab2
```

Compute pairwise metrics (SSIM, LPIPS, L1, FID) on the generated comparison grids:

```shell
python scripts/compute_pairwise_metrics.py \
    --device cuda \
    run/lora_ft_sample_single/heun-steps50-cfg2.6-interval0.0-1.0-image400-res256/step_10/compare/
```
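
Of the reported metrics, L1 is the simplest: the mean absolute difference between paired images scaled to [0, 1]. A minimal numpy version (illustrative, not the project's implementation):

```python
import numpy as np

def l1_distance(a, b):
    """Mean absolute pixel difference between two images scaled to [0, 1]."""
    a = np.asarray(a, dtype=float) / 255.0
    b = np.asarray(b, dtype=float) / 255.0
    return float(np.mean(np.abs(a - b)))

# Identical images score 0; an all-black vs all-white pair scores 1.
black = np.zeros((256, 256), dtype=np.uint8)
white = np.full((256, 256), 255, dtype=np.uint8)
print(l1_distance(black, black))  # 0.0
print(l1_distance(black, white))  # 1.0
```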

Fonts created with zi2zi-JiT:

Ground truth on the left, generated output on the right

Code is licensed under MIT. Generated font outputs are additionally subject to the "Font Artifact License Addendum" in LICENSE:

  • commercial use is allowed
  • attribution is required when distributing a font product that uses more than 200 characters created from repository artifacts

This project builds on code and ideas from:

  • FontDiffuser — content/style encoder design and evaluation protocol
  • JiT — base diffusion transformer architecture

```bibtex
@article{zi2zi-jit,
  title   = {zi2zi-JiT: Font Synthesis with Pixel Space Diffusion Transformers},
  author  = {Yuchen Tian},
  year    = {2026},
  url     = {https://github.com/kaonashi-tyc/zi2zi-jit}
}
```