zi2zi-JiT is a conditional variant of JiT (Just image Transformer) designed for Chinese font style transfer. Given a source character and a style reference, it synthesizes the character in the target font style.
The architecture, illustrated above, extends the base JiT model with three components:
- Content Encoder — a CNN that captures the structural layout of the input character, adapted from FontDiffuser.
- Style Encoder — a CNN that extracts stylistic features from a reference glyph in the target font.
- Multi-Source In-Context Mixing — instead of conditioning on a single category token as in the original JiT, font, style, and content embeddings are concatenated into a unified conditioning sequence.
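The mixing step can be pictured as flattening each condition source into tokens and concatenating them into one sequence that the transformer attends over. A minimal numpy sketch; the token counts and embedding width here are illustrative assumptions, not the repository's actual dimensions:

```python
import numpy as np

d = 768            # assumed embedding width
num_patches = 256  # assumed token count per encoder output

rng = np.random.default_rng(0)
font_emb    = rng.standard_normal((1, d))            # learned per-font embedding
style_seq   = rng.standard_normal((num_patches, d))  # style-encoder features of the reference glyph
content_seq = rng.standard_normal((num_patches, d))  # content-encoder features of the source glyph

# Instead of a single category token, all condition sources are
# concatenated into one in-context conditioning sequence.
cond = np.concatenate([font_emb, style_seq, content_seq], axis=0)
assert cond.shape == (1 + 2 * num_patches, d)
```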
Two model variants are available: JiT-B/16 and JiT-L/16. Both were trained for 2,000 epochs on a corpus of over 400 fonts (70% simplified Chinese, 20% traditional Chinese, 10% Japanese), totalling more than 300k character images. For each font, the number of characters used for training is capped at 800.
Generated glyphs are evaluated against ground-truth references following the protocol in FontDiffuser. All metrics are computed over 2,400 pairs.
| Model | FID ↓ | SSIM ↑ | LPIPS ↓ | L1 ↓ |
|---|---|---|---|---|
| JiT-B/16 | 53.81 | 0.6753 | 0.2024 | 0.1071 |
| JiT-L/16 | 56.01 | 0.6794 | 0.1967 | 0.1043 |
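As a rough guide to what these numbers measure, the L1 column can be read as mean absolute pixel error on images scaled to [0, 1]. A minimal numpy sketch under that assumption (the repository's exact implementation may differ, and SSIM, LPIPS, and FID require their own reference implementations):

```python
import numpy as np

def l1_metric(a, b):
    """Mean absolute pixel error, with uint8 images scaled to [0, 1]."""
    a = a.astype(np.float64) / 255.0
    b = b.astype(np.float64) / 255.0
    return float(np.mean(np.abs(a - b)))

gt  = np.full((256, 256), 255, dtype=np.uint8)  # all-white ground-truth canvas
gen = np.full((256, 256), 204, dtype=np.uint8)  # uniformly darker generated canvas
assert abs(l1_metric(gt, gen) - 51 / 255) < 1e-12
```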
```
conda env create -f environment.yaml
conda activate zi2zi-jit
pip install -e .
```

Pretrained checkpoints are available on Google Drive:
Save the desired checkpoint and place it under `models/`:

```
mkdir -p models
# zi2zi-JiT-B-16.pth (Base variant)
# zi2zi-JiT-L-16.pth (Large variant)
```

Generate a paired dataset from a source font and a directory of target fonts:
```
python scripts/generate_font_dataset.py \
    --source-font data/思源宋体light.otf \
    --font-dir data/sample_single_font \
    --output-dir data/sample_dataset
```

This produces the following structure:
```
data/sample_dataset/
├── train/
│   ├── 001_FontA/
│   │   ├── 00000_U+XXXX.jpg
│   │   ├── 00001_U+XXXX.jpg
│   │   ├── ...
│   │   └── metadata.json
│   ├── 002_FontB/
│   │   └── ...
│   └── ...
├── test/
│   ├── 001_FontA/
│   │   └── ...
│   └── ...
└── test.npz
```
Each `.jpg` is a 1024x256 composite of four 256-wide panels: source (256) | target (256) | ref_grid_1 (256) | ref_grid_2 (256).
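Under that layout, a sample can be split back into its four panels by slicing fixed-width columns. A small sketch (numpy, with a zero image as a stand-in for a real sample):

```python
import numpy as np

# A composite is 256 px tall and 4 * 256 px wide:
# [ source | target | ref_grid_1 | ref_grid_2 ]
composite = np.zeros((256, 1024, 3), dtype=np.uint8)

def split_panels(img, panel_w=256):
    """Split a horizontal composite into equal-width panels."""
    return [img[:, i:i + panel_w] for i in range(0, img.shape[1], panel_w)]

source, target, ref_grid_1, ref_grid_2 = split_panels(composite)
assert source.shape == (256, 256, 3)
```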
Alternatively, build a dataset from a directory of rendered character images. Each file should be a 256x256 PNG named by its character:
```
data/sample_glyphs/
├── 万.png
├── 上.png
├── 中.png
├── 人.png
├── 大.png
└── ...
```
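The glyph filenames map directly onto the `U+XXXX` labels used in the paired dataset above. A small helper illustrating the correspondence (this exact helper is not part of the repository):

```python
from pathlib import Path

def glyph_codepoint(path):
    """Map a glyph file like '万.png' to its 'U+XXXX' codepoint label."""
    char = Path(path).stem
    return f"U+{ord(char):04X}"

assert glyph_codepoint("data/sample_glyphs/万.png") == "U+4E07"
assert glyph_codepoint("中.png") == "U+4E2D"
```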
```
python scripts/generate_glyph_dataset.py \
    --source-font data/思源宋体light.otf \
    --glyph-dir data/sample_glyphs \
    --output-dir data/sample_glyph_dataset \
    --train-count 200
```

Fine-tune a pretrained model on a single GPU with LoRA. Fine-tuning on a single font typically takes less than an hour on one H100. The example below uses JiT-B/16 with batch size 16, which requires roughly 4 GB of VRAM.
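Conceptually, LoRA freezes each targeted weight matrix and learns a low-rank additive update scaled by `alpha / r`, which is why capacity and memory grow with `--lora_r`. A toy numpy sketch of the adapted forward pass, with illustrative shapes (not the repository's actual layer code):

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, d_out, r, alpha = 768, 768, 32, 32    # mirrors --lora_r 32 --lora_alpha 32

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight (e.g. a qkv projection)
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus low-rank update, scaled by alpha / r.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((16, d_in))         # a batch of 16 token vectors
# With B zero-initialized, training starts exactly at the pretrained model.
assert np.allclose(lora_forward(x), x @ W.T)
```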
```
python lora_single_gpu_finetune_jit.py \
    --data_path data/sample_dataset/train/ \
    --test_npz_path data/sample_dataset/test.npz \
    --output_dir run/lora_ft_sample_single/ \
    --base_checkpoint models/zi2zi-JiT-B-16.pth \
    --model JiT-B/16 \
    --num_fonts 1000 \
    --num_chars 20000 \
    --max_chars_per_font 200 \
    --img_size 256 \
    --lora_r 32 \
    --lora_alpha 32 \
    --lora_targets "qkv,proj,w12,w3" \
    --epochs 200 \
    --batch_size 16 \
    --blr 8e-4 \
    --warmup_epochs 1 \
    --save_last_freq 10 \
    --proj_dropout 0.1 \
    --P_mean -0.8 \
    --P_std 0.8 \
    --noise_scale 1.0 \
    --cfg 2.6 \
    --online_eval \
    --eval_step_folders \
    --eval_freq 10 \
    --gen_bsz 16 \
    --num_images 400 \
    --seed 42
```

Key parameters:
| Parameter | Note |
|---|---|
| `--num_fonts`, `--num_chars` | Tied to the pretrained model's embedding size. Do not change unless pretraining from scratch. |
| `--max_chars_per_font` | Caps the number of characters used from each font. |
| `--lora_r`, `--lora_alpha` | LoRA capacity. Higher values give more capacity at the cost of memory. |
| `--batch_size` | 16 uses ~4 GB VRAM. |
| `--cfg` | Conditioning strength. Use 2.6 for JiT-B/16, 2.4 for JiT-L/16. |
Generate characters from a fine-tuned checkpoint:

```
python generate_chars.py \
    --checkpoint run/lora_ft_sample_single/checkpoint-last.pth \
    --test_npz data/sample_dataset/test.npz \
    --output_dir run/generated_chars/
```

Notes for `generate_chars.py`:
- Supported samplers are `euler`, `heun`, and `ab2`.
- If `--num_sampling_steps` is not set, the script uses method-specific defaults: `euler -> 20`, `heun -> 50`, `ab2 -> 20`.
- If neither `--sampling_method` nor `--num_sampling_steps` is overridden, the script keeps the checkpoint's saved inference settings.
- Current recommended fast setting: `--sampling_method ab2 --cfg 2.6`, letting the default 20 steps apply. `heun-50` is kept as a conservative legacy/reference baseline. In the current 50-sample MPS benchmark, `ab2-20` and `euler-20` were both faster and scored better than `heun-50` on SSIM, LPIPS, and L1.
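The three samplers correspond to standard ODE integrators, which explains the benchmark above: `euler` costs one model evaluation per step, `heun` adds a corrector (two evaluations per step), and `ab2` (Adams-Bashforth 2) reuses the previous evaluation to reach second-order accuracy at Euler's per-step cost. A toy illustration on dx/dt = -x, not the repository's actual sampling loop:

```python
import math

def f(t, x):
    return -x          # stand-in for the learned drift; dx/dt = -x

def integrate(method, x0, n_steps):
    h = 1.0 / n_steps
    x, t = x0, 0.0
    f_prev = f(t, x)   # AB2 bootstrap: seed the history with the initial slope
    for _ in range(n_steps):
        if method == "euler":
            x = x + h * f(t, x)                      # one evaluation
        elif method == "heun":
            k1 = f(t, x)                             # predictor slope
            k2 = f(t + h, x + h * k1)                # corrector slope
            x = x + 0.5 * h * (k1 + k2)
        elif method == "ab2":
            k = f(t, x)                              # one new evaluation
            x = x + h * (1.5 * k - 0.5 * f_prev)     # reuse the previous one
            f_prev = k
        t += h
    return x

exact = math.exp(-1.0)
# Both second-order methods beat Euler at the same step count.
assert abs(integrate("heun", 1.0, 20) - exact) < abs(integrate("euler", 1.0, 20) - exact)
assert abs(integrate("ab2", 1.0, 20) - exact) < abs(integrate("euler", 1.0, 20) - exact)
```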
Example fast generation command:

```
python generate_chars.py \
    --checkpoint run/lora_ft_sample_single/checkpoint-last.pth \
    --test_npz data/sample_dataset/test.npz \
    --output_dir run/generated_chars_ab2/ \
    --sampling_method ab2
```

Compute pairwise metrics (SSIM, LPIPS, L1, FID) on the generated comparison grids:
```
python scripts/compute_pairwise_metrics.py \
    --device cuda \
    run/lora_ft_sample_single/heun-steps50-cfg2.6-interval0.0-1.0-image400-res256/step_10/compare/
```

Fonts created with zi2zi-JiT:
Ground truth on the left, generated output on the right.
Code is licensed under MIT. Generated font outputs are additionally subject to the "Font Artifact License Addendum" in LICENSE:
- Commercial use is allowed.
- Attribution is required when distributing a font product that uses more than 200 characters created from repository artifacts.
This project builds on code and ideas from:
- FontDiffuser — content/style encoder design and evaluation protocol
- JiT — base diffusion transformer architecture
```bibtex
@article{zi2zi-jit,
  title  = {zi2zi-JiT: Font Synthesis with Pixel Space Diffusion Transformers},
  author = {Yuchen Tian},
  year   = {2026},
  url    = {https://github.com/kaonashi-tyc/zi2zi-jit}
}
```
