zi2zi-JiT is a conditional variant of JiT (Just image Transformer) designed for Chinese font style transfer. Given a source character and a style reference, it synthesizes the character in the target font style.
The architecture, illustrated above, extends the base JiT model with three components:
- Content Encoder — a CNN that captures the structural layout of the input character, adapted from FontDiffuser.
- Style Encoder — a CNN that extracts stylistic features from a reference glyph in the target font.
- Multi-Source In-Context Mixing — instead of conditioning on a single category token as in the original JiT, font, style, and content embeddings are concatenated into a unified conditioning sequence.
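The mixing step can be pictured as flattening each condition source into tokens and concatenating them into one sequence that the transformer attends over. A minimal numpy sketch; the token counts and embedding width here are illustrative assumptions, not the repository's actual dimensions:

```python
import numpy as np

d = 768            # assumed embedding width
num_patches = 256  # assumed token count per encoder output

rng = np.random.default_rng(0)
font_emb    = rng.standard_normal((1, d))            # learned per-font embedding
style_seq   = rng.standard_normal((num_patches, d))  # style-encoder features of the reference glyph
content_seq = rng.standard_normal((num_patches, d))  # content-encoder features of the source glyph

# Instead of a single category token, all condition sources are
# concatenated into one in-context conditioning sequence.
cond = np.concatenate([font_emb, style_seq, content_seq], axis=0)
assert cond.shape == (1 + 2 * num_patches, d)
```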
Two model variants are available: JiT-B/16 and JiT-L/16. Both were trained for 2,000 epochs on a corpus of over 400 fonts (70% simplified Chinese, 20% traditional Chinese, 10% Japanese), totalling more than 300k character images. For each font, the number of characters used for training is capped at 800.
Generated glyphs are evaluated against ground-truth references following the protocol in FontDiffuser. All metrics are computed over 2,400 pairs.
| Model | FID ↓ | SSIM ↑ | LPIPS ↓ | L1 ↓ |
|---|---|---|---|---|
| JiT-B/16 | 53.81 | 0.6753 | 0.2024 | 0.1071 |
| JiT-L/16 | 56.01 | 0.6794 | 0.1967 | 0.1043 |
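As a rough guide to what these numbers measure, the L1 column can be read as mean absolute pixel error on images scaled to [0, 1]. A minimal numpy sketch under that assumption (the repository's exact implementation may differ, and SSIM, LPIPS, and FID require their own reference implementations):

```python
import numpy as np

def l1_metric(a, b):
    """Mean absolute pixel error, with uint8 images scaled to [0, 1]."""
    a = a.astype(np.float64) / 255.0
    b = b.astype(np.float64) / 255.0
    return float(np.mean(np.abs(a - b)))

gt  = np.full((256, 256), 255, dtype=np.uint8)  # all-white ground-truth canvas
gen = np.full((256, 256), 204, dtype=np.uint8)  # uniformly darker generated canvas
assert abs(l1_metric(gt, gen) - 51 / 255) < 1e-12
```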
```
conda env create -f environment.yaml
conda activate zi2zi-jit
pip install -e .
```

Pretrained checkpoints are available on Google Drive:
Save the desired checkpoint and place it under `models/`:

```
mkdir -p models
# zi2zi-JiT-B-16.pth (Base variant)
# zi2zi-JiT-L-16.pth (Large variant)
```

Generate a paired dataset from a source font and a directory of target fonts:
```
python scripts/generate_font_dataset.py \
    --source-font data/思源宋体light.otf \
    --font-dir data/sample_single_font \
    --output-dir data/sample_dataset
```

This produces the following structure:
```
data/sample_dataset/
├── train/
│   ├── 001_FontA/
│   │   ├── 00000_U+XXXX.jpg
│   │   ├── 00001_U+XXXX.jpg
│   │   ├── ...
│   │   └── metadata.json
│   ├── 002_FontB/
│   │   └── ...
│   └── ...
├── test/
│   ├── 001_FontA/
│   │   └── ...
│   └── ...
└── test.npz
```
Each `.jpg` is a 1024x256 composite of four 256-wide panels: source (256) | target (256) | ref_grid_1 (256) | ref_grid_2 (256).
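Under that layout, a sample can be split back into its four panels by slicing fixed-width columns. A small sketch (numpy, with a zero image as a stand-in for a real sample):

```python
import numpy as np

# A composite is 256 px tall and 4 * 256 px wide:
# [ source | target | ref_grid_1 | ref_grid_2 ]
composite = np.zeros((256, 1024, 3), dtype=np.uint8)

def split_panels(img, panel_w=256):
    """Split a horizontal composite into equal-width panels."""
    return [img[:, i:i + panel_w] for i in range(0, img.shape[1], panel_w)]

source, target, ref_grid_1, ref_grid_2 = split_panels(composite)
assert source.shape == (256, 256, 3)
```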
Alternatively, build a dataset from a directory of rendered character images. Each file should be a 256x256 PNG named by its character:
```
data/sample_glyphs/
├── 万.png
├── 上.png
├── 中.png
├── 人.png
├── 大.png
└── ...
```
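The glyph filenames map directly onto the `U+XXXX` labels used in the paired dataset above. A small helper illustrating the correspondence (this exact helper is not part of the repository):

```python
from pathlib import Path

def glyph_codepoint(path):
    """Map a glyph file like '万.png' to its 'U+XXXX' codepoint label."""
    char = Path(path).stem
    return f"U+{ord(char):04X}"

assert glyph_codepoint("data/sample_glyphs/万.png") == "U+4E07"
assert glyph_codepoint("中.png") == "U+4E2D"
```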
```
python scripts/generate_glyph_dataset.py \
    --source-font data/思源宋体light.otf \
    --glyph-dir data/sample_glyphs \
    --output-dir data/sample_glyph_dataset \
    --train-count 200
```

Fine-tune a pretrained model on a single GPU with LoRA. Fine-tuning on a single font typically takes less than an hour on one H100. The example below uses JiT-B/16 with batch size 16, which requires roughly 4 GB of VRAM.
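Conceptually, LoRA freezes each targeted weight matrix and learns a low-rank additive update scaled by `alpha / r`, which is why capacity and memory grow with `--lora_r`. A toy numpy sketch of the adapted forward pass, with illustrative shapes (not the repository's actual layer code):

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, d_out, r, alpha = 768, 768, 32, 32    # mirrors --lora_r 32 --lora_alpha 32

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight (e.g. a qkv projection)
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus low-rank update, scaled by alpha / r.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((16, d_in))         # a batch of 16 token vectors
# With B zero-initialized, training starts exactly at the pretrained model.
assert np.allclose(lora_forward(x), x @ W.T)
```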
```
python lora_single_gpu_finetune_jit.py \
    --data_path data/sample_dataset/train/ \
    --test_npz_path data/sample_dataset/test.npz \
    --output_dir run/lora_ft_sample_single/ \
    --base_checkpoint models/zi2zi-JiT-B-16.pth \
    --model JiT-B/16 \
    --num_fonts 1000 \
    --num_chars 20000 \
    --max_chars_per_font 200 \
    --img_size 256 \
    --lora_r 32 \
    --lora_alpha 32 \
    --lora_targets "qkv,proj,w12,w3" \
    --epochs 200 \
    --batch_size 16 \
    --blr 8e-4 \
    --warmup_epochs 1 \
    --save_last_freq 10 \
    --proj_dropout 0.1 \
    --P_mean -0.8 \
    --P_std 0.8 \
    --noise_scale 1.0 \
    --cfg 2.6 \
    --online_eval \
    --eval_step_folders \
    --eval_freq 10 \
    --gen_bsz 16 \
    --num_images 400 \
    --seed 42
```

Key parameters:
| Parameter | Note |
|---|---|
| `--num_fonts`, `--num_chars` | Tied to the pretrained model's embedding size. Do not change unless pretraining from scratch. |
| `--max_chars_per_font` | Caps the number of characters used from each font. |
| `--lora_r`, `--lora_alpha` | LoRA capacity. Higher values give more capacity at the cost of memory. |
| `--batch_size` | 16 uses ~4 GB VRAM. |
| `--cfg` | Conditioning strength. Use 2.6 for JiT-B/16, 2.4 for JiT-L/16. |
Generate characters from a fine-tuned checkpoint:

```
python generate_chars.py \
    --checkpoint run/lora_ft_sample_single/checkpoint-last.pth \
    --test_npz data/sample_dataset/test.npz \
    --output_dir run/generated_chars/
```

Notes for `generate_chars.py`:
- Supported samplers are `euler`, `heun`, and `ab2`.
- If `--num_sampling_steps` is not set, the script uses method-specific defaults: `euler -> 20`, `heun -> 50`, `ab2 -> 20`.
- If neither `--sampling_method` nor `--num_sampling_steps` is overridden, the script keeps the checkpoint's saved inference settings.
- Current recommended fast setting: `--sampling_method ab2 --cfg 2.6`, letting the default 20 steps apply. `heun-50` is kept as a conservative legacy/reference baseline. In the current 50-sample MPS benchmark, `ab2-20` and `euler-20` were both faster and scored better than `heun-50` on SSIM, LPIPS, and L1.
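The three samplers correspond to standard ODE integrators, which explains the benchmark above: `euler` costs one model evaluation per step, `heun` adds a corrector (two evaluations per step), and `ab2` (Adams-Bashforth 2) reuses the previous evaluation to reach second-order accuracy at Euler's per-step cost. A toy illustration on dx/dt = -x, not the repository's actual sampling loop:

```python
import math

def f(t, x):
    return -x          # stand-in for the learned drift; dx/dt = -x

def integrate(method, x0, n_steps):
    h = 1.0 / n_steps
    x, t = x0, 0.0
    f_prev = f(t, x)   # AB2 bootstrap: seed the history with the initial slope
    for _ in range(n_steps):
        if method == "euler":
            x = x + h * f(t, x)                      # one evaluation
        elif method == "heun":
            k1 = f(t, x)                             # predictor slope
            k2 = f(t + h, x + h * k1)                # corrector slope
            x = x + 0.5 * h * (k1 + k2)
        elif method == "ab2":
            k = f(t, x)                              # one new evaluation
            x = x + h * (1.5 * k - 0.5 * f_prev)     # reuse the previous one
            f_prev = k
        t += h
    return x

exact = math.exp(-1.0)
# Both second-order methods beat Euler at the same step count.
assert abs(integrate("heun", 1.0, 20) - exact) < abs(integrate("euler", 1.0, 20) - exact)
assert abs(integrate("ab2", 1.0, 20) - exact) < abs(integrate("euler", 1.0, 20) - exact)
```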
Example fast generation command:

```
python generate_chars.py \
    --checkpoint run/lora_ft_sample_single/checkpoint-last.pth \
    --test_npz data/sample_dataset/test.npz \
    --output_dir run/generated_chars_ab2/ \
    --sampling_method ab2
```

Compute pairwise metrics (SSIM, LPIPS, L1, FID) on the generated comparison grids:
```
python scripts/compute_pairwise_metrics.py \
    --device cuda \
    run/lora_ft_sample_single/heun-steps50-cfg2.6-interval0.0-1.0-image400-res256/step_10/compare/
```

Fonts created with zi2zi-JiT:
Ground truth on the left, generated output on the right.
Code is licensed under MIT. Generated font outputs are additionally subject to the "Font Artifact License Addendum" in LICENSE:
- Commercial use is allowed.
- Attribution is required when distributing a font product that uses more than 200 characters created from repository artifacts.
This project builds on code and ideas from:
- FontDiffuser — content/style encoder design and evaluation protocol
- JiT — base diffusion transformer architecture
```bibtex
@article{zi2zi-jit,
  title  = {zi2zi-JiT: Font Synthesis with Pixel Space Diffusion Transformers},
  author = {Yuchen Tian},
  year   = {2026},
  url    = {https://github.com/kaonashi-tyc/zi2zi-jit}
}
```
