Show HN: Autofit2 – End-to-end pipeline for multilingual text classification

Original link: https://github.com/neospe/autofit2

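This automated pipeline provides a scalable few-shot text classification framework covering more than 50 languages. Built on SetFit and SBERT embeddings, it reaches 95–99% precision with only a few dozen labeled examples. Key features:

* **End-to-end automation:** The whole workflow, from preprocessing and fine-tuning to evaluation and deployment, is driven by a single JSON configuration file.
* **Reproducibility:** The system generates a comprehensive model card, including CO₂ emission tracking, and supports resuming interrupted runs.
* **Flexible targets:** Each configuration block names a `base` model to fine-tune and one or more targets: `all` for production models trained on the full dataset, `custom` for custom models, and `benchmark` for evaluation-only runs.
* **Data handling:** The `loader` field supports custom data ingestion, and the train/test split is chosen automatically from the target type.

The pipeline is designed for efficiency, letting researchers and developers deploy multilingual models quickly while keeping reporting transparent and performance metrics consistent. Whether the task is sentiment analysis or content moderation, it reduces a complex NLP workflow to a compact, configuration-driven process.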

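Stefan has open-sourced **Autofit2**, an end-to-end pipeline for lightweight multilingual text classification. It was originally developed for automated text moderation but applies to a wide range of document classification tasks, and it has proven effective in more than 20 languages.

Autofit2 is built on Sentence Transformers and uses **SetFit**, a few-shot learning technique. It keeps performance high even with limited training data while offering high throughput on CPU. The workflow is simple: supply a base model and a JSON configuration file, and the pipeline produces a ready-to-deploy TorchServe model archive. It also generates a detailed model card that includes:

* Task benchmarks and self-consistency tests.
* Estimated CO₂ emissions from fine-tuning.
* Entropy-based bias analysis, supported by the included 50-language test corpus.

The project works best with the developer's custom "EAR" (entropy-based attention regularization) fork of Sentence Transformers. The code is available on GitHub, and the author is looking for community feedback.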

Original text

Few-shot text classification: a massively multilingual (50+ languages), fully automated pipeline built on SetFit and SBERT embeddings.

Few-Shot Learning: High precision (95–99%) with a few dozen labeled examples.
Multilingual Support: Pretrained models for 20 languages; evaluation corpora for 50+. Scalable to 100+ via Common Crawl.
Automated Pipeline: End-to-end preprocessing, fine-tuning, evaluation, and deployment from a single JSON config.
Reproducibility & Transparency: JSON-based configuration, model card generation, and CO₂ emission tracking.

1. Prepare Data

Use dataload or implement a custom loader providing labeled examples.

2. Configure

Create myproject.json specifying dataset paths, model settings, and output directories. Supports multi-language/task blocks.

3. Run

The pipeline supports resumable execution.

python train.py myproject.json

4. Output

Deployable model archive.
Generated model card (training details, intended use, performance metrics, bias evaluation).

myproject.json defines the training parameters. Its structure depends on the target type: Base Models (all) or Custom Models (custom).

{
  "<task-key>": {
    "<language-key>": {
      "base": {
        "model file": "<path>",          // Relative path, no trailing slash (e.g. "models-in/all-MiniLM-L6-v2")
        "model type": "<string>",        // e.g., "bert"
        "pretraining task": "<string>",  // e.g., "sentence similarity"
        "downstream task": "<string>"    // e.g., "binary text classification"
      },
      "targets": {
        "<id-key>": { ... }             // See Target Options below
      }
    }
  }
}
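
The nesting above (task, then language, then `base` and `targets`) can be checked before a long training run. The following is a minimal sketch, not part of Autofit2: `check_config` is a hypothetical helper, and it assumes that target keys are literally `all`, `custom`, or `benchmark N`, and that the config has already been parsed into a Python dict (the `//` comments in the template are not valid JSON).

```python
import re

# Keys the README lists for every "base" block.
BASE_KEYS = {"model file", "model type", "pretraining task", "downstream task"}
# Assumed target names: "all", "custom", or "benchmark 1..N".
TARGET_NAME = re.compile(r"^(all|custom|benchmark [1-9]\d*)$")


def check_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for task, languages in config.items():
        for lang, block in languages.items():
            where = f"{task}/{lang}"
            base = block.get("base", {})
            targets = block.get("targets", {})
            missing = BASE_KEYS - set(base)
            if missing:
                problems.append(f"{where}: base is missing {sorted(missing)}")
            # The README asks for a relative path without a trailing slash.
            if base.get("model file", "").endswith("/"):
                problems.append(f"{where}: 'model file' has a trailing slash")
            for name in targets:
                if not TARGET_NAME.match(name):
                    problems.append(f"{where}: unknown target {name!r}")
            # Benchmarks only produce output alongside an "all" target.
            has_benchmark = any(n.startswith("benchmark ") for n in targets)
            if has_benchmark and "all" not in targets:
                problems.append(f"{where}: benchmark target without 'all'")
    return problems
```

Calling `check_config(json.load(open("myproject.json")))` on a comment-free config would return an empty list when the block is well formed.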

The "targets" dictionary supports three specific key types:

all (Base Model)
- Generates a full set of artifacts: model folder, archive, and card.
- Model ID: Derived from the config filename ({config_name}-{task}-{lang}). The config filename must be stable.
custom (Custom Model)
- Generates a full set of artifacts: model folder, archive, and card.
- Model ID: Can be auto-generated as a 14–16 character lowercase alphanumeric string.
benchmark 1..N (Benchmarking Only)
- Does not generate model artifacts.
- Outputs only score logs.
- Must be used in conjunction with an all target to produce output.
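
The two model ID schemes can be illustrated with a short sketch. Both functions are hypothetical and not taken from the Autofit2 code; the sketch assumes that `{config_name}` means the config filename without its extension and that custom IDs are drawn uniformly at random.

```python
import secrets
import string
from pathlib import Path

ALPHABET = string.ascii_lowercase + string.digits


def base_model_id(config_path: str, task: str, lang: str) -> str:
    """Build '{config_name}-{task}-{lang}', taking config_name as the file stem."""
    return f"{Path(config_path).stem}-{task}-{lang}"


def random_custom_id() -> str:
    """Return a random lowercase alphanumeric ID of 14 to 16 characters."""
    length = 14 + secrets.randbelow(3)  # 14, 15, or 16
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Under these assumptions, `base_model_id("myproject.json", "mod", "el")` yields `myproject-mod-el`, which shows why renaming the config file would change the ID of an `all` model.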

Each entry in the "targets" dictionary supports the following keys:

| Key | Type | Description |
|---|---|---|
| `description` | `string` | Free-form description of the target. |
| `link` | `string` | URL to source data or documentation. |
| `train embedding` | `bool` | Set to `true` to fine-tune embeddings during training. |
| `base clf` | `string` | ID string pointing to a `.joblib` file located in `BASE_PATH`. Must match exactly. |
| `sample ratio` | `float` | Random sample of total data for full training (e.g., `0.5` = 50%). |
| `embedding sample ratio` | `float` | Random sample of data used only for embedding fine-tuning (e.g., `0.1` = 10%). |
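
To make the two ratio keys concrete, here is a minimal sketch of ratio-based subsampling. `sample_rows` is a hypothetical helper; how Autofit2 actually samples (seeding, stratification, rounding) is not stated in the README, so a seeded uniform sample is assumed.

```python
import random


def sample_rows(rows: list[dict], ratio: float, seed: int = 0) -> list[dict]:
    """Return a random subset holding roughly `ratio` of the rows."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    # Keep at least one row when the input is non-empty.
    k = max(1, round(len(rows) * ratio)) if rows else 0
    return random.Random(seed).sample(rows, k)
```

With 100 labeled rows, a `sample ratio` of `0.5` would keep 50 of them and an `embedding sample ratio` of `0.1` would keep 10.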

The "loader" field defines how data is ingested and transformed. It expects a list of commands (functions or transformations):

"loader": ["command_1", "command_2"]

Command Definition: Each command must return a list of dictionaries with keys text and label. Commands can be raw loader functions or wrapped transformations (e.g., list comprehensions, lambdas).
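
As an illustration of that contract, the sketch below reads a tab-separated file into the expected list of dictionaries. `tsv_loader` and its column names are hypothetical; how a custom loader is registered so that it can be named in the "loader" list is not covered here.

```python
import csv
import io


def tsv_loader(stream, text_col="text", label_col="label"):
    """Read tab-separated rows and return [{'text': ..., 'label': ...}, ...]."""
    reader = csv.DictReader(stream, delimiter="\t")
    return [{"text": row[text_col], "label": row[label_col]} for row in reader]


# An in-memory stand-in for a real .tsv file.
sample = io.StringIO("text\tlabel\nhello world\t0\nyou are awful\t1\n")
examples = tsv_loader(sample)
```

Labels stay as strings in this sketch; a real loader might map them to integers or class names, depending on what the pipeline expects.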
Data Splitting Logic:
- If there are 2 commands AND target != all:
  - Command 1 → Training Data
  - Command 2 → Evaluation Data
- Else if target = all:
  - All commands are concatenated into a single dataset.
  - Split: 100/100 (no split; the entire set is used for training).
- Else (other targets, e.g., custom or benchmarks with 1 command):
  - All commands are concatenated into a single dataset.
  - Split: 70/30 (Train/Test).
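
The three cases can be written out as a small function. This is a sketch of the rules as described, not the pipeline's own code: `split_for_target` is a hypothetical name, the 70/30 split is assumed to be a seeded shuffle, and "100/100" is read as the same full dataset appearing on both sides.

```python
import random


def split_for_target(target: str, command_outputs: list[list[dict]], seed: int = 0):
    """Return (train, test) following the three cases described above."""
    if len(command_outputs) == 2 and target != "all":
        # Command 1 -> training data, command 2 -> evaluation data.
        return command_outputs[0], command_outputs[1]
    # Every other case starts by concatenating all command outputs.
    merged = [row for output in command_outputs for row in output]
    if target == "all":
        # Assumption: "100/100" means no split, so both sides are the full set.
        return merged, merged
    # Other targets: 70/30 train/test split (shuffle strategy is an assumption).
    shuffled = merged[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]
```

In the Greek example, `benchmark 1` has two loader commands and is not `all`, so it would fall into the first case, while the `all` target trains on everything `el_offense20()` returns.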

{
  "mod": {
    "el": {
      "base": {
        "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
        "model type": "bert",
        "pretraining task": "sentence similarity",
        "downstream task": "binary text classification"
      },
      "targets": {
        "benchmark 1": {
          "description": "Pitenis et al. - Offensive Language Identification in Greek",
          "link": "https://arxiv.org/abs/2003.07459",
          "loader": [
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
          ]
        },
        "all": {
          "loader": [
            "el_offense20()"
          ]
        }
      }
    }
  }
}

Breakdown: Finetuning a Sentence Transformer

To fine-tune a base model for a specific task and language, define a config block like the one below. This example sets up a text moderation (mod) pipeline for Greek (el) using a multilingual sentence transformer.

Base Model Setup

"base": {
  "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
  "model type": "bert",
  "pretraining task": "sentence similarity",
  "downstream task": "binary text classification"
}

Model file: Path to the pretrained transformer.
Model type: Architecture type (e.g., BERT).
Pretraining task: Original task the model was trained on.
Downstream task: Task you're adapting it to (e.g., moderation, sentiment analysis).

Targets

You can specify multiple finetuning targets. Each target defines a dataset and training strategy.

benchmark 1

"benchmark 1": {
  "description": "Pitenis et al. - Offensive Language Identification in Greek",
  "link": "https://arxiv.org/abs/2003.07459",
  "loader": [
    "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
    "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
  ]
}

Uses a train/test split for evaluation.
Based on a published benchmark dataset.

all

"all": {
  "loader": ["el_offense20()"]
}

Uses the full dataset for training.
No explicit evaluation—this is for production-grade finetuning.
