Show HN: Autofit2 – End-to-end pipeline for multilingual text classification

Original link: https://github.com/neospe/autofit2

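This automated pipeline provides a scalable few-shot text classification framework that supports more than 50 languages. Built on SetFit and SBERT embeddings, it reaches high precision (95–99%) with only a few dozen labeled examples. Key features:

  • End-to-end automation: The whole workflow (preprocessing, fine-tuning, evaluation, and deployment) is driven by a single JSON config file.
  • Reproducibility: The system generates comprehensive model cards, including CO₂ emission tracking, and supports resuming interrupted runs.
  • Flexible deployment: Each target is configured as all (base model), custom (custom model), or benchmark (benchmarking only, no model artifacts).
  • Data handling: The "loader" field supports custom data ingestion, with built-in train/test splitting logic that depends on the selected target type.

The pipeline is designed for efficiency: researchers and developers can deploy multilingual models quickly while keeping reporting transparent and performance metrics consistent. Whether fine-tuning for sentiment analysis or for content moderation, it reduces a complex NLP workflow to a lean, config-driven process.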

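Stefan has open-sourced Autofit2, an end-to-end pipeline for lightweight multilingual text classification. It was originally developed for automated text moderation but applies to a wide range of document classification tasks, and it has proven effective in more than 20 languages. Autofit2 is built on Sentence Transformers and uses SetFit, a few-shot learning technique. It maintains high performance even with limited training data while offering high throughput on CPU.

The pipeline simplifies the workflow: given a base model and a JSON config file, it produces a TorchServe model archive that is ready to deploy. It also automatically creates a detailed model card covering:

  • Task benchmarks and self-consistency tests.
  • Estimated CO₂ emissions during fine-tuning.
  • Entropy-based bias analysis (backed by the included test corpora for 50 languages).

The project works best with the developer's custom "EAR" (entropy-based attention regularization) fork of Sentence Transformers. The code is published on GitHub, and the author is looking for community feedback.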
Original text

Few-shot text classification: a massively multilingual (50+ languages), fully automated pipeline built on SetFit and SBERT embeddings.

  • Few-Shot Learning: High precision (95–99%) with a few dozen labeled examples.
  • Multilingual Support: Pretrained models for 20 languages; evaluation corpora for 50+. Scalable to 100+ via Common Crawl.
  • Automated Pipeline: End-to-end preprocessing, fine-tuning, evaluation, and deployment from a single JSON config.
  • Reproducibility & Transparency: JSON-based configuration, model card generation, and CO₂ emission tracking.

1. Prepare Data

Use dataload or implement a custom loader providing labeled examples.

2. Configure

Create myproject.json specifying dataset paths, model settings, and output directories. Supports multi-language/task blocks.

3. Run

The pipeline supports resumable execution.

python train.py myproject.json

4. Output

  • Deployable model archive.
  • Generated model card (training details, intended use, performance metrics, bias evaluation).
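
For step 1, a custom loader could look roughly like the sketch below. It assumes a tab-separated file with a header row; the function name load_tsv, the column names, and the integer labels are illustrative and not part of Autofit2. The only requirement taken from this README is the return shape: a list of dictionaries with keys text and label.

```python
import csv
import io


def load_tsv(source, text_column="text", label_column="label"):
    """Read a tab-separated file-like object into [{"text": ..., "label": ...}, ...].

    Hypothetical helper: the column names and the int() label conversion are
    assumptions made for this sketch, not Autofit2 conventions.
    """
    reader = csv.DictReader(source, delimiter="\t")
    examples = []
    for row in reader:
        text = (row.get(text_column) or "").strip()
        if not text:
            continue  # skip rows that carry no usable text
        examples.append({"text": text, "label": int(row[label_column])})
    return examples


# In-memory stand-in for a real file opened with open(path, newline="", encoding="utf-8").
sample = io.StringIO("text\tlabel\nhello world\t0\nyou are awful\t1\n\t0\n")
examples = load_tsv(sample)
print(examples)
```

The last row of the sample has an empty text field and is dropped, so two examples remain.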

myproject.json defines the training parameters. Its structure depends on the target type: Base Models (all) or Custom Models (custom).

{
  "<task-key>": {
    "<language-key>": {
      "base": {
        "model file": "<path>",          // Relative path, no trailing slash (e.g. "models-in/all-MiniLM-L6-v2")
        "model type": "<string>",          // e.g., "bert"
        "pretraining task": "<string>",  // e.g., "sentence similarity"
        "downstream task": "<string>"    // e.g., "binary text classification"
      },
      "targets": {
        "<id-key>": { ... }             // See Target Options below
      }
    }
  }
}
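
Before starting a run, it can help to check that a config has the nested task → language → base/targets shape shown above. The following is a minimal sketch, not Autofit2's own validation: check_config is a hypothetical helper, and treating all four "base" keys as required is an assumption.

```python
REQUIRED_BASE_KEYS = {"model file", "model type", "pretraining task", "downstream task"}


def check_config(config):
    """Return a list of problems found; an empty list means the shape looks right.

    Assumption: every "base" block needs all four keys from the template.
    """
    problems = []
    for task, languages in config.items():
        for lang, block in languages.items():
            where = f"{task}/{lang}"
            base = block.get("base", {})
            missing = REQUIRED_BASE_KEYS - set(base)
            if missing:
                problems.append(f"{where}: base is missing {sorted(missing)}")
            # The template asks for a relative path with no trailing slash.
            if str(base.get("model file", "")).endswith("/"):
                problems.append(f"{where}: 'model file' must not end with a slash")
            if not block.get("targets"):
                problems.append(f"{where}: no targets defined")
    return problems


good = {
    "mod": {
        "el": {
            "base": {
                "model file": "models-in/all-MiniLM-L6-v2",
                "model type": "bert",
                "pretraining task": "sentence similarity",
                "downstream task": "binary text classification",
            },
            "targets": {"all": {"loader": ["el_offense20()"]}},
        }
    }
}
bad = {"mod": {"el": {"base": {"model file": "models-in/all-MiniLM-L6-v2/"}, "targets": {}}}}

print(check_config(good))
print(check_config(bad))
```

Note that the // comments in the template are annotations only; a file parsed with Python's json module must not contain them.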

The "targets" dictionary supports three specific key types:

  1. all (Base Model)
    • Generates a full set of artifacts: model folder, archive, and card.
    • Model ID: Derived from the config filename ({config_name}-{task}-{lang}). The config filename must be stable.
  2. custom (Custom Model)
    • Generates a full set of artifacts: model folder, archive, and card.
    • Model ID: Can be auto-generated as a 14–16 character lowercase alphanumeric string.
  3. benchmark 1..N (Benchmarking Only)
    • Does not generate model artifacts.
    • Outputs only score logs.
    • Must be used in conjunction with an all target to produce output.
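
The target rules above can be sketched in a few lines of Python. All function names here are hypothetical, and two details are assumptions rather than documented behavior: that {config_name} means the config filename without its extension, and that a random ID is drawn uniformly from lowercase letters and digits.

```python
import re
import secrets
import string
from pathlib import Path


def target_kind(key):
    """Classify a "targets" key as "all", "custom", or "benchmark"."""
    if key in ("all", "custom"):
        return key
    if re.fullmatch(r"benchmark [1-9][0-9]*", key):
        return "benchmark"
    raise ValueError(f"unknown target key: {key!r}")


def base_model_id(config_path, task, lang):
    """Build {config_name}-{task}-{lang}; uses the filename stem (assumption)."""
    return f"{Path(config_path).stem}-{task}-{lang}"


def random_model_id(length=16):
    """Generate a 14-16 character lowercase alphanumeric ID for a custom model."""
    if not 14 <= length <= 16:
        raise ValueError("length must be between 14 and 16")
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def benchmarks_without_all(targets):
    """Return benchmark keys that have no accompanying "all" target."""
    kinds = {key: target_kind(key) for key in targets}
    if "all" in kinds.values():
        return []
    return [key for key, kind in kinds.items() if kind == "benchmark"]


print(base_model_id("configs/myproject.json", "mod", "el"))
print(benchmarks_without_all({"benchmark 1": {}}))
```

Because the base model ID is derived from the config filename, renaming the config changes the ID, which is why the filename must be stable.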

Each entry in the "targets" dictionary supports the following keys:

| Key                    | Type   | Description                                                                      |
|------------------------|--------|----------------------------------------------------------------------------------|
| description            | string | Free-form description of the target.                                             |
| link                   | string | URL to source data or documentation.                                             |
| train embedding        | bool   | Set to true to fine-tune embeddings during training.                             |
| base clf               | string | ID string pointing to a .joblib file located in BASE_PATH. Must match exactly.   |
| sample ratio           | float  | Random sample of total data for full training (e.g., 0.5 = 50%).                 |
| embedding sample ratio | float  | Random sample of data used only for embedding fine-tuning (e.g., 0.1 = 10%).     |
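
To illustrate what the two ratio keys mean, here is a minimal sketch of ratio-based subsampling. The helper sample_by_ratio, its seed parameter, and the rounding rule are assumptions for illustration; how Autofit2 draws its samples internally is not specified here.

```python
import random


def sample_by_ratio(examples, ratio, seed=0):
    """Return a random subset holding roughly `ratio` of the examples.

    Hypothetical helper: keeps at least one example and uses a fixed seed so
    that repeated runs draw the same subset.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if not examples:
        return []
    k = max(1, round(len(examples) * ratio))
    return random.Random(seed).sample(examples, k)


data = [{"text": f"example {i}", "label": i % 2} for i in range(10)]

full_training = sample_by_ratio(data, 0.5)       # "sample ratio": 0.5
embedding_tuning = sample_by_ratio(data, 0.1)    # "embedding sample ratio": 0.1
print(len(full_training), len(embedding_tuning))
```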

The "loader" field defines how data is ingested and transformed. It expects a list of commands (functions or transformations):

"loader": ["command_1", "command_2"]
  • Command Definition: Each command must return a list of dictionaries with keys text and label. Commands can be raw loader functions or wrapped transformations (e.g., list comprehensions, lambdas).
  • Data Splitting Logic:
    • If there are exactly 2 commands AND target != all:
      • Command 1 → Training Data
      • Command 2 → Evaluation Data
    • Else if target = all:
      • All commands are concatenated into a single dataset.
      • Split: 100/100 (No split; entire set used for training).
    • Else (other targets, e.g., custom or benchmark with 1 command):
      • All commands are concatenated into a single dataset.
      • Split: 70/30 (Train/Test).

Example configuration:

{
  "mod": {
    "el": {
      "base": {
        "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
        "model type": "bert",
        "pretraining task": "sentence similarity",
        "downstream task": "binary text classification"
      },
      "targets": {
        "benchmark 1": {
          "description": "Pitenis et al. - Offensive Language Identification in Greek",
          "link": "https://arxiv.org/abs/2003.07459",
          "loader": [
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
          ]
        },
        "all": {
          "loader": [
            "el_offense20()"
          ]
        }
      }
    }
  }
}
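
The data splitting rules from the loader section can be sketched as a single function and applied to the two targets in the example configuration above. This is an illustration under stated assumptions, not the pipeline's code: split_examples is a hypothetical name, the shuffle before the 70/30 cut is assumed, and "100/100" is read as the full set appearing on both sides.

```python
import random


def split_examples(target, command_outputs, test_fraction=0.3, seed=0):
    """Return (train, evaluation) lists for one target.

    command_outputs holds one list of {"text", "label"} dicts per loader command.
    """
    # Exactly two commands on a non-"all" target: explicit train and evaluation sets.
    if len(command_outputs) == 2 and target != "all":
        train, evaluation = command_outputs
        return list(train), list(evaluation)

    merged = [example for output in command_outputs for example in output]

    # Target "all": no split. Returning the full set twice is an assumption
    # about what "100/100" means.
    if target == "all":
        return merged, list(merged)

    # Any other case: concatenate, shuffle (assumption), then cut 70/30.
    shuffled = merged[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


train_file = [{"text": f"train {i}", "label": 0} for i in range(4)]
test_file = [{"text": f"test {i}", "label": 1} for i in range(2)]
everything = train_file + test_file + [{"text": f"extra {i}", "label": 0} for i in range(4)]

bench_train, bench_eval = split_examples("benchmark 1", [train_file, test_file])
all_train, all_eval = split_examples("all", [everything])
custom_train, custom_eval = split_examples("custom", [everything])
print(len(bench_train), len(bench_eval))
print(len(all_train), len(all_eval))
print(len(custom_train), len(custom_eval))
```

In the example configuration, "benchmark 1" lists two commands and so takes the first branch, while "all" lists one command and trains on the whole dataset.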

Breakdown: Fine-tuning a Sentence Transformer

To fine-tune a base model for a specific task and language, define a config block like the one below. This example sets up a text moderation (mod) pipeline for Greek (el) using a multilingual sentence transformer.

Base Model Setup

"base": {
  "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
  "model type": "bert",
  "pretraining task": "sentence similarity",
  "downstream task": "binary text classification"
}
  • Model file: Path to the pretrained transformer.
  • Model type: Architecture type (e.g., BERT).
  • Pretraining task: Original task the model was trained on.
  • Downstream task: Task you're adapting it to (e.g., moderation, sentiment analysis).

Targets

You can specify multiple fine-tuning targets. Each target defines a dataset and training strategy.

  1. benchmark 1
"benchmark 1": {
  "description": "Pitenis et al. - Offensive Language Identification in Greek",
  "link": "https://arxiv.org/abs/2003.07459",
  "loader": [
    "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
    "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
  ]
}
  • Uses a train/test split for evaluation.
  • Based on a published benchmark dataset.
  2. all
"all": {
  "loader": ["el_offense20()"]
}
  • Uses the full dataset for training.
  • No explicit evaluation; this target is for production-grade fine-tuning.