Steering interpretable language models with concept algebra

Original link: https://www.guidelabs.ai/post/steerling-steering-8b/

## Steerling-8B: Direct Control via Concept Algebra

Giang Nguyen introduces Steerling-8B, an 8-billion-parameter language model designed for direct, controllable text generation *without* retraining or complex prompt engineering. Unlike existing approaches, Steerling-8B relies on a "concept module", an architectural bottleneck that forces every prediction through human-interpretable concepts, which in turn enables "concept algebra": users can inject, suppress, or compose concepts *at inference time* to steer the model's output. The post demonstrates three key capabilities: **concept injection** (redirecting a neutral prompt into a specific domain), **concept suppression** (removing unwanted concepts from the output), and **multi-concept steering** (composing concepts for fine-grained control). An evaluation over 2,000 samples shows that steering achieves a high concept score (0.783) while preserving 84% of baseline text quality, as evidenced by a harmonic mean of 0.997. The approach differs from prompting, fine-tuning, and post-hoc interpretability methods: because the model itself is interpretable and its architecture linear, control is reliable and composable. Steerling-8B represents a shift toward models designed for fine-grained control, opening up applications such as content moderation and specialized assistance.

From the Hacker News thread, giang_at_glai (the author) writes: "Author here. This post demonstrates 'concept algebra' on language models: injecting, suppressing, and composing human-understandable concepts at inference time (no retraining, no prompt engineering). There is an interactive demo in the post. I would appreciate feedback on: (1) which steering tasks you would pick for benchmarking, (2) which failure cases you would like to see, and (3) whether this kind of compositional control would help in real products. Related link: https://news.ycombinator.com/item?id=47131225"

Original article

Author: Giang Nguyen, Research Scientist

Published: February 25, 2026

We show that Steerling-8B enables concept algebra: you can add, remove, and compose human-understandable concepts at inference time to directly control what the model generates, without retraining or prompt engineering.

Concept Algebra with Steerling-8B

What if you could directly edit the internal representations of a model towards any concept you care about, without changing the prompt? Steerling-8B’s architecture natively supports injecting and suppressing any concept the model has learned, directly at inference time.

In multi-turn dialog settings, steering one concept at a time is insufficient. You need compositional control, not just on a neutral prompt, but on a conversation that is already shaped by prior context. Consider a content moderation system that must suppress toxicity yet preserve fluency, or a health assistant that needs to provide medical guidance while navigating the legal ramifications of its advice.

The demonstration below shows how Steerling-8B enables exactly this capability with concept algebra.

[Interactive demo: the prompt "Things to do before you start: 1." with a control to inject the Tenant-landlord Legal Relations concept.]

Current LLMs are not built to be reliably steered

Current methods for controlling language model behavior are blunt instruments.

Prompting is accessible but often unreliable. System prompts can be overridden through adversarial inputs. Few-shot examples consume context and don’t reliably generalize. More critically, prompting doesn’t reveal which internal mechanisms drove the result, so if your goal changes, nothing from one session transfers to the next.

Fine-tuning methods offer more control but at high cost. Fine-tuning modifies weights globally: suppressing one behavior can silently degrade others. Standard reinforcement learning based post-training reshapes the entire output distribution to satisfy a scalar reward signal. Even modest behavioral changes can require thousands of labeled examples, and both approaches demand full retraining for every new steering objective.

Post-hoc interpretability methods steer fragile artifacts. SAEs, linear probes, and activation patching attempt to discover controllable concepts in a model that might never have had them to begin with. Probes can detect information in representations without confirming the model uses that information for generation. Activation patching offers no compositionality guarantees: patching directions A and B simultaneously may not produce the sum of their effects.

At Guide Labs, we believe that if you want reliable, composable, fine-grained control, the model has to be designed for it.

From Explanation to Control

In our previous post, we introduced the concept module: an architectural bottleneck that forces every prediction through human-interpretable concepts.

The concept module gives us something that black-box models lack: a clean, algebraic handle on the internal variables that drive generation. Every output logit is a linear function of concept activations and concept embeddings. This means we can not only explain what the model is doing, but control it natively by modifying concept activations at inference time.
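To make this linearity concrete, here is a minimal numpy sketch. All names, shapes, and the `unembedding` map are illustrative assumptions, not Steerling-8B's actual architecture; the point is that when logits are a linear function of concept activations, adding `alpha` to one concept's activation shifts the logits by exactly `alpha` times that concept's contribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (not Steerling-8B's real) dimensions.
n_concepts, d_model, vocab_size = 4, 8, 16

activations = rng.normal(size=n_concepts)                 # concept bottleneck output
concept_embeddings = rng.normal(size=(n_concepts, d_model))
unembedding = rng.normal(size=(d_model, vocab_size))      # assumed final linear map

def logits_from_concepts(acts):
    # Logits are a linear function of concept activations and embeddings.
    return acts @ concept_embeddings @ unembedding

base = logits_from_concepts(activations)

# Steer by adding alpha to concept k's activation.
alpha, k = 2.0, 1
steered_acts = activations.copy()
steered_acts[k] += alpha

# Linearity means the logit shift is exactly alpha * (concept k's direction).
delta = logits_from_concepts(steered_acts) - base
assert np.allclose(delta, alpha * (concept_embeddings[k] @ unembedding))
```

This predictability is what a nonlinear path from concepts to logits would not give you: there, the effect of an intervention would depend on everything else in the residual stream.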

To make this control reliable for diffusion decoding, we use mask-aligned injection: injecting concept embeddings only into currently masked (undecided) positions, matching the training distribution and naturally annealing as positions become unmasked to preserve text quality.
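A rough sketch of what mask-aligned injection could look like (the function, shapes, and decoding state are our assumptions; Steerling-8B's actual diffusion decoding loop is not shown). The concept embedding is added only at still-masked positions, so already-decoded text is untouched:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model = 6, 8

hidden = rng.normal(size=(seq_len, d_model))        # per-position decoder states
concept_embedding = rng.normal(size=d_model)        # embedding of the target concept
masked = np.array([False, True, True, False, True, False])  # True = not yet decoded

def mask_aligned_inject(hidden, concept_embedding, masked, scale=1.0):
    # Add the concept embedding only at masked (undecided) positions;
    # already-decoded positions are left untouched, so the intervention
    # naturally anneals away as the mask shrinks during decoding.
    out = hidden.copy()
    out[masked] += scale * concept_embedding
    return out

steered = mask_aligned_inject(hidden, concept_embedding, masked)
assert np.allclose(steered[~masked], hidden[~masked])   # decoded text preserved
```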

This post demonstrates that control in practice. We show three capabilities:

  1. Concept injection: steering a generic prompt toward any target domain
  2. Concept suppression: unlearning a concept the model would otherwise express
  3. Multi-concept steering: performing concept algebra on multiple concepts simultaneously

All examples are generated by Steerling-8B, our 8B-parameter, inherently interpretable diffusion language model. Note that Steerling-8B is a base model, not an instruction-tuned model.

Concept Injection: One Prompt, Five Destinations

The most common demonstration of steering is taking a single, domain-neutral prompt and showing how different concept injections redirect the output into entirely different domains, with no changes to the prompt itself.

This prompt contains no domain keywords. It could continue about anything. Below, we show the unsteered baseline followed by the same prompt steered toward five different concepts.

Input Prompt

Things to know before you start:

1.

Injected concept: none (unsteered baseline)

Generated Output

You will need to create a new user account on your Mac. 2. You may need to update your computer's software in order for the game to work properly. 3. Make sure that you have enough storage space available on your hard drive before installing the game. 4. The installation process can take some time, so be patient! 5. Once the game is installed, you should be able to play it right away without any issues How do I install pc games on my mac? There are a few different ways to install PC games on your mac. One way is to use an app like Steam or Origin. Another

Analysis

Without steering, the model continues into a generic software installation guide.

Concept Suppression: Unlearning at Inference Time

Steering is not just about adding concepts; it can also remove them. The concept module enables a distinct mechanism for this: bottleneck intervention, which goes directly to the concept activation layer and wipes out a specific concept’s contribution before it can influence generation.
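A minimal sketch of such a bottleneck intervention, under the assumption that concepts map linearly to logits (names and sizes are illustrative): zeroing a concept's activation removes exactly that concept's contribution from every logit.

```python
import numpy as np

rng = np.random.default_rng(2)
n_concepts, vocab_size = 5, 12                       # illustrative sizes

activations = rng.normal(size=n_concepts)
readout = rng.normal(size=(n_concepts, vocab_size))  # assumed concept -> logit map

def suppress(acts, concept_id):
    # Bottleneck intervention: zero the concept's activation before it
    # reaches the decoder, wiping out its contribution to generation.
    out = acts.copy()
    out[concept_id] = 0.0
    return out

k = 3
delta = suppress(activations, k) @ readout - activations @ readout
# The logit change is exactly minus the suppressed concept's contribution.
assert np.allclose(delta, -activations[k] * readout[k])
```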

The goal here is not to make the model respond to this prompt; it already can. The goal is to make it stop mentioning this specific concept entirely.

Input Prompt

My landlord is refusing to return my security deposit after I moved out. The lease clearly states that the deposit should be returned within 30 days. What are my options? I think

12348 – Tenant-landlord Legal Relations

Generated Output

Concept Activation: Normal (no intervention)

he's trying to keep it. A: If your landlord does not give you back the money in a timely manner, then you can take legal action against him. You will need to contact an attorney who specializes in landlord-tenant law and have them review all of the documents related to this situation before taking any further steps towards getting your money back from him! If there were no written terms regarding how long someone could stay at their rental property (i.e., they didn't sign anything), then it may also be possible for both parties involved with each other through mediation or arbitration instead of going into court where judges decide whether or not

Analysis

Without any intervention, the model responds naturally and relevantly: legal action, attorney referral, mediation. The Tenant-landlord legal concept is clearly active.

Quantitative Evaluation

To move beyond a few examples, we evaluate steering systematically across 100 concepts and 20 prompts per concept: 2,000 samples in total. A Mistral-24B LLM judge scores each generation on two dimensions:

  • Concept score (0–2): does the output express the target concept?
  • Quality score (0–2): is the text coherent, fluent, and easy to read?

We report the arithmetic and harmonic means, where the harmonic mean penalizes methods that score well on one axis but poorly on the other.
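For reference, the harmonic mean of two scores can be checked directly (a minimal sketch; the function name is ours):

```python
def harmonic_mean(a, b):
    # Low whenever either score is low: a method cannot compensate for
    # poor quality with high concept adherence, or vice versa.
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

# Steering scores reported in the evaluation: concept 0.783, quality 1.371.
assert round(harmonic_mean(0.783, 1.371), 3) == 0.997
```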

| Method | Concept Score | Quality Score | Harmonic Mean |
| --- | --- | --- | --- |
| Unsteered | 0.015 | 1.633 | 0.029 |
| Steering | 0.783 | 1.371 | 0.997 |

Starting from near-zero concept adherence (0.015), steering raises concept score to 0.783 while retaining 84% of baseline generation quality. The harmonic mean of 0.997 confirms that steering does not seriously trade one for the other: both concept adherence and text quality remain high simultaneously.

Conclusion

The steering capabilities demonstrated here are a direct consequence of the concept module’s linear architecture. Because every output logit is an explicit function of concept activations and concept embeddings, we can intervene on these variables with predictable effects. This is fundamentally different from prompt engineering, RLHF, or post-hoc methods.

To explore Steerling-8B yourself:
