Subquadratic: Introducing SubQ 1.1 Small

Original link: https://subq.ai/subq-1-1-small-technical-report

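SubQ has introduced SubQ 1.1 Small, a model that uses Subquadratic Sparse Attention (SSA) to overcome the high computational cost of conventional dense attention. By replacing quadratic scaling with a learned sparse formulation, the model achieves near-perfect retrieval at context lengths of up to 12 million tokens. At 1 million tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2. Despite this optimization, the model retains strong performance on general reasoning, coding, and knowledge benchmarks, comparable to many frontier models. The architecture supports direct, holistic reasoning over large bodies of data, such as entire legal contracts, complete codebases, and comprehensive collections of financial documents, without cumbersome chunking or retrieval workarounds. By making multi-million-token experiments efficient, SubQ aims to unlock deeper agentic capabilities in complex domains such as software engineering and financial due diligence. The company is currently deploying the model with design partners and plans a broader release later this year.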

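The Hacker News community is discussing the release of SubQ 1.1 Small. The model is claimed to maintain near-perfect retrieval at long context lengths (up to 12 million tokens) while using only 0.13% of the compute of standard attention. Reactions to the announcement are mixed. Enthusiastic users hope the technology will lead to cheaper, faster LLMs that let developers process very large codebases in a single prompt. Many commenters, however, are skeptical about the project's lack of technical transparency. Critics point out that the lab's "technical report" is vague and does not explain how the sparse attention mechanism actually works. Some users contrast this secrecy with other labs that share architectural specifications, and suggest the opacity may be intended to avoid being copied by larger competitors. Others treat the lab's claims with caution, noting that its earlier announcements also lacked verifiable detail or substance. Overall, although the performance claims are attractive, the community remains in wait-and-see mode until more solid evidence is available.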

Original text

The hardest enterprise AI problems share a common shape. They require reasoning over complete artifacts: entire codebases, document collections, contracts, financial filings.

For years, the industry worked around this problem by building retrieval pipelines, chunking strategies, and agentic scaffolding — useful tools, but ultimately workarounds for context limitations of the model architecture. The underlying constraint was attention: compute that scales quadratically with context length, making direct reasoning over large artifacts prohibitively expensive.

SubQ is built to remove that constraint. Today we're releasing the model card for SubQ 1.1 Small — the second iteration of our Subquadratic Sparse Attention (SSA) model, at the smallest size. We are in the process of deploying SubQ 1.1 Small with select design partners and plan to deploy a broader lineup of models ranging from 2M to 12M tokens later in the year.

Key Features

Near-perfect long-context retrieval up to 12M tokens on the needle-in-a-haystack test, with a reduction in attention compute of nearly 1,000x.
A balance of long-context optimization and general reasoning ability, with strong performance retained across knowledge, coding, and non-coding enterprise agent benchmarks.
At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.

These results reflect the scaling advantage that SSA's efficiency gains make possible.

Benchmarks

SubQ 1.1 Small was evaluated across five axes, covering long-context retrieval, context-length generalization, knowledge, coding, and long-horizon agentic tasks.

Long-Context Retrieval & Generalization

We selected Needle-In-A-Haystack (NIAH) and Nvidia's RULER test because together they test whether the model can find a single fact buried deep in a large context, and whether it can connect the dots across that context.

NIAH is the precision test. It places one retrievable fact at a controlled depth within a long context and asks the model to return it exactly. SubQ 1.1 Small scores near-perfect at 1M, 2M, 6M, and 12M tokens. The model was trained predominantly at 1M tokens yet the retrieval held near perfectly at 12x that length, despite compressing attention to just 0.13% of relationships. This generalization is a direct consequence of SSA routing attention based on content relevance rather than fixed positional patterns.
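The NIAH setup described above can be sketched in a few lines. This is a minimal illustration of the general technique, not the evaluation harness used for SubQ: the filler sentences, the needle, the helper names `build_niah_prompt` and `exact_match`, and the sentence-level depth control are all assumptions made for the example.

```python
import random


def build_niah_prompt(filler_sentences, needle, depth, total_sentences, seed=0):
    """Build a haystack of filler sentences with `needle` placed at a relative depth.

    depth is a fraction in [0, 1]: 0.0 puts the needle first, 1.0 puts it last.
    Returns the prompt text and the sentence index where the needle was inserted.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    rng = random.Random(seed)
    haystack = [rng.choice(filler_sentences) for _ in range(total_sentences)]
    insert_at = round(depth * total_sentences)
    haystack.insert(insert_at, needle)
    return " ".join(haystack), insert_at


def exact_match(model_answer, expected):
    """Score 1 if the expected fact appears verbatim in the answer, else 0."""
    return int(expected.strip().lower() in model_answer.strip().lower())


# Hypothetical filler and needle, chosen only for illustration.
filler = [
    "The sky was grey that morning.",
    "Traffic moved slowly downtown.",
    "The committee met on Tuesday.",
]
needle = "The access code for the vault is 7421."
prompt, position = build_niah_prompt(filler, needle, depth=0.5, total_sentences=1000)
question = "What is the access code for the vault?"
```

A real run would scale `total_sentences` until the prompt reaches the target token count, sweep `depth` across the context, send `prompt` plus `question` to the model, and average `exact_match` over all placements.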

RULER is the capability test. Its 13 tasks go beyond single-fact lookup to cover multi-hop variable tracing, frequency extraction, and aggregation across the full context, the kind of reasoning that complete-artifact workloads actually require. SubQ 1.1 Small scores 99.12% at 128K.

Figure: Multi-task retrieval, RULER (128K)

Figure: Single-fact retrieval, Needle-in-a-haystack (1M–12M)

General Knowledge & Reasoning

SubQ 1.1 Small balances long-context optimization with general reasoning ability without compromise. GPQA Diamond at 85.4% sits just below mid-tier frontier models and well above the smaller tier. LiveCodeBench at 89.7% pass@4 is close to the absolute frontier. AutomationBench Finance at 13% places SubQ 1.1 Small close to the strongest models on that benchmark, ahead of mid-tier and smaller baselines. Absolute scores remain low across all models on this benchmark.

Benchmark	SubQ 1.1 Small	GPT-5.5	Opus 4.8	Sonnet 4.6	GPT-5.4-mini	GPT-5.4-nano	Haiku 4.5
Graduate-level science GPQA Diamond · pass@1	85.4	93.2	92	87.5	87.5	81.7	67.2
Agentic finance AutomationBench	13%	18%	16%	8%	0%	n/r	3%
Competitive programming LiveCodeBench v6 · pass@4	89.7	92	92.2	88.9	78.6	78.2	69.7

n/r = result not reported by the model provider
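The table mixes pass@1 and pass@4 scores, and the post does not say how these were computed. One commonly used approach in code-generation evaluation is the unbiased pass@k estimator: generate n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the function name `pass_at_k` is chosen for this example.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number of samples that passed,
    k: sample budget being scored. Returns the probability that at least
    one of k samples drawn without replacement from the n is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems. Because a larger k gives the model more attempts, pass@4 and pass@1 figures are not directly comparable with each other.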

Efficiency

SSA replaces the O(n²) dense attention pass with a learned sparse formulation that scales linearly with context length. SSA's advantage over dense attention grows as context length increases. At 1M tokens, SubQ requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2 on a single attention layer. In practice, this drastically changes the economics of long-context training and inference.
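To see why the advantage grows with context length, consider a back-of-envelope extrapolation. It assumes that dense attention cost grows exactly as n² and sparse cost exactly as n, so their ratio grows linearly in n. The only measured input is the 64.5x figure at 1M tokens from the text; every other number the code produces is arithmetic, not a benchmark result.

```python
def compute_ratio_if_linear(n_tokens, ref_tokens=1_000_000, ref_ratio=64.5):
    """Extrapolated dense/sparse attention compute ratio.

    Assumes dense cost ~ n^2 and sparse cost ~ n, so the ratio is proportional
    to n. It is calibrated on the one reported data point (64.5x at 1M tokens).
    """
    return ref_ratio * n_tokens / ref_tokens


for n in (128_000, 1_000_000, 2_000_000, 6_000_000, 12_000_000):
    ratio = compute_ratio_if_linear(n)
    print(f"{n:>12,} tokens: ~{ratio:,.1f}x less attention compute")
```

Under this assumption the ratio at 12M tokens comes out to about 774x. That is the same order of magnitude as the "nearly 1,000x" reduction quoted in the key features, and close to 1 / 0.0013 ≈ 769, the figure implied by attending to 0.13% of relationships.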

A full breakdown of the mechanism and how it compares to FlashAttention, DeepSeek sparse attention, and recurrent architectures is in the Technical Report.

Figure: SubQ uses 64.5x less compute than dense attention and is 56x faster than FlashAttention-2 at a 1M-token context.

Training

We started with an existing open-weight frontier model, replaced dense attention with SSA, and built long-context capability through staged context extension (262K, 512K, 1M, 2M) followed by roughly one trillion tokens of continued pretraining on naturally long artifacts: books, documents, and repository-scale code.

The strongest lever we found for improving long-context retrieval was long-context continued pretraining, made possible by the efficiency of the SSA algorithm. The 12M generalization result reflects both factors: SSA's selection criterion is independent of absolute position, and the capability to use that generalization reliably develops through training on long data.
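The post does not disclose how SSA selects what to attend to, so the following is not SSA. It is a toy top-k selection by dot-product score, a generic technique used here only to illustrate what "independent of absolute position" means: the same keys are chosen wherever they sit in the context. The vectors and the helper `select_top_k` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_top_k(query, keys, k):
    """Return the (sorted) indices of the k keys scoring highest against query."""
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


query = [1.0, 0.0]
relevant = [[0.9, 0.1], [0.8, -0.2]]               # keys that match the query
noise = [[-0.5, 0.3], [0.0, 1.0], [-0.1, -0.9]]    # keys that do not

short_context = noise + relevant                   # relevant keys at positions 3, 4
long_context = noise * 4 + relevant + noise        # same keys, now at positions 12, 13

picked_short = select_top_k(query, short_context, k=2)
picked_long = select_top_k(query, long_context, k=2)

# The indices differ, but the selected content is identical in both contexts.
same_content = (
    [short_context[i] for i in picked_short]
    == [long_context[i] for i in picked_long]
)
```

This toy still scores every key for each query, so it has none of the efficiency properties claimed for SSA. It shows only the selection behavior: a criterion based purely on content carries over unchanged to a longer context, whereas a fixed positional pattern such as a sliding window would not find keys that have moved.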

Additionally, we ran more than one hundred experiments across six to seven model generations to get the balance of capabilities between long- and short-context tasks right. That kind of iteration is only possible because SSA enabled our team to run multi-million-token experiments as a standard procedure rather than a rare event, making the research loop more efficient.

Use Cases

SubQ is designed for workloads that require reasoning over information distributed across an entire artifact, without fragmenting it. Here are a few of the use cases from our initial research:

Financial analysis and due diligence. Filings, earnings reports, contracts, and internal records are only meaningful in combination. SubQ reasons across the full collection rather than summarizing each document in isolation.
Legal and contract work. A contract may define a term on page 2, qualify it on page 12, and carve out an exception on page 46. Retrieval finds the sentence but loses the relationships. SubQ holds the whole document and reasons across it directly.
Software engineering. Codebases distribute logic across files, modules, and dependencies in ways that short-context models can't hold at once. SubQ loads an entire repository into a single context window, enabling architecture-level reasoning, cross-file refactoring, and dependency tracing in one pass. We believe there will be significant value for long-context models in planning, review, and long-horizon memory within coding.

What's Next

We'll be kicking off with the first cohort of design partners in the next few weeks, with broader rollout through the quarter and general model releases by end of year.