$500 GPU outperforms Claude Sonnet on coding benchmarks

Original link: https://github.com/itigges22/ATLAS

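## ATLAS: a self-hostable AI system that rivals API models

ATLAS is a system that uses a *frozen* 14B Qwen3 model and a single RTX 5060 Ti GPU to reach a 74.6% pass rate on the challenging LiveCodeBench benchmark, up from 36-41% in the previous version. This performance is comparable to expensive API models such as GPT-5 and Claude, with key advantages: **no fine-tuning, no API calls, and fully self-hosted.** ATLAS works by wrapping the frozen model in a three-phase pipeline: **PlanSearch** for problem decomposition, the **Geometric Lens** for candidate selection, and **self-verified refinement** driven by model-generated test cases. Combined with a repair mechanism (PR-CoT), this iterative process overcomes the limits of single-shot generation. ATLAS also prioritizes data privacy and cost efficiency: all processing runs locally, and the only cost is electricity (about $0.004 per task). Future development (V3.1) will focus on speed, on extending benchmark coverage beyond coding, and on refining the core components for broader applicability. The project is source-available (A.T.L.A.S Source Available License v1.0) and aims to improve portability across hardware configurations.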

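A $500 GPU can now beat Claude Sonnet on a coding benchmark, thanks to the ATLAS system and the techniques described by itigges22. ATLAS reaches a 74.6% pass rate (DeepSeek V3.2 scores 86.2%) at an estimated cost of about $0.004 per task in electricity. That is well below the roughly $0.043 to $0.066 per task estimated for GPT-5 and Claude Sonnet API access, though still above DeepSeek's roughly $0.002. ATLAS generates multiple candidate solutions and uses a "cost field" (a small neural network) to predict which one is most likely to be correct *before* testing. The network analyzes code "fingerprints" (embedding vectors) to identify high-quality code and picks the best solution about 88% of the time on tasks where the candidates give mixed results. Although the benchmark results are encouraging, commenters stressed the importance of real-world usability and efficiency. Discussion also covered hardware compatibility: the setup currently depends on Nvidia, though AMD cards are improving for AI workloads. Token processing speed also needs further study.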

Original text

Adaptive Test-time Learning and Autonomous Specialization

A.T.L.A.S achieves 74.6% LiveCodeBench pass@1-v(k=3) with a frozen 14B model on a single consumer GPU -- up from 36-41% in V2 -- through constraint-driven generation and self-verified iterative refinement. The premise: wrap a frozen smaller model in intelligent infrastructure -- structured generation, energy-based verification, self-verified repair -- and it can compete with frontier API models at a fraction of the cost. No fine-tuning, no API calls, no cloud. Fully self-hosted -- no data leaves the machine, no API keys required, no usage metering. One GPU, one box.

Hardware: RTX 5060 Ti 16GB | Model: Qwen3-14B-Q4_K_M (frozen)

| Benchmark | Score | Tasks | Method |
|---|---|---|---|
| LiveCodeBench v5 | 74.6% pass@1-v(k=3)* | 599 | V3 pipeline: PlanSearch + self-verified PR-CoT repair, V3 Score |
| GPQA Diamond | 47.0% | 198 | k=5, multiple-choice knowledge reasoning, V2 Score |
| SciCode | 14.7% (sub-problems) | 341 | k=1, cross-domain scientific coding, V2 Score |

*pass@1-v(k=3) = one solution submitted per task, but generated via best-of-3 candidates + Lens selection + iterative repair on failures. This is not single-shot generation, so it is not pass@1. See methodology.

V3 ablation breakdown

| Condition | Configuration | Pass Rate | Delta |
|---|---|---|---|
| A | Baseline (no V3) | 54.9% | -- |
| B | +Phase 1 (PlanSearch + BudgetForcing + DivSampling) | 67.3% | +12.4pp |
| C | +Phase 1+2 (Lens routing) | 67.3% | +0.0pp |
| D | +Phase 1+3 (self-verified refinement) | 74.6% | +7.3pp |
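The Delta column can be re-derived from the pass rates. The short sketch below does that arithmetic, reading each Delta as the gain over the row directly above it. That reading is an assumption; since conditions B and C score the same, D's +7.3pp comes out identical whether it is measured against B or C.

```python
# Pass rates (percent) copied from the ablation table.
ABLATION = [
    ("A", "Baseline (no V3)", 54.9),
    ("B", "+Phase 1", 67.3),
    ("C", "+Phase 1+2", 67.3),
    ("D", "+Phase 1+3", 74.6),
]


def deltas_vs_previous(rows):
    """Return {condition: gain in percentage points over the preceding row}."""
    gains = {}
    for (_, _, prev_rate), (cond, _, rate) in zip(rows, rows[1:]):
        gains[cond] = round(rate - prev_rate, 1)
    return gains


DELTAS = deltas_vs_previous(ABLATION)

# Overall gain of the full pipeline (D) over the baseline (A).
TOTAL_GAIN = round(ABLATION[-1][2] - ABLATION[0][2], 1)

print(DELTAS, TOTAL_GAIN)
```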

Phase 3 uses self-generated test cases for internal verification -- the model never sees the answer key during repair. PR-CoT rescues 36/42 tasks (85.7% of Phase 3 rescues). Full report: V3_ABLATION_STUDY.md
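As a rough illustration of what checking a candidate against self-generated I/O pairs could look like, here is a minimal sketch. The function name `passes_self_tests`, the `(stdin, expected_stdout)` pair format, and the timeout are hypothetical and not taken from the ATLAS code. A bare subprocess is also not a real sandbox; the repository's `sandbox/` component presumably provides proper isolation.

```python
import subprocess
import sys


def passes_self_tests(source, io_pairs, timeout_s=5.0):
    """Run `source` as a script once per (stdin, expected_stdout) pair.

    Returns True only if every run exits cleanly and prints the expected
    output (compared after stripping surrounding whitespace).
    """
    for stdin_text, expected in io_pairs:
        try:
            result = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a hang as a failed test
        if result.returncode != 0:
            return False  # crash or uncaught exception
        if result.stdout.strip() != expected.strip():
            return False  # wrong answer
    return True
```

Because the I/O pairs come from the model itself, a pass here is only evidence of correctness, which is consistent with the source scoring against the real tests only at the end.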

Cost and Performance Context

| System | LCB pass@1 | Est. cost/task | Notes |
|---|---|---|---|
| DeepSeek V3.2 Reasoning | 86.2% | ~$0.002 | API, single-shot |
| GPT-5 (high) | 84.6% | ~$0.043 | API, single-shot |
| ATLAS V3 (pass@1-v(k=3)) | 74.6% | ~$0.004 | Local electricity only, best-of-3 + repair pipeline |
| Claude 4.5 Sonnet | 71.4% | ~$0.066 | API, single-shot |
| Claude 4 Sonnet | 65.5% | ~$0.066 | API, single-shot |

Methodology notes & sources

Methodology notes: ATLAS scores are from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model, reported as pass@1-v(k=3). Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head. API costs assume ~2,000 input + ~4,000 output tokens per task at current pricing. ATLAS cost = electricity at $0.12/kWh (~165W GPU, ~1h 55m for 599 tasks). ATLAS trades latency for cost -- the pipeline takes longer per task than a single API call, but no data leaves the machine.
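The API-side estimates follow from the stated token budget (~2,000 input + ~4,000 output tokens per task). The sketch below shows the arithmetic. The per-million-token prices in `ASSUMED_PRICES` are assumptions, not figures from the source, and provider pricing changes over time; they are included because they reproduce the table's estimates to within rounding.

```python
def api_cost_per_task(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost in USD of one request, given prices per million tokens."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


# ASSUMED list prices in USD per million tokens: (input, output).
# These are not stated in the source; check each provider's pricing page.
ASSUMED_PRICES = {
    "Claude Sonnet": (3.00, 15.00),
    "GPT-5": (1.25, 10.00),
    "DeepSeek V3.2": (0.28, 0.42),
}

# Token budget per task as stated in the methodology notes.
COSTS = {
    name: api_cost_per_task(2_000, 4_000, price_in, price_out)
    for name, (price_in, price_out) in ASSUMED_PRICES.items()
}

for name, cost in COSTS.items():
    print(f"{name}: ${cost:.4f} per task")
```

Output tokens dominate each estimate, so the comparison is sensitive to how many tokens a model actually emits per task.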

Sources: Artificial Analysis LCB Leaderboard | AA Benchmarking Methodology | LiveCodeBench Paper (arXiv) | LCB Dataset (HuggingFace) | Pricing: OpenAI, Anthropic, DeepSeek

flowchart LR
  subgraph Phase1["Phase 1: Generate"]
    PS[PlanSearch<br/>Constraint extraction<br/>+ diverse plans]
    BF[Budget Forcing<br/>Thinking token<br/>control]
  end

  subgraph Verify["Score + Test"]
    GL[Geometric Lens<br/>C x energy scoring<br/>5120-dim self-embeddings]
    SB[Sandbox<br/>Code execution]
  end

  subgraph Phase3["Phase 3: Repair"]
    ST[Self-Test Gen<br/>Model-generated<br/>I/O pairs]
    PR[PR-CoT Repair<br/>Multi-perspective<br/>chain-of-thought]
  end

  PS --> BF
  BF -->|k=3 candidates| GL
  GL -->|energy-sorted| SB
  SB -->|all fail| ST
  ST --> PR
  PR -->|repaired code| SB

  style GL fill:#2d5016,color:#fff
  style PS fill:#1a3a5c,color:#fff
  style BF fill:#1a3a5c,color:#fff
  style SB fill:#2d5016,color:#fff
  style ST fill:#5c3a1a,color:#fff
  style PR fill:#5c3a1a,color:#fff

A single patched llama-server runs on K3s, providing both generation with speculative decoding (~100 tok/s) and 5120-dim self-embeddings for Lens scoring. The Geometric Lens C(x) energy field selects the best candidate (87.8% accuracy on mixed-result tasks). Failed tasks enter Phase 3, where the model generates its own test cases and iteratively repairs solutions via PR-CoT -- real tests are used only for final scoring.
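The control flow in the flowchart can be sketched as a single function. This is only an outline of the described stages, not the ATLAS implementation: the callables `generate`, `energy`, `passes`, and `repair` and the `max_repairs` limit are hypothetical stand-ins, and treating lower energy as better is an assumption.

```python
from typing import Callable, List, Optional


def solve_task(
    generate: Callable[[int], List[str]],  # Phase 1: produce k candidates
    energy: Callable[[str], float],        # Lens score (assumed: lower is better)
    passes: Callable[[str], bool],         # sandbox check for one candidate
    repair: Callable[[str], str],          # Phase 3: one repair step
    k: int = 3,
    max_repairs: int = 2,                  # hypothetical repair budget
) -> Optional[str]:
    """Return the single solution to submit for a task, or None."""
    ranked = sorted(generate(k), key=energy)  # energy-sorted candidates
    if not ranked:
        return None

    # Try candidates in Lens order and stop at the first that passes.
    for code in ranked:
        if passes(code):
            return code

    # All candidates failed: iteratively repair the best-ranked one.
    current = ranked[0]
    for _ in range(max_repairs):
        current = repair(current)
        if passes(current):
            return current
    return None
```

At most one solution is returned per task, which matches the pass@1-v(k=3) definition: one submission, selected from several candidates and possibly repaired.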

Full architecture: docs/ARCHITECTURE.md

Before you begin: ATLAS was developed and tested on specific hardware. Read the Hardware & Reproduction section below to check compatibility and tune variables for your setup before running.

git clone https://github.com/itigges22/ATLAS.git && cd ATLAS

cp atlas.conf.example atlas.conf    # set MODEL_PATH, DATA_DIR, GPU device
sudo ./scripts/install.sh
./scripts/verify-install.sh

# Run V3 benchmark
python3 benchmark/v3_runner.py

See docs/SETUP.md for full installation instructions.

| Resource | Minimum | Tested |
|---|---|---|
| GPU VRAM | 16 GB | RTX 5060 Ti 16 GB |
| System RAM | 14 GB | 16 GB |
| Python | 3.10+ | 3.11 |
| OS | RHEL 9 / Ubuntu 24 | RHEL 9 (Proxmox VM) |

Reproduction details

V3 results were produced on RHEL 9 running as a Proxmox VM with an RTX 5060 Ti 16GB passed through via VFIO. Other NVIDIA GPUs with 16GB+ VRAM should work, though you may need to adjust driver versions and VRAM allocation.

The pipeline is not yet plug-and-play on arbitrary hardware -- V3.1 will improve portability. That said, Claude Code can be used to retrofit the pipeline to your specific setup (different GPU, OS, VRAM budget).

Key variables to tune for your hardware:

- --parallel slots (default 2 -- reduce to 1 if VRAM is tight)
- KV cache quantization (Q4_0 -- see ARCHITECTURE.md for VRAM breakdown)
- Context per slot (default 20480 tokens)
- CUDA driver version (tested on CUDA 12.8)
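To get a feel for how slot count and per-slot context trade off against VRAM, the KV cache size can be estimated as below. This is a back-of-the-envelope sketch, not the breakdown from ARCHITECTURE.md. The model shape used in the example (40 layers, 8 KV heads, head dimension 128) is an assumption about Qwen3-14B that the source does not state, and the sketch assumes both K and V are stored as Q4_0, which packs 32 values into 18 bytes.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_per_slot, slots,
                   block_bytes=18, block_elems=32):
    """Estimate KV cache size for block-quantized storage.

    Defaults model a Q4_0-style layout: 32 values packed into 18 bytes
    (16 bytes of 4-bit values plus a 2-byte scale).
    """
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V
    total_elems = elems_per_token * ctx_per_slot * slots
    return total_elems // block_elems * block_bytes


# ASSUMED model shape (not from the source): 40 layers, 8 KV heads, head_dim 128.
two_slots = kv_cache_bytes(40, 8, 128, ctx_per_slot=20480, slots=2)
one_slot = kv_cache_bytes(40, 8, 128, ctx_per_slot=20480, slots=1)

print(f"2 slots: {two_slots / 2**30:.2f} GiB, 1 slot: {one_slot / 2**30:.2f} GiB")
```

Under these assumptions, dropping from two slots to one halves the cache, which is why reducing `--parallel` is the first knob to try when VRAM is tight.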

Full VRAM budget breakdown is documented in docs/ARCHITECTURE.md. Community reproduction attempts are welcome -- open an issue with your hardware config and results.

benchmark/       Benchmark suite (V2 runner, V3 pipeline, datasets)
benchmark/v3/    V3 subsystems (16 modules: PlanSearch, BudgetForcing, PR-CoT, etc.)
rag-api/         Core API: Geometric Lens, confidence router, RAG, cache
llama-server/    Patched llama.cpp server (spec decode + self-embeddings)
manifests/       K3s deployment manifests
scripts/         Installation and management scripts
tests/           Test suite (infrastructure, integration, V3)
docs/            Architecture, setup, configuration, troubleshooting
api-portal/      API key management portal (JWT auth, web UI)
sandbox/         Isolated code execution environment

Historical documentation

V3.0 -- Complete (2026-03-05)

74.6% LCB pass@1-v(k=3) on frozen Qwen3-14B-Q4_K_M. PlanSearch + BudgetForcing + Geometric Lens + PR-CoT repair pipeline. Full ablation report.

Known limitations, actively being addressed in V3.1:

- LCB-only optimization. V3 phases were designed and tuned for LiveCodeBench. GPQA Diamond (47.0%) and SciCode (14.7%) results are included but those benchmarks were not optimized for. Cross-domain generalization is a V3.1 priority.
- Phase 2 (Geometric Lens routing) contributed +0.0pp. C(x) was retrained on self-embeddings for V3 (fixing the V2 nomic embedding failure), but the training dataset was only ~60 samples -- far too small to learn a meaningful energy landscape. With an undertrained C(x), the Lens cannot discriminate candidates during routing. V3.1 retrains C(x) on a properly sized dataset drawn from real benchmark problems.
- G(x) metric tensor is dormant. G(x) is downstream of C(x): it applies metric corrections to C(x)'s gradient signal. With C(x) undertrained and producing a weak/noisy energy landscape, G(x) has no meaningful geometry to navigate -- the correction term Δx = -G⁻¹∇C contributes nothing. G(x) is currently being redesigned from the ground up; V3.1 will either ship a working redesign or remove it entirely pending further research.
- Single-threaded task pipeline. Tasks are processed sequentially. V3.1 adds task-level parallelization to improve benchmark throughput.
- SandboxAdapter stdio bug. S* distinguishing input tiebreaking is implemented but non-functional on LCB tasks due to a stdio handling bug in the SandboxAdapter. Fixed in V3.1.
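The G(x) correction term can be written out to show why it depends on C(x). The source gives only Δx = -G⁻¹∇C; reading G(x) as an invertible metric that preconditions the gradient of C(x), in the style of a natural-gradient step, is an interpretation rather than something the source states.

```latex
\Delta x = -\,G(x)^{-1}\,\nabla_x C(x),
\qquad
G(x) = I \;\Rightarrow\; \Delta x = -\nabla_x C(x),
\qquad
\nabla_x C(x) \approx 0 \;\Rightarrow\; \Delta x \approx 0 .
```

Under this reading G only rescales and rotates whatever gradient C supplies, so a flat or noisy energy landscape leaves it nothing useful to correct.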

Planned for V3.1:

- Model swap: Qwen3-14B → Qwen3.5-9B with DeltaNet linear attention architecture. Native multi-token prediction (MTP) gives ~3-4x throughput improvement at comparable or better accuracy. Smaller model also frees VRAM headroom.
- Lens Evolution: Online C(x) recalibration -- Geometric Lens updates based on benchmark feedback rather than remaining static after initial training.
- Phase 2 redesign: Retrain C(x) on a properly sized self-embedding dataset, remove or redesign G(x), fix SandboxAdapter stdio bug.
- Task parallelization: Parallel task execution for faster benchmark runs.
- Broader benchmark suite: See below.

V3.1 benchmark suite (planned)

V3 was evaluated only on LiveCodeBench v5. V3.1 expands evaluation to cover coding, reasoning, and general knowledge -- because ATLAS is not purely a coding system. The Confidence Router allocates compute based on task difficulty: simple knowledge questions route to raw inference + RAG (~30 seconds per response), while hard coding problems use the full V3 pipeline (PlanSearch + best-of-3 + PR-CoT repair), which can take up to 20 minutes per task. The benchmark suite should reflect this full range.
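The routing decision described above can be illustrated with a small sketch. The numeric `difficulty` score, the `threshold`, and the `is_coding_task` flag are hypothetical, since the source does not say how the Confidence Router measures difficulty. Only the two paths and their approximate latencies come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    approx_latency_s: int  # rough latency as quoted in the text


FAST_PATH = Route("raw inference + RAG", 30)
FULL_PIPELINE = Route("V3 pipeline (PlanSearch + best-of-3 + PR-CoT repair)", 20 * 60)


def route(is_coding_task: bool, difficulty: float, threshold: float = 0.5) -> Route:
    """Send hard coding problems to the full pipeline, everything else to the fast path.

    `difficulty` is a hypothetical score in [0, 1]; `threshold` is illustrative.
    """
    if is_coding_task and difficulty >= threshold:
        return FULL_PIPELINE
    return FAST_PATH


print(route(True, 0.9).name)
print(route(False, 0.9).name)
```

The latency gap between the two paths (about 30 seconds versus up to 20 minutes) is what makes the routing decision matter for general-purpose use.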

Coding benchmarks (primary):

- LiveCodeBench v5 -- 599 problems, contamination-resistant, primary benchmark (done in V3)
- SciCode -- cross-domain scientific coding, run in V2 (14.7% sub-problems), needs V3 re-evaluation
- Additional contamination-resistant coding benchmarks as they emerge

Reasoning and knowledge benchmarks (from Artificial Analysis Intelligence Index):

- GPQA Diamond -- scientific reasoning, run in V2 (47.0%), needs V3 re-evaluation under full pipeline
- AA-LCR (Long Context Reasoning) -- tests reasoning over extended context, relevant given 20480-token per-slot context window
- AA-Omniscience -- knowledge accuracy and hallucination rate; important for general-purpose use cases where users ask factual questions rather than coding problems
- Humanity's Last Exam -- extreme reasoning and knowledge, useful as a ceiling test
- CritPt -- physics reasoning, tests cross-domain generalization beyond software

General knowledge benchmarks matter because ATLAS is designed as a general-purpose self-hosted AI system, not a coding-only tool. The Confidence Router handles this by routing knowledge queries directly to raw inference + RAG (~30s), while reserving the full pipeline for hard coding problems (~20min). Benchmarks like AA-Omniscience and Humanity's Last Exam validate the fast-path routing, not the coding pipeline.

None of these V3.1 benchmarks have been run yet. This section is forward-looking roadmap only.

Target: 80-90% LCB pass@1-v(k=3) with faster per-task throughput.

Licensed under the A.T.L.A.S Source Available License v1.0 -- see LICENSE.