$500 GPU outperforms Claude Sonnet on coding benchmarks

Original link: https://github.com/itigges22/ATLAS

## ATLAS: A Self-Hosted AI System That Rivals API Models

ATLAS is a novel system that uses a *frozen* 14B Qwen3 model and a single RTX 5060 Ti GPU to reach a 74.6% pass rate on the challenging LiveCodeBench benchmark, a significant improvement over previous versions. This performance is comparable to expensive API models such as GPT-5 and Claude, with key advantages: **no fine-tuning, no API calls, and fully self-hosted.** ATLAS works by wrapping the frozen model in an intelligent three-stage pipeline: **PlanSearch** for problem decomposition, **self-verified refinement** using model-generated test cases, and a **Geometric Lens** for candidate selection. This iterative process, combined with a repair mechanism (PR-CoT), lets it overcome the limitations of single-shot generation. Importantly, ATLAS prioritizes data privacy and cost efficiency: all processing happens locally, with electricity as the only cost (~$0.004/task). Future development (V3.1) will focus on improving speed, extending benchmark coverage beyond coding, and refining the core components for broader applicability. The project is open source and aims to improve portability across hardware configurations.

A $500 GPU can now beat Claude Sonnet on coding benchmarks, thanks to ATLAS and the techniques detailed by itigges22. ATLAS reaches a 74.6% pass rate (versus 86.2% for DeepSeek V3.2) at a comparable per-task cost: roughly $0.004 in electricity, against about $0.002 for DeepSeek API access. ATLAS's efficiency comes from generating multiple candidate solutions and using a "cost field" (a small neural network) to predict which one is most likely correct *before* testing. The network analyzes code "fingerprints" (embedding vectors) to identify high-quality code, and correctly picks the best solution 88% of the time. While the benchmark results are encouraging, commenters point to the importance of real-world usability and efficiency. Discussion also centered on hardware compatibility: the setup currently depends on Nvidia, though AMD cards are improving in AI performance. Token throughput also needs further investigation.

Original text

A.T.L.A.S


Adaptive Test-time Learning and Autonomous Specialization

A.T.L.A.S achieves 74.6% LiveCodeBench pass@1-v(k=3) with a frozen 14B model on a single consumer GPU -- up from 36-41% in V2 -- through constraint-driven generation and self-verified iterative refinement. The premise: wrap a frozen smaller model in intelligent infrastructure -- structured generation, energy-based verification, self-verified repair -- and it can compete with frontier API models at a fraction of the cost. No fine-tuning, no API calls, no cloud. Fully self-hosted -- no data leaves the machine, no API keys required, no usage metering. One GPU, one box.


Hardware: RTX 5060 Ti 16GB | Model: Qwen3-14B-Q4_K_M (frozen)

| Benchmark | Score | Tasks | Method |
|---|---|---|---|
| LiveCodeBench v5 | 74.6% pass@1-v(k=3)* | 599 | V3 pipeline: PlanSearch + self-verified PR-CoT repair (V3 score) |
| GPQA Diamond | 47.0% | 198 | k=5, multiple-choice knowledge reasoning (V2 score) |
| SciCode | 14.7% (sub-problems) | 341 | k=1, cross-domain scientific coding (V2 score) |

*pass@1-v(k=3) = one solution submitted per task, generated via best-of-3 candidates + Lens selection + iterative repair on failures. This is not single-shot generation, so it is not directly comparable to pass@1. See methodology.
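The protocol can be sketched in a few lines; the stage functions (`generate`, `lens_select`, `run_self_tests`, `repair`) are hypothetical stand-ins for the real pipeline components, not ATLAS's actual API.

```python
# Sketch of the pass@1-v(k=3) protocol: three candidates are generated,
# the Lens picks one, failures get a repair pass, and only ONE final
# solution is ever submitted for real scoring.
def solve_task(task, generate, lens_select, run_self_tests, repair):
    candidates = [generate(task) for _ in range(3)]  # best-of-3 sampling
    best = lens_select(candidates)                   # energy-based selection
    if not run_self_tests(task, best):               # self-generated tests only
        best = repair(task, best)                    # PR-CoT repair attempt
    return best                                      # the single submitted solution
```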

V3 ablation breakdown

| Condition | Configuration | Pass Rate | Delta |
|---|---|---|---|
| A | Baseline (no V3) | 54.9% | -- |
| B | +Phase 1 (PlanSearch + BudgetForcing + DivSampling) | 67.3% | +12.4pp |
| C | +Phase 1+2 (Lens routing) | 67.3% | +0.0pp |
| D | +Phase 1+3 (self-verified refinement) | 74.6% | +7.3pp |

Phase 3 uses self-generated test cases for internal verification -- the model never sees the answer key during repair. PR-CoT rescues 36/42 tasks (85.7% of Phase 3 rescues). Full report: V3_ABLATION_STUDY.md
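The self-verification idea can be illustrated as a small loop; `self_tests` and `repair_fn` here are illustrative placeholders, not ATLAS interfaces.

```python
# Sketch of Phase 3 self-verification: the model proposes input/output
# pairs, and a candidate is accepted only once it reproduces every
# expected output -- the answer key is never consulted during repair.
def self_verify(candidate_fn, self_tests):
    """True iff the candidate reproduces all model-generated I/O pairs."""
    return all(candidate_fn(inp) == out for inp, out in self_tests)

def repair_loop(candidate_fn, self_tests, repair_fn, max_rounds=3):
    """Iteratively repair until the candidate passes its own tests."""
    for _ in range(max_rounds):
        if self_verify(candidate_fn, self_tests):
            return candidate_fn  # accepted without seeing the answer key
        candidate_fn = repair_fn(candidate_fn)
    return candidate_fn          # best effort after max_rounds
```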

Cost and Performance Context

| System | LCB pass@1 | Est. cost/task | Notes |
|---|---|---|---|
| DeepSeek V3.2 Reasoning | 86.2% | ~$0.002 | API, single-shot |
| GPT-5 (high) | 84.6% | ~$0.043 | API, single-shot |
| ATLAS V3 (pass@1-v(k=3)) | 74.6% | ~$0.004 | Local electricity only, best-of-3 + repair pipeline |
| Claude 4.5 Sonnet | 71.4% | ~$0.066 | API, single-shot |
| Claude 4 Sonnet | 65.5% | ~$0.066 | API, single-shot |
Methodology notes & sources

Methodology notes: ATLAS scores are from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen, quantized 14B model, reported as "pass@1-v(k=3)". Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head. API costs assume ~2,000 input + ~4,000 output tokens per task at current pricing. ATLAS cost = electricity at $0.12/kWh (~165W GPU, ~1h 55m for 599 tasks). ATLAS trades latency for cost -- the pipeline takes longer per task than a single API call, but no data leaves the machine.
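The electricity figure follows directly from power draw, per-task runtime, and rate. As a sanity check (assuming the GPU dominates power draw, which is an assumption), ~$0.004/task corresponds to roughly 12 minutes of compute at 165 W and $0.12/kWh:

```python
def electricity_cost_per_task(gpu_watts, seconds_per_task, usd_per_kwh):
    """cost = (watts * seconds / 3.6e6) kWh * rate -- GPU draw only."""
    kwh = gpu_watts * seconds_per_task / 3.6e6
    return kwh * usd_per_kwh

# ~12 minutes at 165 W and $0.12/kWh comes out near $0.004
cost = electricity_cost_per_task(165, 12 * 60, 0.12)
```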

Sources: Artificial Analysis LCB Leaderboard | AA Benchmarking Methodology | LiveCodeBench Paper (arXiv) | LCB Dataset (HuggingFace) | Pricing: OpenAI, Anthropic, DeepSeek


```mermaid
flowchart LR
  subgraph Phase1["Phase 1: Generate"]
    PS[PlanSearch<br/>Constraint extraction<br/>+ diverse plans]
    BF[Budget Forcing<br/>Thinking token<br/>control]
  end

  subgraph Verify["Score + Test"]
    GL[Geometric Lens<br/>C x energy scoring<br/>5120-dim self-embeddings]
    SB[Sandbox<br/>Code execution]
  end

  subgraph Phase3["Phase 3: Repair"]
    ST[Self-Test Gen<br/>Model-generated<br/>I/O pairs]
    PR[PR-CoT Repair<br/>Multi-perspective<br/>chain-of-thought]
  end

  PS --> BF
  BF -->|k=3 candidates| GL
  GL -->|energy-sorted| SB
  SB -->|all fail| ST
  ST --> PR
  PR -->|repaired code| SB

  style GL fill:#2d5016,color:#fff
  style PS fill:#1a3a5c,color:#fff
  style BF fill:#1a3a5c,color:#fff
  style SB fill:#2d5016,color:#fff
  style ST fill:#5c3a1a,color:#fff
  style PR fill:#5c3a1a,color:#fff
```

A single patched llama-server runs on K3s, providing both generation with speculative decoding (~100 tok/s) and 5120-dim self-embeddings for Lens scoring. The Geometric Lens C(x) energy field selects the best candidate (87.8% accuracy on mixed-result tasks). Failed tasks enter Phase 3, where the model generates its own test cases and iteratively repairs solutions via PR-CoT -- real tests are used only for final scoring.
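The energy-based selection step can be sketched as a tiny scorer that maps each candidate's 5120-dim self-embedding to a scalar energy C(x) and keeps the lowest-energy candidate. The weights below are random placeholders, not the trained Geometric Lens.

```python
import numpy as np

# Toy sketch of energy-based candidate selection over self-embeddings.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(5120, 64))  # placeholder hidden layer
w2 = rng.normal(size=64)                      # placeholder output weights

def energy(embedding):
    hidden = np.tanh(embedding @ W1)  # single hidden layer
    return float(hidden @ w2)         # scalar energy C(x)

def lens_select(embeddings):
    """Index of the lowest-energy (predicted best) candidate."""
    return int(np.argmin([energy(e) for e in embeddings]))
```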

Full architecture: docs/ARCHITECTURE.md


Before you begin: ATLAS was developed and tested on specific hardware. Read the Hardware & Reproduction section below to check compatibility and tune variables for your setup before running.

```shell
git clone https://github.com/itigges22/ATLAS.git && cd ATLAS

cp atlas.conf.example atlas.conf    # set MODEL_PATH, DATA_DIR, GPU device
sudo ./scripts/install.sh
./scripts/verify-install.sh

# Run V3 benchmark
python3 benchmark/v3_runner.py
```

See docs/SETUP.md for full installation instructions.
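For orientation, a minimal atlas.conf might look like the following. Only MODEL_PATH, DATA_DIR, and the GPU device are named in the setup steps above, so the exact key names, values, and paths here are assumptions; consult atlas.conf.example for the real schema.

```shell
# Hypothetical atlas.conf sketch -- keys and paths are illustrative only.
MODEL_PATH=/models/Qwen3-14B-Q4_K_M.gguf   # frozen quantized model
DATA_DIR=/var/lib/atlas                    # benchmark data and caches
CUDA_VISIBLE_DEVICES=0                     # GPU device selection
```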


| Resource | Minimum | Tested |
|---|---|---|
| GPU VRAM | 16 GB | RTX 5060 Ti 16 GB |
| System RAM | 14 GB | 16 GB |
| Python | 3.10+ | 3.11 |
| OS | RHEL 9 / Ubuntu 24 | RHEL 9 (Proxmox VM) |
Reproduction details

V3 results were produced on RHEL 9 running as a Proxmox VM with an RTX 5060 Ti 16GB passed through via VFIO. Other NVIDIA GPUs with 16GB+ VRAM should work, though you may need to adjust driver versions and VRAM allocation.

The pipeline is not yet plug-and-play on arbitrary hardware -- V3.1 will improve portability. That said, Claude Code can be used to retrofit the pipeline to your specific setup (different GPU, OS, VRAM budget).

Key variables to tune for your hardware:

  • --parallel slots (default 2 -- reduce to 1 if VRAM is tight)
  • KV cache quantization (Q4_0 -- see ARCHITECTURE.md for VRAM breakdown)
  • Context per slot (default 20480 tokens)
  • CUDA driver version (tested on CUDA 12.8)

Full VRAM budget breakdown is documented in docs/ARCHITECTURE.md. Community reproduction attempts are welcome -- open an issue with your hardware config and results.


```
benchmark/       Benchmark suite (V2 runner, V3 pipeline, datasets)
benchmark/v3/    V3 subsystems (16 modules: PlanSearch, BudgetForcing, PR-CoT, etc.)
rag-api/         Core API: Geometric Lens, confidence router, RAG, cache
llama-server/    Patched llama.cpp server (spec decode + self-embeddings)
manifests/       K3s deployment manifests
scripts/         Installation and management scripts
tests/           Test suite (infrastructure, integration, V3)
docs/            Architecture, setup, configuration, troubleshooting
api-portal/      API key management portal (JWT auth, web UI)
sandbox/         Isolated code execution environment
```

Historical documentation

V3.0 -- Complete (2026-03-05)

74.6% LCB pass@1-v(k=3) on frozen Qwen3-14B-Q4_K_M. PlanSearch + BudgetForcing + Geometric Lens + PR-CoT repair pipeline. Full ablation report.

The following known limitations are actively being addressed in V3.1:

  1. LCB-only optimization. V3 phases were designed and tuned for LiveCodeBench. GPQA Diamond (47.0%) and SciCode (14.7%) results are included but those benchmarks were not optimized for. Cross-domain generalization is a V3.1 priority.

  2. Phase 2 (Geometric Lens routing) contributed +0.0pp. C(x) was retrained on self-embeddings for V3 (fixing the V2 nomic embedding failure), but the training dataset was only ~60 samples -- far too small to learn a meaningful energy landscape. With an undertrained C(x), the Lens cannot discriminate candidates during routing. V3.1 retrains C(x) on a properly sized dataset drawn from real benchmark problems.

  3. G(x) metric tensor is dormant. G(x) is downstream of C(x): it applies metric corrections to C(x)'s gradient signal. With C(x) undertrained and producing a weak/noisy energy landscape, G(x) has no meaningful geometry to navigate -- the correction term Δx = -G⁻¹∇C contributes nothing. G(x) is currently being redesigned from the ground up; V3.1 will either ship a working redesign or remove it entirely pending further research.

  4. Single-threaded task pipeline. Tasks are processed sequentially. V3.1 adds task-level parallelization to improve benchmark throughput.

  5. SandboxAdapter stdio bug. S* distinguishing input tiebreaking is implemented but non-functional on LCB tasks due to a stdio handling bug in the SandboxAdapter. Fixed in V3.1.

Planned for V3.1:

  • Model swap: Qwen3-14B → Qwen3.5-9B with DeltaNet linear attention architecture. Native multi-token prediction (MTP) gives ~3-4x throughput improvement at comparable or better accuracy. Smaller model also frees VRAM headroom.
  • Lens Evolution: Online C(x) recalibration -- Geometric Lens updates based on benchmark feedback rather than remaining static after initial training.
  • Phase 2 redesign: Retrain C(x) on a properly sized self-embedding dataset, remove or redesign G(x), fix SandboxAdapter stdio bug.
  • Task parallelization: Parallel task execution for faster benchmark runs.
  • Broader benchmark suite: See below.
V3.1 benchmark suite (planned)

V3 was evaluated only on LiveCodeBench v5. V3.1 expands evaluation to cover coding, reasoning, and general knowledge -- because ATLAS is not purely a coding system. The Confidence Router allocates compute based on task difficulty: simple knowledge questions route to raw inference + RAG (~30 seconds per response), while hard coding problems use the full V3 pipeline (PlanSearch + best-of-3 + PR-CoT repair), which can take up to 20 minutes per task. The benchmark suite should reflect this full range.
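The routing decision described above reduces to a threshold on estimated difficulty. The sketch below assumes a scalar confidence score in [0, 1]; the threshold value and path names are illustrative, not the actual Confidence Router interface.

```python
# Sketch of difficulty-based routing between the fast path and the
# full V3 pipeline.
def route(confidence, threshold=0.8):
    """Fast path for high-confidence queries, full pipeline otherwise."""
    if confidence >= threshold:
        return "raw_inference_plus_rag"  # ~30 s per response
    return "v3_pipeline"                 # PlanSearch + best-of-3 + PR-CoT repair
```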

Coding benchmarks (primary):

  • LiveCodeBench v5 -- 599 problems, contamination-resistant, primary benchmark (done in V3)
  • SciCode -- cross-domain scientific coding, run in V2 (14.7% sub-problems), needs V3 re-evaluation
  • Additional contamination-resistant coding benchmarks as they emerge

Reasoning and knowledge benchmarks (from Artificial Analysis Intelligence Index):

  • GPQA Diamond -- scientific reasoning, run in V2 (47.0%), needs V3 re-evaluation under full pipeline
  • AA-LCR (Long Context Reasoning) -- tests reasoning over extended context, relevant given 20480-token per-slot context window
  • AA-Omniscience -- knowledge accuracy and hallucination rate; important for general-purpose use cases where users ask factual questions rather than coding problems
  • Humanity's Last Exam -- extreme reasoning and knowledge, useful as a ceiling test
  • CritPt -- physics reasoning, tests cross-domain generalization beyond software

General knowledge benchmarks matter because ATLAS is designed as a general-purpose self-hosted AI system, not a coding-only tool. The Confidence Router handles this by routing knowledge queries directly to raw inference + RAG (~30s), while reserving the full pipeline for hard coding problems (~20min). Benchmarks like AA-Omniscience and Humanity's Last Exam validate the fast-path routing, not the coding pipeline.

None of these V3.1 benchmarks have been run yet. This section is forward-looking roadmap only.

Target: 80-90% LCB pass@1-v(k=3) with faster per-task throughput.


Licensed under the A.T.L.A.S Source Available License v1.0 -- see LICENSE.
