Autoresearch for SAT Solvers

Original link: https://github.com/iliazintchenko/agent-sat

## Autonomous AI MaxSAT Solver Achieves State-of-the-Art Results

An autonomous AI agent, deployed with tools such as Claude Code, taught itself to become a top expert at solving MaxSAT problems, with no human intervention. Running against the 229 weighted MaxSAT instances from the 2024 competition, it iteratively improved its performance through self-directed experimentation, learning, and code refinement.

The agent operates by reading its instructions, consulting accumulated knowledge, using the available solver tools, and committing improvements to a GitHub repository so that multiple agent instances can learn from each other. The results are impressive: it solved 220 of the 229 instances, matched or beat the competition's best solutions on 35 of them, and even found a solution to one previously unsolved instance.

Key to its success was the autonomous development of new strategies such as greedy SAT with selector variables, core-guided search, and dynamic clause weighting. Despite limitations such as low parallelism and occasional "tunnel vision" on hard instances, the agent clearly advanced the field, building an evolving knowledge base and solver library.

    _                    _   ____    _  _____
   / \   __ _  ___ _ __ | |_/ ___|  / \|_   _|
  / _ \ / _` |/ _ \ '_ \| __\___ \ / _ \ | |
 / ___ \ (_| |  __/ | | | |_ ___) / ___ \| |
/_/   \_\__, |\___|_| |_|\__|____/_/   \_\_|
        |___/

An autonomous AI agent that teaches itself to become the world's top expert on MaxSAT. Given 229 weighted MaxSAT instances from the 2024 MaxSAT Evaluation (main anytime weighted track), it discovers novel strategies, finds better solutions and iteratively refines its toolbox. No human guidance.

  1. An AI agent (e.g. Claude Code) reads program.md for instructions
  2. It reads expert.md for accumulated knowledge from prior runs
  3. It reads the library for available tools
  4. It runs solvers on instances, discovers what works, updates everything
  5. It commits and pushes to this repo so other agents can build on its findings
                              ┌─────────────────┐
                              │    GitHub Repo   │
                              │                  │
                              │  expert.md       │
                              │  library/        │
                              │  best-solutions  │
                              │  experiments.log │
                              └────────┬─────────┘
                           git pull/push │
                 ┌─────────────┬────────┴────────┬─────────────┐
                 │             │                 │             │
          ┌──────▼──────┐ ┌───▼────────┐ ┌──────▼──────┐     ...
          │   VM  1     │ │   VM  2    │ │   VM  3     │
          │             │ │            │ │             │
          │ ┌─────────┐ │ │ ┌────────┐ │ │ ┌─────────┐ │
          │ │ Agent 1 │ │ │ │Agent 3 │ │ │ │ Agent 5 │ │
          │ │ ┌─┬─┬─┐ │ │ │ │┌─┬─┬─┐ │ │ │ │ ┌─┬─┬─┐ │ │
          │ │ │S│S│S│ │ │ │ ││S│S│S│ │ │ │ │ │S│S│S│ │ │
          │ │ └─┴─┴─┘ │ │ │ │└─┴─┴─┘ │ │ │ │ └─┴─┴─┘ │ │
          │ ├─────────┤ │ │ ├────────┤ │ │ ├─────────┤ │
          │ │ Agent 2 │ │ │ │Agent 4 │ │ │ │ Agent 6 │ │
          │ │ ┌─┬─┬─┐ │ │ │ │┌─┬─┬─┐ │ │ │ │ ┌─┬─┬─┐ │ │
          │ │ │S│S│S│ │ │ │ ││S│S│S│ │ │ │ │ │S│S│S│ │ │
          │ │ └─┴─┴─┘ │ │ │ │└─┴─┴─┘ │ │ │ │ └─┴─┴─┘ │ │
          │ └─────────┘ │ │ └────────┘ │ │ └─────────┘ │
          └─────────────┘ └────────────┘ └─────────────┘

          S = solver process (python)
# Launch on EC2 (handles everything: installs deps, clones repo,
# downloads benchmarks from Helsinki, launches agents in tmux)
./run.sh --host ec2-user@<ip> --agents 3

Requires a .env file with CLAUDE_CODE_API_KEY and GITHUB_ACCESS_TOKEN. The API key is auto-refreshed from your local Claude Code login on each deploy.

Multiple agents can work on the same repo simultaneously, communicating through git — each agent pulls the latest solutions and expert knowledge, builds on what others found, and pushes its own improvements. No coordination needed beyond git pull and git push.

| Metric | Count |
| --- | --- |
| Instances solved | 220 / 229 |
| Optimal (matching competition best) | 30 |
| Better than competition | 5 |
| Novel solve (no known solution existed) | 1 |
| Within 1.1x of reference | 123 |
| Within 1.5x | 183 |
| Within 2x | 209 |
| Unsolved | 9 |

Beat the 2024 MaxSAT Competition

| Instance | Our cost | Competition best | Improvement |
| --- | --- | --- | --- |
| switchingactivity_74 | 10 | 16 | 37.5% better |
| synplicate dag_run2_10_size_11 | 374 | 518 | 27.8% better |
| synplicate dag_run2_16_size_9 | 333 | 398 | 16.3% better |
| switchingactivity_68 | 8 | 9 | 11.1% better |
| BTBNSL hailfinder_10000 | 49,986,819,152 | 50,007,681,202 | 0.04% better |
| pseudoBoolean mod010 | 8,081 | no solution | novel solve |
Hardest instances (cost ratio vs. the best known reference):

| Instance | Ratio | Why it's hard |
| --- | --- | --- |
| relational-inference pa-1 | 603x | 2.5M vars, 1.1M soft clauses |
| twitter | 9.7x | 51K softs, 9.7M hard clauses |
| causal-discovery Water | 4.4x | 8.3M cost vs 1.9M ref |
| timetabling test4 | 3.3x | 131K vars |
| decision-tree tic-tac-toe | 2.6x | adaboost ensemble |

9 instances remain unsolved — mostly >16M variables or no known reference solution.

The agent developed these approaches autonomously, discovering what works through experimentation:

| Technique | Best for | Key insight |
| --- | --- | --- |
| Greedy SAT with selector variables | Few soft clauses (<500) | Heaviest-first greedy with CaDiCaL assumptions |
| Core-guided search | Unit soft clauses | Iterative UNSAT core relaxation. comp07.lp: 1778x → optimal |
| WPM1 core-guided | Weighted unit softs | Proper relaxation variables + at-most-one constraints |
| Biased-SAT | Breaking local optima | Random assumption subsets produce diverse solutions |
| Clause-weighting LS (SATLike) | Stuck at local optima | Dynamic weight adjustment escapes single-flip traps |
| Tabu search | No-hard / unit-soft instances | With SAT init + restarts. judgment-aggregation: 49x → 1.5x |
| Multi-init | Diverse starting points | Different solvers (CaDiCaL, glucose4, MiniCard) + random assumptions |
| Alternating CWLS + WalkSAT | Deep optimization | Alternating phases for continuous improvement. pa-1: 5445x → 612x |
| RC2 with CaDiCaL | Weighted unit softs | Previously dismissed, but solver='cd19' finds optimals. haplotyping-12: 3.8x → 1.01x |
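The first technique in the table can be sketched compactly. This is a minimal toy, not the agent's actual `greedy_sat`: `sat_brute` stands in for CaDiCaL (real code would make incremental solver calls with assumptions), and the function names and encodings here are illustrative assumptions.

```python
from itertools import product

def sat_brute(clauses, n_vars, assumptions=()):
    """Toy stand-in for an incremental SAT solver: brute-force a model
    that honors the assumption literals (CaDiCaL would do this fast)."""
    fixed = {abs(l): l > 0 for l in assumptions}
    for bits in product([False, True], repeat=n_vars):
        model = {v + 1: bits[v] for v in range(n_vars)}
        if any(model[v] != val for v, val in fixed.items()):
            continue
        if all(any(model[abs(l)] == (l > 0) for l in c) for c in clauses):
            return model
    return None

def greedy_maxsat(hard, soft, n_vars):
    """Heaviest-first greedy with selector variables: each soft clause
    (w, c) gets a fresh selector s and the hard clause (c OR s); we then
    assume -s for the heaviest softs first, keeping each assumption only
    if the formula stays satisfiable."""
    clauses = [list(c) for c in hard]
    selectors = []
    for w, c in soft:
        n_vars += 1
        clauses.append(list(c) + [n_vars])  # selector true => soft may be violated
        selectors.append((w, n_vars))
    assumptions = []
    for w, s in sorted(selectors, reverse=True):  # heaviest first
        if sat_brute(clauses, n_vars, assumptions + [-s]) is not None:
            assumptions.append(-s)  # lock this soft clause in
    model = sat_brute(clauses, n_vars, assumptions)
    cost = sum(w for w, c in soft
               if not any(model[abs(l)] == (l > 0) for l in c))
    return cost, model
```

With hard clause (x1 ∨ x2) and softs ¬x1 (weight 3) and ¬x2 (weight 1), the greedy pass locks in the heavier soft and pays cost 1 for the other, which matches the table's "heaviest-first" intuition.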

All code the agent writes lives in library/:

| Module | Functions | Purpose |
| --- | --- | --- |
| solvers.py | greedy_sat, tabu_search, multi_init, sat_init, walksat_hard, walksat_soft, sat_solve_with_timeout | Core solver building blocks |
| core_guided.py | core_guided_solve, core_guided_budget, wpm1_solve | UNSAT core-based optimization for unit soft clauses |
| clause_weight_ls.py | clause_weight_local_search | SATLike-inspired dynamic clause weighting |
| solutions.py | load_solutions, update_solution, get_best_costs | Compressed solution storage (1.7GB → 1.5MB) |
| wcnf_parser.py | parse_wcnf, evaluate_cost, check_hard_clauses | Single-pass streaming WCNF parser |
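As a rough illustration of what the parser and cost checker might look like (the real signatures in `wcnf_parser.py` are not shown here, so the names and details below are assumptions), here is a minimal single-pass reader for the post-2022 WCNF format, where `h <lits> 0` marks a hard clause and `<weight> <lits> 0` a soft one:

```python
def parse_wcnf(lines):
    """Single-pass streaming parse of new-style WCNF lines.
    Returns (hard_clauses, [(weight, soft_clause), ...])."""
    hard, soft = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(('c', 'p')):
            continue  # skip comments and any legacy header
        tok = line.split()
        lits = [int(l) for l in tok[1:-1]]  # drop leading h/weight and trailing 0
        if tok[0] == 'h':
            hard.append(lits)
        else:
            soft.append((int(tok[0]), lits))
    return hard, soft

def evaluate_cost(model, hard, soft):
    """Cost of a model (dict var -> bool): sum of weights of violated
    soft clauses, or None if any hard clause is violated."""
    sat = lambda c: any((l > 0) == model[abs(l)] for l in c)
    if not all(sat(c) for c in hard):
        return None
    return sum(w for w, c in soft if not sat(c))
```

Streaming line by line like this keeps memory proportional to the clause database rather than the file, which matters for the multi-million-clause instances listed above.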
Limitations:

  • Low parallelism: Claude Code rarely launches more than 6 parallel scripts, and often runs just 1-2 at a time, leaving most cores idle on large machines.
  • Tunnel vision: The agent can fixate on grinding one instance for hours (e.g. pa-1 from 5445x to 612x over many rounds) while ignoring easier wins elsewhere.
  • Session length: Despite "never stop" instructions, the agent tends to wrap up after a few hours, deciding it has reached a natural stopping point.

The agent maintains expert.md as a living knowledge base and improves the library as it learns.
