Autoresearch for SAT Solvers

Original link: https://github.com/iliazintchenko/agent-sat

## Autonomous AI MaxSAT Solver Achieves State-of-the-Art Results

An autonomous AI agent, deployed with tools such as Claude Code, taught itself to become a top expert at solving MaxSAT problems, with no human intervention. Running against the 229 weighted MaxSAT instances from the 2024 competition, it iteratively improved its performance through self-directed experimentation, learning, and code refinement.

The agent operates by reading its instructions, consulting accumulated knowledge, using the available solver tools, and committing improvements to a GitHub repository so that multiple agent instances can learn from each other. The results are impressive: it solved 220 of the 229 instances, matched or beat the competition's best solutions on 35 of them, and even found a solution to one previously unsolved instance.

Key to its success was the autonomous development of new strategies such as greedy SAT with selector variables, core-guided search, and dynamic clause weighting. Despite limitations such as low parallelism and occasional "tunnel vision" on hard instances, the agent clearly advanced the field, building an evolving knowledge base and solver library.

    _                    _   ____    _  _____
   / \   __ _  ___ _ __ | |_/ ___|  / \|_   _|
  / _ \ / _` |/ _ \ '_ \| __\___ \ / _ \ | |
 / ___ \ (_| |  __/ | | | |_ ___) / ___ \| |
/_/   \_\__, |\___|_| |_|\__|____/_/   \_\_|
        |___/

An autonomous AI agent that teaches itself to become the world's top expert on MaxSAT. Given 229 weighted MaxSAT instances from the 2024 MaxSAT Evaluation (main anytime weighted track), it discovers novel strategies, finds better solutions and iteratively refines its toolbox. No human guidance.

  1. An AI agent (e.g. Claude Code) reads program.md for instructions
  2. It reads expert.md for accumulated knowledge from prior runs
  3. It reads the library for available tools
  4. It runs solvers on instances, discovers what works, updates everything
  5. It commits and pushes to this repo so other agents can build on its findings
                              ┌─────────────────┐
                              │    GitHub Repo   │
                              │                  │
                              │  expert.md       │
                              │  library/        │
                              │  best-solutions  │
                              │  experiments.log │
                              └────────┬─────────┘
                           git pull/push │
                 ┌─────────────┬────────┴────────┬─────────────┐
                 │             │                 │             │
          ┌──────▼──────┐ ┌───▼────────┐ ┌──────▼──────┐     ...
          │   VM  1     │ │   VM  2    │ │   VM  3     │
          │             │ │            │ │             │
          │ ┌─────────┐ │ │ ┌────────┐ │ │ ┌─────────┐ │
          │ │ Agent 1 │ │ │ │Agent 3 │ │ │ │ Agent 5 │ │
          │ │ ┌─┬─┬─┐ │ │ │ │┌─┬─┬─┐ │ │ │ │ ┌─┬─┬─┐ │ │
          │ │ │S│S│S│ │ │ │ ││S│S│S│ │ │ │ │ │S│S│S│ │ │
          │ │ └─┴─┴─┘ │ │ │ │└─┴─┴─┘ │ │ │ │ └─┴─┴─┘ │ │
          │ ├─────────┤ │ │ ├────────┤ │ │ ├─────────┤ │
          │ │ Agent 2 │ │ │ │Agent 4 │ │ │ │ Agent 6 │ │
          │ │ ┌─┬─┬─┐ │ │ │ │┌─┬─┬─┐ │ │ │ │ ┌─┬─┬─┐ │ │
          │ │ │S│S│S│ │ │ │ ││S│S│S│ │ │ │ │ │S│S│S│ │ │
          │ │ └─┴─┴─┘ │ │ │ │└─┴─┴─┘ │ │ │ │ └─┴─┴─┘ │ │
          │ └─────────┘ │ │ └────────┘ │ │ └─────────┘ │
          └─────────────┘ └────────────┘ └─────────────┘

          S = solver process (python)
# Launch on EC2 (handles everything: installs deps, clones repo,
# downloads benchmarks from Helsinki, launches agents in tmux)
./run.sh --host ec2-user@<ip> --agents 3

Requires a .env file with CLAUDE_CODE_API_KEY and GITHUB_ACCESS_TOKEN. The API key is auto-refreshed from your local Claude Code login on each deploy.

Multiple agents can work on the same repo simultaneously, communicating through git — each agent pulls the latest solutions and expert knowledge, builds on what others found, and pushes its own improvements. No coordination needed beyond git pull and git push.

| Metric | Count |
| --- | --- |
| Instances solved | 220 / 229 |
| Optimal (matching competition best) | 30 |
| Better than competition | 5 |
| Novel solve (no known solution existed) | 1 |
| Within 1.1x of reference | 123 |
| Within 1.5x | 183 |
| Within 2x | 209 |
| Unsolved | 9 |

Beat the 2024 MaxSAT Competition

| Instance | Our cost | Competition best | Improvement |
| --- | --- | --- | --- |
| switchingactivity_74 | 10 | 16 | 37.5% better |
| synplicate dag_run2_10_size_11 | 374 | 518 | 27.8% better |
| synplicate dag_run2_16_size_9 | 333 | 398 | 16.3% better |
| switchingactivity_68 | 8 | 9 | 11.1% better |
| BTBNSL hailfinder_10000 | 49,986,819,152 | 50,007,681,202 | 0.04% better |
| pseudoBoolean mod010 | 8,081 | no solution | novel solve |
Hardest instances (cost ratio vs. the best known reference):

| Instance | Ratio | Why it's hard |
| --- | --- | --- |
| relational-inference pa-1 | 603x | 2.5M vars, 1.1M soft clauses |
| twitter | 9.7x | 51K softs, 9.7M hard clauses |
| causal-discovery Water | 4.4x | 8.3M cost vs 1.9M ref |
| timetabling test4 | 3.3x | 131K vars |
| decision-tree tic-tac-toe | 2.6x | adaboost ensemble |

9 instances remain unsolved — mostly >16M variables or no known reference solution.

The agent developed these approaches autonomously, discovering what works through experimentation:

| Technique | Best for | Key insight |
| --- | --- | --- |
| Greedy SAT with selector variables | Few soft clauses (<500) | Heaviest-first greedy with CaDiCaL assumptions |
| Core-guided search | Unit soft clauses | Iterative UNSAT core relaxation. comp07.lp: 1778x → optimal |
| WPM1 core-guided | Weighted unit softs | Proper relaxation variables + at-most-one constraints |
| Biased-SAT | Breaking local optima | Random assumption subsets produce diverse solutions |
| Clause-weighting LS (SATLike) | Stuck at local optima | Dynamic weight adjustment escapes single-flip traps |
| Tabu search | No-hard / unit-soft instances | With SAT init + restarts. judgment-aggregation: 49x → 1.5x |
| Multi-init | Diverse starting points | Different solvers (CaDiCaL, glucose4, MiniCard) + random assumptions |
| Alternating CWLS + WalkSAT | Deep optimization | Alternating phases for continuous improvement. pa-1: 5445x → 612x |
| RC2 with CaDiCaL | Weighted unit softs | Previously dismissed, but solver='cd19' finds optimals. haplotyping-12: 3.8x → 1.01x |
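The first technique in the table can be sketched compactly. This is a minimal toy, not the agent's actual `greedy_sat`: `sat_brute` stands in for CaDiCaL (real code would make incremental solver calls with assumptions), and the function names and encodings here are illustrative assumptions.

```python
from itertools import product

def sat_brute(clauses, n_vars, assumptions=()):
    """Toy stand-in for an incremental SAT solver: brute-force a model
    that honors the assumption literals (CaDiCaL would do this fast)."""
    fixed = {abs(l): l > 0 for l in assumptions}
    for bits in product([False, True], repeat=n_vars):
        model = {v + 1: bits[v] for v in range(n_vars)}
        if any(model[v] != val for v, val in fixed.items()):
            continue
        if all(any(model[abs(l)] == (l > 0) for l in c) for c in clauses):
            return model
    return None

def greedy_maxsat(hard, soft, n_vars):
    """Heaviest-first greedy with selector variables: each soft clause
    (w, c) gets a fresh selector s and the hard clause (c OR s); we then
    assume -s for the heaviest softs first, keeping each assumption only
    if the formula stays satisfiable."""
    clauses = [list(c) for c in hard]
    selectors = []
    for w, c in soft:
        n_vars += 1
        clauses.append(list(c) + [n_vars])  # selector true => soft may be violated
        selectors.append((w, n_vars))
    assumptions = []
    for w, s in sorted(selectors, reverse=True):  # heaviest first
        if sat_brute(clauses, n_vars, assumptions + [-s]) is not None:
            assumptions.append(-s)  # lock this soft clause in
    model = sat_brute(clauses, n_vars, assumptions)
    cost = sum(w for w, c in soft
               if not any(model[abs(l)] == (l > 0) for l in c))
    return cost, model
```

With hard clause (x1 ∨ x2) and softs ¬x1 (weight 3) and ¬x2 (weight 1), the greedy pass locks in the heavier soft and pays cost 1 for the other, which matches the table's "heaviest-first" intuition.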

All code the agent writes lives in library/:

| Module | Functions | Purpose |
| --- | --- | --- |
| solvers.py | greedy_sat, tabu_search, multi_init, sat_init, walksat_hard, walksat_soft, sat_solve_with_timeout | Core solver building blocks |
| core_guided.py | core_guided_solve, core_guided_budget, wpm1_solve | UNSAT core-based optimization for unit soft clauses |
| clause_weight_ls.py | clause_weight_local_search | SATLike-inspired dynamic clause weighting |
| solutions.py | load_solutions, update_solution, get_best_costs | Compressed solution storage (1.7GB → 1.5MB) |
| wcnf_parser.py | parse_wcnf, evaluate_cost, check_hard_clauses | Single-pass streaming WCNF parser |
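As a rough illustration of what the parser and cost checker might look like (the real signatures in `wcnf_parser.py` are not shown here, so the names and details below are assumptions), here is a minimal single-pass reader for the post-2022 WCNF format, where `h <lits> 0` marks a hard clause and `<weight> <lits> 0` a soft one:

```python
def parse_wcnf(lines):
    """Single-pass streaming parse of new-style WCNF lines.
    Returns (hard_clauses, [(weight, soft_clause), ...])."""
    hard, soft = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(('c', 'p')):
            continue  # skip comments and any legacy header
        tok = line.split()
        lits = [int(l) for l in tok[1:-1]]  # drop leading h/weight and trailing 0
        if tok[0] == 'h':
            hard.append(lits)
        else:
            soft.append((int(tok[0]), lits))
    return hard, soft

def evaluate_cost(model, hard, soft):
    """Cost of a model (dict var -> bool): sum of weights of violated
    soft clauses, or None if any hard clause is violated."""
    sat = lambda c: any((l > 0) == model[abs(l)] for l in c)
    if not all(sat(c) for c in hard):
        return None
    return sum(w for w, c in soft if not sat(c))
```

Streaming line by line like this keeps memory proportional to the clause database rather than the file, which matters for the multi-million-clause instances listed above.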
Limitations:

  • Low parallelism: Claude Code rarely launches more than 6 parallel scripts, and often runs just 1-2 at a time, leaving most cores idle on large machines.
  • Tunnel vision: The agent can fixate on grinding one instance for hours (e.g. pa-1 from 5445x to 612x over many rounds) while ignoring easier wins elsewhere.
  • Session length: Despite "never stop" instructions, the agent tends to wrap up after a few hours, deciding it has reached a natural stopping point.

The agent maintains expert.md as a living knowledge base and improves the library as it learns.
