You're probably using Agent Skills wrong

Original link: https://notes.ansonbiggs.com/youre-probably-using-agent-skills-wrong/

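This post critiques the SkillsBench paper, which concludes that AI self-generated "Skills" are ineffective. The author argues that the study's method is flawed because it treats "self-generation" as a prompt-driven thinking block rather than as a structured tool for capturing knowledge. Used badly, Skills really are redundant; implemented well, they are essential for managing stateless agents. The author lists three main uses for Skills:

1. **Context management:** filling knowledge gaps in large projects (such as monorepos), where global instructions (such as *CLAUDE.md*) often fall short.
2. **Efficiency:** automating repetitive, recurring workflows.
3. **Hard problem solving:** codifying lessons learned from past failures so the agent does not repeat the same mistakes.

Ultimately, the author stresses that a Skill is only valuable when it provides information a "fresh" model does not already have. Effective Skills need to be curated from real project challenges rather than produced by asking the agent to generate procedures on the spot. Treated as deliberate documentation and tooling, Skills can significantly improve agent performance.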

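Hacker News discussion: You're probably using Agent Skills wrong (ansonbiggs.com), posted by MisterBiggs 1 hour ago, 9 points, 2 comments

theowaway213456, 23 minutes ago: TL;DR: don't let your agent write a Skill from its latent knowledge alone; otherwise you might as well have it draw on that latent knowledge at run time and skip the "Skill" entirely. I'm not sure that view is right, though. I suspect auto-generated Skills might help the agent avoid "decompressing" its latent knowledge, which might save tokens? I don't know, I'm not an expert.

bigcat12345678, 14 minutes ago (in reply): I now have a rule against letting agents write any documentation or procedures. Basically, nothing an LLM auto-generates is worth reusing.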

Original article

But thankfully for you, I'm smart enough to show you how it's done

[Image: Closeup of a honeycomb with some intense orange lighting. IDK what I was going for with this when I originally picked it. Photo by Lin Dai / Unsplash]

The entire ecosystem around Claude Code is pretty confusing: the naming conventions are a mess and the pace of change is beyond any production tool I've seen. However, Skills are probably the most misused part. I see it at work a ton, but a paper just came up on Hacker News:

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.

The paper that got me fired up enough to write this post

For some reason the HN title is editorialized to "Study: Self-generated Agent Skills are useless", but it immediately grabbed me, since I get massive value from Skills written by Agents but also consistently see them misused by my peers. The concept is great; I've been looking at benchmarking specific parts of the agentic ecosystem myself, so this was highly relevant to me. Overall the paper is decent, but one bullet invalidates the whole thing:

Self-Generated Skills: No Skills provided, but the agent is prompted to generate relevant procedural knowledge before solving the task. This isolates the impact of LLMs’ latent domain knowledge.

So all they are doing is taking a problem that a model can't solve well on its own, and asking it to write about the task before attempting it. They just reinvented thinking blocks but worse!
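
A rough sketch of why that condition adds nothing, based only on the bullet quoted above. `Model` is a hypothetical prompt-in, text-out callable, and both function names and prompt strings are made up for illustration; this is not the benchmark's actual harness.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: prompt in, text out.
Model = Callable[[str], str]


def solve_with_self_generated_skill(model: Model, task: str) -> str:
    # The "skill" is derived from the task text alone, so it can only
    # contain what the model already knew before attempting anything.
    skill = model(f"Write procedural knowledge for this task:\n{task}")
    return model(f"{skill}\n\nNow solve:\n{task}")


def solve_with_gap_informed_skill(model: Model, task: str, lessons: List[str]) -> str:
    # `lessons` come from outside the fresh model: a failed attempt,
    # a human correction, or project documentation.
    skill = "\n".join(f"- {lesson}" for lesson in lessons)
    return model(f"{skill}\n\nNow solve:\n{task}")
```

In the first function every input to the final call traces back to the same model and the same task text. In the second, `lessons` is the only place new information can enter, which is the part worth spending effort on.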

The Skill Anti-Pattern

What they did is a very common mistake that I see constantly: my Agent is bad at this thing, so I ask the Agent to write a Skill on this thing. I'll reiterate that this is identical to thinking blocks. For your Agent to create something worthwhile, you have to make sure it can see the gaps. It's like the classic CS intro exercise where you ask someone to write out the steps to make a PB&J: you don't really understand what makes the problem hard until you've struggled through solving it.

This leads directly into the largest faux pas of the AI era: asking an LLM someone else's question verbatim and pasting the LLM's answer as your response. If I ask you how you did something cool with an Agent, and you just have a fresh Agent build me a SKILL.md on my question on the fly, I will kill you.

What are Skills

Before getting into proper usage, I just want to cover what Skills are. As a primitive, they are just markdown files with some metadata at the top to help Agents/Tools know when to use them; the rest of the document is the Skill. Each Skill has its own folder, so it can not only teach your Agent how to do something but also give it better tools.
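
To make "metadata at the top" concrete, here is a minimal sketch that splits a skill file into its metadata and body. The example file contents, the `name` and `description` fields, and the `split_skill` helper are assumptions for illustration; the parser handles only flat `key: value` lines and is not a full YAML parser.

```python
from typing import Dict, Tuple

# Hypothetical SKILL.md contents, made up for illustration.
EXAMPLE_SKILL_MD = """\
---
name: monitor-gitlab-ci
description: Watch a GitLab CI pipeline until a job fails or everything passes.
---
# Monitor GitLab CI

Run ./monitor_ci.sh instead of writing a new polling script.
"""


def split_skill(text: str) -> Tuple[Dict[str, str], str]:
    """Split a skill file into (metadata, body).

    Expects the metadata between two `---` marker lines at the top.
    Files without that block are returned as ({}, text).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Index of the closing marker, or None if the block is never closed.
    end = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if end is None:
        return {}, text
    metadata = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body


metadata, body = split_skill(EXAMPLE_SKILL_MD)
```

The metadata is what a tool can scan cheaply to decide whether the Skill applies; the body only needs to be read once it does.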

.claude/skills/
└── monitor-gitlab-ci/
    ├── SKILL.md # The file mentioned above
    ├── monitor_ci.sh # Complicated command
    └── references/ # Additional references 
        ├── api_commands.md
        ├── log_analysis.md
        └── troubleshooting.md

Above is a Skill I used a ton to let older versions of Claude work on my GitLab CI. It's a folder with three parts: a simple markdown Skill that explains the setup and tells the Agent to watch the CI until either a job fails or everything passes, a simple CLI that keeps the Agent from writing its own script, and additional references for edge cases.
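
The "watch until a job fails or everything passes" loop could look something like the sketch below. This is not the contents of monitor_ci.sh, which the post does not show. `watch_pipeline` and `fetch_statuses` are hypothetical names, and the status strings are assumptions; the fetcher is injected so the loop can be exercised without a real CI server.

```python
import time
from typing import Callable, Dict

# Assumed job status strings; adjust to whatever your CI actually reports.
FAILED = {"failed"}
PASSED = {"success", "skipped"}


def watch_pipeline(
    fetch_statuses: Callable[[], Dict[str, str]],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a job fails ("failed") or every job passes ("passed").

    Returns "timeout" if neither happens within max_polls polls.
    """
    for _ in range(max_polls):
        statuses = fetch_statuses()
        failed = sorted(name for name, state in statuses.items() if state in FAILED)
        if failed:
            # Stop at the first failure so the Agent can read the logs.
            print("failed jobs:", ", ".join(failed))
            return "failed"
        if statuses and all(state in PASSED for state in statuses.values()):
            return "passed"
        sleep(poll_seconds)
    return "timeout"


# Usage with canned snapshots instead of a real CI API.
snapshots = iter([
    {"build": "running", "test": "pending"},
    {"build": "success", "test": "running"},
    {"build": "success", "test": "failed"},
])
result = watch_pipeline(lambda: next(snapshots), sleep=lambda _: None)
```

Shipping a small tool like this alongside the markdown means the Agent calls one command with a clear exit condition instead of improvising a new polling script in every session.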

Skills for Context

Agents are completely stateless, meaning that every new conversation is like meeting the model for the first time: it has no idea what your project is or what you were working on 10 minutes ago. CLAUDE.md does a lot to fix this, but for a large enough project it can't contain everything. If I open up a monorepo and tell Claude to run a SIL test, it is going to have to run around to figure out how to do that. It has to figure out what language the project is in, then look for common test patterns for that language; it's going to see a complicated Docker Compose setup, it's going to see that the containers need x86 but we're running on a Mac, then it's going to look for CI, etc.

This can all be solved by writing Skills for common but not universal patterns. Any time a model struggles to do something in your project that you know is simple and basic, tell it to make a Skill covering the gaps in knowledge it had to fill to complete that task.
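
As a sketch of what "a Skill covering the gaps" might reduce to on disk, the helper below writes a SKILL.md from a list of gaps recorded after a struggle. `write_skill`, the skill name, the section heading, and the gap wording are all hypothetical; only the `.claude/skills/<name>/SKILL.md` layout comes from the folder shown earlier.

```python
import tempfile
from pathlib import Path
from typing import List


def write_skill(root: Path, name: str, description: str, gaps: List[str]) -> Path:
    """Create <root>/.claude/skills/<name>/SKILL.md from recorded gaps."""
    if not gaps:
        # With no recorded gaps, the file would hold nothing a fresh model lacks.
        raise ValueError("record at least one gap before writing a Skill")
    skill_dir = root / ".claude" / "skills" / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    lines = [
        "---",
        f"name: {name}",
        f"description: {description}",
        "---",
        "",
        "## What a fresh session does not know",
        "",
    ]
    lines += [f"- {gap}" for gap in gaps]
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return skill_file


# Usage in a throwaway directory; the gap notes are illustrative.
with tempfile.TemporaryDirectory() as tmp:
    path = write_skill(
        Path(tmp),
        "run-sil-test",
        "How to run the SIL test in this monorepo.",
        [
            "The SIL containers need x86, but development machines are Macs.",
            "The test setup lives in a Docker Compose file, not in the language's usual test runner.",
        ],
    )
    text = path.read_text(encoding="utf-8")
```

The empty-gaps check is the point of the sketch: the file is only worth writing once there is something in it that the model had to learn the hard way.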

Skills for Repetition

Another simple use for Skills is to explain tasks that you do often. For instance, I often tell my Agents to make sure my docs/, MR description, Issue, and codebase are all in alignment, so I made a simple Skill for it to keep me from typing it out every time.

Skills for Hard Problems

Claude can solve some really hard problems, but it might take $500 in tokens and you might have to yell at it for reward hacking a few times. Almost any time I have to intervene on a problem, once the Agent is unstuck I ask it what gap kept it from figuring the problem out on its own. Sometimes it's something silly, but sometimes it is something genuinely insightful, and I have Claude make a Skill to fill the gap.

Conclusion

I edited the original benchmark to do Skills my way and the results were as I suspected: the Agents nailed the test with proper Skills. I don't have the money to fully validate this result, but the first pass was good enough for me to be happy. I think this essentially doubles the amount of data needed for the benchmark, so I assume that's why the authors didn't include this method.

Remember, there are two reasons to make a Skill: remembering a novel problem, and avoiding repetition. If you are just making a fresh session with your Agent and asking for a Skill on x, then it's probably of no value. The Skill needs to contain something the fresh model doesn't know, which can come from your prompt explaining a common process, a compilation of knowledge gained from a hard problem, or even having the Agent go off and do its own research on something that isn't novel.

Happy Hacking.
