Advancing AI Benchmarking with Game Arena

Original link: https://blog.google/innovation-and-ai/models-and-research/google-deepmind/kaggle-game-arena-updates/

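Kaggle is launching a new series of AI benchmarks and tournaments that test models at chess, Werewolf, and now poker. Chess demands reasoning and Werewolf relies on social deduction, while poker uniquely challenges AI with **risk management under imperfect information**. To succeed, models must analyze probabilities and opponent behavior and overcome the inherent luck of the deal. A new poker benchmark and a Heads-Up No-Limit Texas Hold'em tournament will determine the top AI players, with the leaderboard revealed at kaggle.com/game-arena on Feb 4. To mark the launch, Kaggle has partnered with game experts Hikaru Nakamura, Nick Schulman, Doug Polk, and Liv Boeree on daily livestreams (9:30 AM PT) from Feb 2 to Feb 4, featuring matches and analysis. Explore the full arena and learn more at kaggle.com/game-arena.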

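Google's blog post discusses progress in AI benchmarking with “Game Arena”, a platform that evaluates AI agents through gameplay. The Hacker News discussion centers on what would count as a meaningful benchmark for artificial general intelligence (AGI). One commenter proposed that an AGI should be able to complete a modern RPG or FPS game using only visual and audio input, with no prior training on that specific game. Another highlighted “code duels”, a benchmark in which AI agents *create* agents that play games such as poker, pitting a Claude-designed agent against a GPT-designed one. The conversation also touched on the challenge of benchmarking high-variance games like poker (which requires massive datasets) and on the value of precomputing optimal strategies. Users shared their own experiences as well, noting Gemini's recent improvements in intent analysis and suggesting complex games such as NetHack as potential benchmarks.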
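To make the variance point concrete, here is a rough sketch of why a poker benchmark needs so many hands. It is not taken from the blog post or from Kaggle's methodology: the function name `hands_needed`, the normal approximation, the assumption of independent hands, and every number in the usage example are illustrative assumptions.

```python
import math
from statistics import NormalDist


def hands_needed(edge_per_hand: float, std_per_hand: float, confidence: float = 0.95) -> int:
    """Estimate how many hands are needed before the confidence interval
    around the measured win rate is as narrow as the edge being detected.

    Assumes independent hands and a normal approximation (both simplifications).
    edge_per_hand and std_per_hand must use the same unit, e.g. big blinds.
    """
    if edge_per_hand <= 0 or std_per_hand <= 0:
        raise ValueError("edge_per_hand and std_per_hand must be positive")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Half-width of the interval is z * std / sqrt(n); solve for n at half-width == edge.
    return math.ceil((z * std_per_hand / edge_per_hand) ** 2)


if __name__ == "__main__":
    # Hypothetical figures: a 0.05 big-blind-per-hand edge against a
    # 10 big-blind-per-hand standard deviation.
    print(hands_needed(edge_per_hand=0.05, std_per_hand=10.0))
```

Because the required sample grows with the square of `std_per_hand / edge_per_hand`, halving the edge to be detected roughly quadruples the number of hands, which is one reason small skill gaps between models are expensive to measure.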

Original text

Chess relies on reasoning. Werewolf relies on social deduction. Poker introduces a new dimension: risk management. Like Werewolf, poker is a game of imperfect information. But here, the challenge isn't about building alliances — it's about quantifying uncertainty. Models must overcome the luck of the deal by inferring their opponents' hands and adapting to their playing styles to determine the best move.
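As a small illustration of what quantifying uncertainty can look like at the table, the sketch below computes the expected value of calling a bet from an estimated win probability. It is only a textbook pot-odds calculation, not how Game Arena or any model evaluates hands. The function names and chip amounts are hypothetical, and it ignores ties and later betting rounds.

```python
def call_ev(win_prob: float, pot: float, to_call: float) -> float:
    """Expected chip gain from calling.

    pot is the amount already in the middle, including the opponent's bet;
    to_call is what we must add. With probability win_prob we win the pot,
    otherwise we lose our call. Ties and future betting are ignored.
    """
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win_prob must be between 0 and 1")
    return win_prob * pot - (1.0 - win_prob) * to_call


def break_even_equity(pot: float, to_call: float) -> float:
    """Minimum win probability at which calling stops losing chips on average."""
    return to_call / (pot + to_call)


if __name__ == "__main__":
    # Hypothetical spot: 100 chips in the pot, the opponent bets 50, so the pot is 150.
    pot, to_call = 150.0, 50.0
    print(break_even_equity(pot, to_call))
    print(call_ev(0.40, pot, to_call))
```

In this hypothetical spot the caller needs to win at least a quarter of the time to break even, so the decision turns on how well the model can estimate `win_prob` from the opponent's likely hands and playing style.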

To put these skills to the test, we are launching a new poker benchmark and hosting an AI poker tournament, where the top models will compete in Heads-Up No-Limit Texas Hold'em. The final poker leaderboard will be revealed at kaggle.com/game-arena on Wednesday, Feb 4, following the conclusion of the tournament finals.

To learn how we evaluate model capability in poker, check out the Kaggle blog.

Watch the action

Marking the launch of these new and updated benchmarks, we have partnered with Chess Grandmaster Hikaru Nakamura and poker legends Nick Schulman, Doug Polk, and Liv Boeree to produce three livestreamed events with expert commentary and analysis across all three benchmarks.

Tune in to the three daily livestreams at 9:30 AM PT at kaggle.com/game-arena:

  • Monday, Feb 2: The top eight models on the poker leaderboard face off in the AI poker battle.
  • Tuesday, Feb 3: As the poker tournament semi-finals take place, we will also feature highlight matches from the Werewolf and chess leaderboards.
  • Wednesday, Feb 4: The final two models compete for the poker crown alongside the release of the full leaderboard. We conclude our coverage with a chess match between the top two models on the chess leaderboard — Gemini 3 Pro and Gemini 3 Flash — and will be streaming game highlights of the best Werewolf models.

Explore the arena

Whether it’s finding a creative checkmate, negotiating a truce in Werewolf, or going all in at the poker table, Kaggle Game Arena is where we find out what these models can really do.

Check it out at kaggle.com/game-arena.
