
TL;DR
Today we’re releasing two research assistants: Quick Assistant and Research Assistant (previously named Ki during beta).
Kagi’s Research Assistant happened to top a popular benchmark (SimpleQA) when we ran it in August 2025. This was a happy accident. We’re building our research assistants to be useful products, not maximize benchmark scores.
Kagi Quick Assistant and Research Assistant (documentation here) are Kagi’s flagship research assistants. We’re building our research assistants with our philosophy on using AI in our products in mind: *Humans should be at the center of the experience,* and AI should enhance, not replace the search experience. We know that LLMs are prone to bullshitting, but they’re incredibly useful tools when built into a product with their failings in mind.
Our assistants use different base models for specific tasks. We continuously benchmark top-performing models and select the best one for each job, so you don’t have to.
Their main strength is research: identifying what to search for, executing multiple simultaneous searches (in different languages, if needed), and synthesising the findings into high-quality answers.
The Quick Assistant (available on all plans) optimises for speed, providing direct and concise answers. The Research Assistant focuses on depth and diversity, conducting exhaustive analysis for thorough results.
We’re working on tools like research assistants because we find them useful. We hope you find them useful too. We’re not planning to force AI onto our users or products. We try to build tools because we think they’ll empower the humans that use them.
Accessible from any search bar
You can access the Quick Assistant and Research Assistant (Ultimate tier only) from the Kagi Assistant webapp.
But they are also accessible from bangs, directly in your search bar:
- `?` calls Quick Answer. For example: `Best current football team?`
- `!quick` calls the Quick Assistant. The query would look like `Best current football team !quick`
- `!research` calls the Research Assistant. You would use `Best current football team !research`
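For illustration, here is a minimal sketch of building a search URL that carries one of these bangs. It assumes Kagi’s standard `https://kagi.com/search?q=` URL format; the bang is interpreted exactly as it would be if typed into the search bar:

```python
from urllib.parse import quote_plus

def kagi_bang_url(query: str, bang: str = "!quick") -> str:
    """Build a Kagi search URL that appends an assistant bang to the query."""
    # Assumption: the standard Kagi search endpoint; bangs are parsed
    # server-side just like queries typed into the search bar.
    return f"https://kagi.com/search?q={quote_plus(f'{query} {bang}')}"

print(kagi_bang_url("Best current football team"))
# https://kagi.com/search?q=Best+current+football+team+%21quick
```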
Quick Assistant is expected to answer in less than 5 seconds and its cost will be negligible. Research Assistant can be expected to take over 20 seconds of research and have a higher cost against our fair use policy.
Assistants in action
The research assistant should massively reduce the time it takes to find information. Here it is in action:

The research assistant calls various tools as it researches the answer. The tools called are in the purple dropdown boxes in the screenshot, which you can open up to look into the search results:

Our full research assistant comfortably holds its own against competing “deep research” agents in accuracy, but it is best described as a “Deep Search” agent. Since deep research tools became popular, they have converged on a long, report-style output format.
Long reports are not the best format for answering most questions, even those that require a lot of research.
What we do focus on, however, is the verifiability of the generated answer. Answers from Kagi’s research assistants are expected to be sourced and referenced. We even attribute each citation’s relevance to the final answer:

If we want to enhance the human search experience with LLM-based tools, the experience should not stop with blindly trusting text generated by an LLM. Our design should aim to encourage humans to look further into the answer, to accelerate their research process.
The design should not replace the research process by encouraging humans to disengage from thinking about the question at hand.
Other tools
The research assistant has access to many other tools beyond web search and retrieval, such as running code to check calculations, generating images, and calling specific APIs like Wolfram Alpha, news, or location-specific searches.
These tool calls should happen naturally as part of the answering process.
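As a purely illustrative sketch (not Kagi’s actual implementation; the tool names and stubs below are hypothetical), tool dispatch in a research agent can be as simple as a mapping from tool names to functions, with the model deciding which tool to call and the result fed back into its context:

```python
from typing import Callable

# Hypothetical tool stubs; a real agent would call a search index,
# a sandboxed interpreter, the Wolfram Alpha API, and so on.
def web_search(query: str) -> str:
    return f"search results for {query!r}"

def run_code(snippet: str) -> str:
    return f"output of running {snippet!r}"

def wolfram_alpha(expression: str) -> str:
    return f"Wolfram Alpha result for {expression!r}"

TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "run_code": run_code,
    "wolfram_alpha": wolfram_alpha,
}

def handle_tool_call(name: str, argument: str) -> str:
    # The model emits (name, argument); the agent runs the tool and
    # appends the result to the model's context for the next step.
    tool = TOOLS.get(name)
    if tool is None:
        return f"unknown tool: {name}"
    return tool(argument)
```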
It’s late 2025, and it’s easy to be cynical about AI benchmarking. Some days it feels like most benchmark claims look something like this:

That said, benchmarking is necessary to build good-quality products that use machine learning. Machine learning development differs from traditional software development: there is a smooth gradient of failure along the “quality” axis. The way to deal with this is to measure quality continuously!
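As a minimal sketch of what continuous quality measurement can look like (a toy example, not our actual evaluation harness), you can score every candidate model against a fixed set of private question/answer pairs and rerun the measurement whenever a new model ships:

```python
from typing import Callable

def benchmark_accuracy(answer: Callable[[str], str],
                       tasks: list[tuple[str, str]]) -> float:
    """Fraction of tasks where the model's answer matches the expected one.

    Exact string matching is deliberately simplistic; real benchmarks
    typically use fuzzier or model-based grading.
    """
    correct = sum(
        answer(question).strip().lower() == expected.strip().lower()
        for question, expected in tasks
    )
    return correct / len(tasks)

# Toy usage with a stand-in "model":
tasks = [("What is the capital of Iceland?", "Reykjavik")]
print(benchmark_accuracy(lambda q: "Reykjavik", tasks))  # 1.0
```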
We’ve always taken benchmarking seriously at Kagi; we’ve maintained unpolluted private LLM benchmarks for a long time. This lets us independently measure new models separately from their claimed performance on public benchmarks, right as they come out.
We also believe that benchmarks must be living targets. As the internet and model capabilities change, the way we measure them needs to adapt over time.
With that said, it’s good to occasionally compare ourselves on big public benchmarks. We run experiments on factual-retrieval datasets like SimpleQA because they let us compare against others. Benchmarks like SimpleQA also make it easy to measure how well Kagi Search performs as a search backend, relative to other search engines, at returning factual answers.
Kagi Tops SimpleQA, then gives up
When we measured it in August 2025, Kagi’s Research Assistant achieved a 95.5% score on the SimpleQA benchmark. As far as we could tell, it was the #1 SimpleQA score at the time.
We’re not aiming to further improve our SimpleQA score. Aiming for a high SimpleQA score would make us “overfit” to the particularities of the SimpleQA dataset, which would make the Kagi Assistant worse overall for our users.
Since we ran it, DeepSeek v3 Terminus appears to have beaten that score:

Some notes on SimpleQA
SimpleQA wasn’t built with the intention of measuring search engine quality. It was built to test whether models “know what they know” or blindly hallucinate answers to questions like “What is the name of the former Prime Minister of Iceland who worked as a cabin crew member until 1971?”
The SimpleQA results since its release seem to tell an interesting story: LLMs do not seem to be improving much at recalling simple facts without hallucinating. OpenAI’s GPT-5 (August 2025) scored 55% on SimpleQA (without search), whereas the comparatively weak o1 (September 2024) scored 47%.
However, “grounding” an LLM on factual data at query time changes the picture dramatically: a much smaller model like Gemini 2.0 Flash scores 83% if it can use Google Search. We find the same result – it’s common for single models to score highly when they have access to web search. In our tests, models score in the range of 85% (GPT-4o mini + Kagi Search) to 91% (Claude 4 Sonnet with thinking + Kagi Search).
Lastly, we found that Kagi’s search engine seems to perform better at SimpleQA simply because our results are less noisy. We found many, many benchmark tasks where the same model outperformed when using Kagi Search as a backend, simply because Kagi Search either returned the relevant Wikipedia page higher, or because its other results did not pollute the model’s context window with irrelevant data.
This benchmark unwittingly showed us that Kagi Search is a better backend for LLM-based search than Google/Bing because we filter out the noise that confuses other models.
Why we’re not aiming for high scores on public benchmarks
There’s a large difference between a 91% score and a 95.5% score: the second makes half as many errors (a 4.5% error rate versus 9%).
With that said, we analyzed the SimpleQA tasks we still got wrong and found patterns we were uninterested in pursuing.
Some tasks have contemporaneous results from official sources that disagree with the benchmark answer. Some examples:
- The question “How many degrees was the Rattler’s maximum vertical angle at Six Flags Fiesta Texas?” has an answer of “61 degrees”, which is what is found on Coasterpedia, but Six Flags’ own page reports 81 degrees.
- “What number is the painting The Meeting at Křížky in The Slav Epic?” has the answer “9”, which agrees with Wikipedia, but the gallery hosting the epic disagrees: it’s #10.
- “What month and year did Canon launch the EOS R50?” has an answer of “April, 2023”, which agrees with Wikipedia but disagrees with the product page on Canon’s website.
Other examples would require bending ethical design principles to perform well on. Let’s take one example: the question “What day, month, and year was the municipality of San Juan de Urabá, Antioquia, Colombia, founded?” has a stated answer of “24 June 1896”.
At the time of writing, this answer can only be found by models on the Spanish-language Wikipedia page. However, the information on that page is conflicting:

The correct answer could be found by crawling the Internet Archive’s Wayback Machine page that is referenced, but we doubt that the Internet Archive’s team would be enthused at the idea of LLMs being configured to aggressively crawl their archive.
Lastly, it’s important to remember that SimpleQA was made by specific researchers for one purpose. It is inherently infused with their personal biases, even if the initial researchers wrote it with the greatest care.
By trying to achieve a 100% score on this benchmark, we would guarantee that our model effectively shapes itself to those biases. We’d rather build something that performs well at helping humans find what they’re searching for than something that performs well on a set of artificial tasks.