An AI labyrinth for trapping misbehaving AI bots
Trapping misbehaving bots in an AI Labyrinth

Original link: https://blog.cloudflare.com/ai-labyrinth/

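Cloudflare has introduced AI Labyrinth, a novel security feature that uses AI-generated content to counter unauthorized AI crawlers and bots. Once a user opts in, Cloudflare can deploy a network of interlinked AI-generated pages that look legitimate but contain irrelevant information. This wastes crawlers' resources and slows down data scraping without blocking them outright, which avoids triggering adaptive behavior. AI Labyrinth also acts as a sophisticated honeypot. The hidden links are designed to be followed by bots, since real users do not interact with the AI-generated content, which lets Cloudflare identify and fingerprint malicious bots and improve its bot detection systems. The AI-generated content is pre-generated using Workers AI and an open source model, so that it is diverse and harmless. The links integrate seamlessly into existing pages without affecting user experience or SEO. AI Labyrinth is available to all customers, including those on the Free plan, and is enabled with a single toggle in the Cloudflare dashboard.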

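A Hacker News discussion centers on Cloudflare's AI Labyrinth, a method for trapping malicious bots. Commenters raise concerns about its impact on users of screen readers, and especially on users who are wrongly flagged as bots because of their privacy settings. Some question the ethics of feeding bots misleading information (even when that information is factually accurate) to deter them from scraping sites and spreading false information. Others consider how to identify and block bots that behave very much like humans, and whether pages protected by this technique would need frequent Turing tests. One commenter notes that if crawlers cannot detect these traps, the misinformation in the "labyrinth" may be attributed to the original site. The discussion explores the balance between bot detection, user experience, and the integrity of online information.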

Original text

Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.

AI Labyrinth is available on an opt-in basis to all customers, including the Free plan.

Using Generative AI as a defensive weapon

AI-generated content has exploded, reportedly accounting for four of the top 20 Facebook posts last fall. Additionally, Medium estimates that 47% of all content on their platform is AI-generated. Like any newer tool, it has both wonderful and malicious uses.

At the same time, we’ve also seen an explosion of new crawlers used by AI companies to scrape data for model training. AI Crawlers generate more than 50 billion requests to the Cloudflare network every day, or just under 1% of all web requests we see. While Cloudflare has several tools for identifying and blocking unauthorized AI crawling, we have found that blocking malicious bots can alert the attacker that you are on to them, leading to a shift in approach, and a never-ending arms race. So, we wanted to create a new way to thwart these unwanted bots, without letting them know they’ve been thwarted.

To do this, we decided to use a new offensive tool in the bot creator’s toolset that we haven’t really seen used defensively: AI-generated content. When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them. But while it looks real, this content is not actually the content of the site we are protecting, so the crawler wastes time and resources.

As an added benefit, AI Labyrinth also acts as a next-generation honeypot. No real human would go four links deep into a maze of AI-generated nonsense. Any visitor that does is very likely to be a bot, so this gives us a brand-new tool to identify and fingerprint bad bots, which we add to our list of known bad actors. Here’s how we do it…

How we built the labyrinth 

When AI crawlers follow these links, they waste valuable computational resources processing irrelevant content rather than extracting your legitimate website data. This significantly reduces their ability to gather enough useful information to train their models effectively.

To generate convincing human-like content, we used Workers AI with an open source model to create unique HTML pages on diverse topics. Rather than creating this content on-demand (which could impact performance), we implemented a pre-generation pipeline that sanitizes the content to prevent any XSS vulnerabilities, and stores it in R2 for faster retrieval. We found that generating a diverse set of topics first, then creating content for each topic, produced more varied and convincing results. It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.
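The two-stage pipeline described above (topics first, then one page per topic, sanitized and stored ahead of time) might be sketched roughly as follows. This is only an illustration under stated assumptions: `generate_topics` and `generate_body` are hypothetical stand-ins for model calls, a plain dictionary stands in for R2, and escaping with `html.escape` is one simple way to neutralize markup, not necessarily the sanitization used in production.

```python
import html
from typing import Callable, Dict, List


def pregenerate_pages(
    generate_topics: Callable[[int], List[str]],
    generate_body: Callable[[str], str],
    store: Dict[str, str],
    count: int,
) -> List[str]:
    """Two-stage pre-generation: topics first, then one page per topic."""
    keys: List[str] = []
    for index, topic in enumerate(generate_topics(count)):
        # Escape model output so stray markup cannot run as script (XSS).
        safe_title = html.escape(topic)
        safe_body = html.escape(generate_body(topic))
        page = (
            "<!doctype html><html><head>"
            f"<title>{safe_title}</title>"
            "</head><body>"
            f"<h1>{safe_title}</h1><p>{safe_body}</p>"
            "</body></html>"
        )
        key = f"labyrinth/page-{index}.html"  # hypothetical key layout
        store[key] = page  # stand-in for a write to object storage
        keys.append(key)
    return keys


# Hypothetical stand-ins; a real pipeline would call a model here.
def fake_topics(n: int) -> List[str]:
    return [f"Topic {i}" for i in range(n)]


def fake_body(topic: str) -> str:
    return f"Facts about {topic} <script>alert(1)</script>"


bucket: Dict[str, str] = {}
page_keys = pregenerate_pages(fake_topics, fake_body, bucket, 3)
```

Because the pages are built once and looked up by key, serving a decoy at request time is a storage read rather than a model call, which is the performance point the paragraph makes.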

This pre-generated content is seamlessly integrated as hidden links on existing pages via our custom HTML transformation process, without disrupting the original structure or content of the page. Each generated page includes appropriate meta directives to protect SEO by preventing search engine indexing. We also ensured that these links remain invisible to human visitors through carefully implemented attributes and styling. To further minimize the impact to regular visitors, we ensured that these links are presented only to suspected AI scrapers, while allowing legitimate users and verified crawlers to browse normally.
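As a rough illustration of the integration step, the sketch below adds a hidden link to a page only when the request is suspected of coming from a scraper, and builds a decoy page that carries a robots meta directive. The function names, the specific attributes (`display:none`, `aria-hidden`, `tabindex`, `rel`), and the string-based insertion are all assumptions made for this example; the actual HTML transformation process is not described in that detail here.

```python
import html

# Standard robots meta tag asking search engines not to index the page.
NOINDEX_META = '<meta name="robots" content="noindex, nofollow">'


def build_decoy_page(title: str, body: str) -> str:
    """Wrap pre-generated text in a page that opts out of indexing."""
    return (
        "<!doctype html><html><head>"
        f"{NOINDEX_META}<title>{html.escape(title)}</title>"
        "</head><body>"
        f"<p>{html.escape(body)}</p>"
        "</body></html>"
    )


def inject_hidden_link(page_html: str, decoy_url: str, suspected_scraper: bool) -> str:
    """Return the page unchanged for normal visitors, with a hidden link otherwise."""
    if not suspected_scraper:
        return page_html
    link = (
        f'<a href="{html.escape(decoy_url, quote=True)}" '
        'style="display:none" aria-hidden="true" tabindex="-1" rel="nofollow"></a>'
    )
    # Assumes a lowercase closing tag; falls back to appending if none is found.
    position = page_html.rfind("</body>")
    if position == -1:
        return page_html + link
    return page_html[:position] + link + page_html[position:]


original = "<html><body><p>Hello</p></body></html>"
for_human = inject_hidden_link(original, "/labyrinth/page-0.html", False)
for_bot = inject_hidden_link(original, "/labyrinth/page-0.html", True)
decoy = build_decoy_page("Tides", "Tides are driven mainly by the Moon.")
```

In this sketch `for_human` is identical to the original page, so regular visitors and verified crawlers receive exactly what the site owner wrote.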

A graph of daily requests over time, comparing different categories of AI Crawlers.


What makes this approach particularly effective is its role in our continuously evolving bot detection system. When these links are followed, we know with high confidence that it's automated crawler activity, as human visitors and legitimate browsers would never see or click them. This provides us with a powerful identification mechanism, generating valuable data that feeds into our machine learning models. By analyzing which crawlers are following these hidden pathways, we can identify new bot patterns and signatures that might otherwise go undetected. This proactive approach helps us stay ahead of AI scrapers, continuously improving our detection capabilities without disrupting the normal browsing experience.
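One way to picture how followed decoy links could become a detection signal is the small tracker below. It is a hedged sketch, not the production system: the `DecoyTracker` class, the client identifier, and the threshold are hypothetical, with the default of four chosen only to echo the "four links deep" remark earlier in this post.

```python
from collections import defaultdict
from typing import DefaultDict, Set

DEPTH_THRESHOLD = 4  # assumption, echoing "four links deep" in the post


class DecoyTracker:
    """Counts distinct decoy pages requested by each client."""

    def __init__(self, threshold: int = DEPTH_THRESHOLD) -> None:
        self.threshold = threshold
        self.visited: DefaultDict[str, Set[str]] = defaultdict(set)

    def record(self, client_id: str, decoy_path: str) -> bool:
        """Record a decoy hit; return True once the client looks automated."""
        self.visited[client_id].add(decoy_path)
        return len(self.visited[client_id]) >= self.threshold


tracker = DecoyTracker()
flags = [tracker.record("client-a", f"/labyrinth/page-{i}.html") for i in range(4)]
repeat = DecoyTracker()
repeat_flags = [repeat.record("client-b", "/labyrinth/page-0.html") for _ in range(5)]
```

Counting distinct pages rather than raw requests means a client that reloads one decoy page repeatedly is not treated the same as one that walks deep into the maze. In a real system the flagged identifiers would be one input among many to the detection models.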

By building this solution on our developer platform, we've created a system that serves convincing decoy content instantly while maintaining consistent quality - all without impacting your site's performance or user experience.

How to use AI Labyrinth to stop AI crawlers

Enabling AI Labyrinth is simple and requires just a single toggle in your Cloudflare dashboard. Navigate to the bot management section within your zone, and toggle the new AI Labyrinth setting to on.

Once enabled, the AI Labyrinth begins working immediately with no additional configuration needed.

AI honeypots, created by AI

The core benefit of AI Labyrinth is to confuse and distract bots. However, a secondary benefit is to serve as a next-generation honeypot. In this context, a honeypot is just an invisible link that a website visitor can’t see, but a bot parsing HTML would see and click on, therefore revealing itself to be a bot. Honeypots have been used to catch hackers since at least the Cuckoo’s Egg incident of late 1986. And in 2004, Project Honeypot was created by Cloudflare founders (prior to founding Cloudflare) to let everyone easily deploy free email honeypots, and receive lists of crawler IPs in exchange for contributing to the database. But as bots have evolved, they now proactively look for honeypot techniques like hidden links, making this approach less effective.
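To make the classic honeypot idea concrete, here is a small sketch using Python's standard `html.parser`. It shows that a parser walking the raw HTML collects every link, including one hidden with inline styling, while a simple visibility check filters it out. The `LinkCollector` class and the sample markup are illustrative assumptions, and the visibility check is deliberately simplistic.

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collects every href, and separately those not obviously hidden."""

    def __init__(self) -> None:
        super().__init__()
        self.all_links: List[str] = []
        self.visible_links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # A naive crawler queues every link it finds in the markup.
        self.all_links.append(href)
        # Simplistic check: inline display:none or the boolean hidden attribute.
        style = (attributes.get("style") or "").replace(" ", "").lower()
        if "display:none" not in style and "hidden" not in attributes:
            self.visible_links.append(href)


SAMPLE = (
    '<a href="/about">About</a>'
    '<a href="/trap" style="display: none">archive</a>'
)
collector = LinkCollector()
collector.feed(SAMPLE)
```

A crawler that follows `all_links` walks into `/trap`, while a human sees only the links in `visible_links`. The same few lines also show why a lone hidden link has become easy for evolved bots to avoid: the filter that skips it is trivial to write.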

AI Labyrinth won’t simply add invisible links, but will eventually create whole networks of linked URLs that are much more realistic, and not trivial for automated programs to spot. The content on the pages is obviously content no human would spend time consuming, but AI bots are programmed to crawl rather deeply to harvest as much data as possible. When bots hit these URLs, we can be confident they aren’t actual humans, and this information is recorded and automatically fed to our machine learning models to help improve our bot identification. This creates a beneficial feedback loop where each scraping attempt helps protect all Cloudflare customers.

For us, this is only the first iteration of using generative AI to thwart bots. Currently, while the content we generate is convincingly human, it won’t conform to the existing structure of every website. In the future, we’ll continue to work to make these links harder to spot and make them fit seamlessly into the existing structure of the website they’re embedded in. You can help us by opting in now.

To take the next step in the fight against bots, opt in to AI Labyrinth today.
