Miasma: A tool to trap AI web scrapers in an endless poison pit

Original link: https://github.com/austin-weeks/miasma

## Fighting Back Against AI Scraping with Miasma

AI companies are aggressively scraping the internet for training data, potentially using your site's content without permission. **Miasma** is a tool designed to disrupt this practice by serving scrapers "poisoned" data. It works by continuously feeding misleading links and corrupted information to any scraper that visits it. You deploy it on your server (easily, via Cargo or a pre-built binary) and steer scraper traffic toward it using hidden links and a reverse proxy such as Nginx. Configuration is simple: choose a dedicated path (e.g. `/bots`) and cap concurrent connections to manage resource usage. Importantly, exclude legitimate bots (such as Googlebot) via your `robots.txt` file. Miasma is lightweight and efficient, designed to minimize the impact on your server while effectively wasting the resources of data-hungry AI companies. It is an active defense against unwelcome data collection.

A new tool called "Miasma", built by Austin Weeks and published on GitHub, aims to thwart AI web crawlers. As discussed on Hacker News, Miasma essentially creates a "poison pit" that traps crawlers in endless, useless content. The tool's release sparked a discussion about the ethics of AI scraping: one user called for regulation requiring for-profit crawlers to identify themselves to site owners, arguing that many AI companies ignore site owners' wishes and profit from freely shared knowledge. The name "Miasma" itself references the historical belief that disease was caused by bad air, aptly suggesting that the tool aims to corrupt the "health" (training data) of AI models, though its effectiveness remains unproven.

## Original Article


AI companies continually scrape the internet at an enormous scale, swallowing up all of its contents to use as training data for their next models. If you have a public website, they are already stealing your work.

Miasma is here to help you fight back! Spin up the server and point any malicious traffic towards it. Miasma will send poisoned training data from the poison fountain alongside multiple self-referential links. It's an endless buffet of slop for the slop machines.

Miasma is very fast and has a minimal memory footprint - you should not have to waste compute resources fending off the internet's leeches.

Sample response from Miasma.

Install with cargo (recommended):
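The install command was stripped from this copy of the README. Assuming the project is published on crates.io under the crate name `miasma` (the crates.io badge in the original suggests it is), the standard invocation would be:

```shell
# Install from crates.io (crate name "miasma" assumed)
cargo install miasma
```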

Or, download a pre-built binary from releases.

Start Miasma with default configuration:
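The start command is also missing from this copy; with the defaults from the options table below (host `localhost`, port `9999`), launching the binary with no flags would presumably look like:

```shell
# Run with default configuration (binds to localhost:9999)
miasma
```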

View all available configuration options:
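Assuming the CLI follows the conventional help flag:

```shell
# Print all available CLI options
miasma --help
```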

Let's walk through an example of setting up a server to trap scrapers with Miasma. We'll pick /bots as our server's path to direct scraper traffic. We'll be using Nginx as our server's reverse proxy, but the same result can be achieved with many different setups.

When we're done, scrapers will be trapped like so:

Flow chart depicting cycle of trapped scrapers.

Within our site, we'll include a few hidden links leading to /bots.

<a href="/bots" style="display: none;" aria-hidden="true" tabindex="-1">
  Amazing high quality data here!
</a>

The style="display: none;", aria-hidden="true", and tabindex="-1" attributes ensure the links are invisible to human visitors and ignored by screen readers and keyboard navigation (a tabindex of -1 removes an element from the tab order; a positive value would instead make it the first tab stop). Only scrapers will follow them.

Configuring our Nginx Proxy

Since our hidden links point to /bots, we'll configure this path to proxy Miasma. Let's assume we're running Miasma on port 9855.

location ~ ^/bots($|/.*)$ {
  proxy_pass http://localhost:9855;
}

This will match all variations of the /bots path: /bots, /bots/, /bots/12345, etc.
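The location regex can be sanity-checked outside Nginx. Here is a quick shell sketch using grep's extended regex syntax, which handles this pattern identically:

```shell
# Test which request paths the Nginx location regex ^/bots($|/.*)$ captures.
for path in /bots /bots/ /bots/12345 /botsfoo /other; do
  if echo "$path" | grep -Eq '^/bots($|/.*)$'; then
    echo "$path -> proxied to Miasma"
  else
    echo "$path -> served normally"
  fi
done
```

Note that /botsfoo falls through to normal handling: after the literal /bots, the regex requires either the end of the path or a slash.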

Lastly, we'll start Miasma and specify /bots as the link prefix. This instructs Miasma to start links with /bots/, which ensures scrapers are properly routed through our Nginx proxy back to Miasma.

We'll also limit the number of max in-flight connections to 50. At 50 connections, we can expect 50-60 MB peak memory usage. Note that any requests exceeding this limit will immediately receive a 429 response rather than being added to a queue.

miasma --link-prefix '/bots' -p 9855 -c 50

Let's deploy and watch as multi-billion dollar companies greedily eat from our endless slop machine!

Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Slurp
User-agent: SomeOtherNiceBot
Disallow: /bots
Allow: /

Miasma can be configured via its CLI options:

| Option | Default | Description |
| --- | --- | --- |
| `port` | `9999` | The port the server should bind to. |
| `host` | `localhost` | The host address the server should bind to. |
| `max-in-flight` | `500` | Maximum number of allowable in-flight requests. Requests received once the limit is exceeded receive a 429 response. Miasma's memory usage scales directly with the number of in-flight requests; set this to a lower value if memory usage is a concern. |
| `link-prefix` | `/` | Prefix for self-directing links. This should be the path where you host Miasma, e.g. `/bots`. |
| `link-count` | `5` | Number of self-directing links to include in each response page. |
| `force-gzip` | `false` | Always gzip responses regardless of the client's `Accept-Encoding` header. Forcing compression can help reduce egress costs. |
| `poison-source` | `https://rnsaffn.com/poison2/` | Proxy source for poisoned training data. |
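As a sketch of how these options combine: only `--link-prefix`, `-p`, and `-c` appear verbatim earlier in the README, so the long-flag spellings for the remaining options are assumed from the option names above.

```shell
# Hypothetical invocation combining the documented options
# (long flags for host, max-in-flight, link-count, and force-gzip are assumed)
miasma --host 0.0.0.0 --port 9855 --max-in-flight 100 \
       --link-prefix '/bots' --link-count 8 --force-gzip
```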

Contributions are welcome! Please open an issue for bug reports or feature requests. Primarily AI-generated contributions will be automatically rejected.
