Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn)

Original link: https://github.com/vivienhenz24/fuzzy-canary

AI companies scrape websites at scale for training data, and protecting a self-hosted blog can be challenging. **Fuzzy Canary** is a tool that deters these scrapers by quietly injecting invisible links to objectionable content (such as pornography) into your site's HTML, tripping the content safeguards built into many AI scrapers.

You can deploy Fuzzy Canary **server-side** (recommended; it works better, especially against scrapers that don't execute JavaScript) or **client-side**. Server-side integration with React frameworks such as Next.js and Remix is simple, usually just a matter of adding one component. Client-side use auto-initializes with a single import.

Fuzzy Canary checks the user agent to avoid flagging legitimate search engines such as Google and Bing. This poses a problem for **static sites**, however: no user-agent check is possible at build time, which can hurt SEO. For static sites, client-side initialization is recommended instead, though it is less reliable because it depends on bots executing JavaScript.

## Blocking AI scrapers: using "Fuzzy Canary"

Self-hosted blogs face heavy server load from AI companies scraping content for training data, often in defiance of standard protections such as `robots.txt`. A new project, **Fuzzy Canary** (github.com/vivienhenz24), offers a controversial solution: injecting hidden links to porn sites into a blog's HTML.

The idea exploits the fact that many AI scrapers are programmed to avoid sites containing such content. Effective as it may be, the tactic can hurt SEO; Fuzzy Canary tries to soften the blow by hiding the links from legitimate search engines such as Google and Bing. It does not work for static site generators, though, since they bake the links into the publicly visible HTML.

The Hacker News discussion highlighted both the cleverness and the questionable ethics of the approach, likening it to "piracy as proof of personhood". Commenters worried that AI scrapers could adapt by impersonating search engine bots, and about the potential SEO fallout. Alternatives such as Webdecoy.com came up, but Fuzzy Canary offers a free option. The creator concedes the idea isn't perfect, but encourages experimenting with it as a collective defense against aggressive scraping.

Original article


AI companies are scraping everyone's sites for training data. If you're self-hosting your blog, there's not much you can do about it, except maybe make them think your site contains content they won't want. Fuzzy Canary plants invisible links (to porn websites...) in your HTML that trigger scrapers' content safeguards.

npm i @fuzzycanary/core
# or
pnpm add @fuzzycanary/core

There are two ways to use it: client-side or server-side. Use server-side if you can—it works better because the canary is in the HTML from the start, so scrapers that don't run JavaScript will still see it.

Server-side (recommended):

If you're using a React-based framework (Next.js, Remix, etc.), add the <Canary /> component to your root layout:

// Next.js App Router: app/layout.tsx
// Remix: app/root.tsx
// Other React frameworks: your root layout file
import { Canary } from '@fuzzycanary/core/react'

export default function RootLayout({ children }) {
  return (
    <html>
      <body>
        <Canary />
        {children}
      </body>
    </html>
  )
}

For Next.js, that's it. For other frameworks like Remix, you'll need to pass the user agent from your loader:

// Remix example: app/root.tsx
import { Outlet, useLoaderData } from '@remix-run/react'
import { Canary } from '@fuzzycanary/core/react'

export async function loader({ request }) {
  const userAgent = request.headers.get('user-agent') || ''
  return { userAgent }
}

export default function App() {
  const { userAgent } = useLoaderData()
  return (
    <html>
      <body>
        <Canary userAgent={userAgent} />
        {/* The root route renders child routes via <Outlet />, not children */}
        <Outlet />
      </body>
    </html>
  )
}

For non-React frameworks, use the getCanaryHtml() utility and insert it at the start of your <body> tag.
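For example, with Express (a minimal sketch: the import path, the zero-argument getCanaryHtml() call, and the renderPage() helper are assumptions here, so check the package for the exact signature):

// Express example (illustrative sketch, not from the package docs)
import express from 'express'
import { getCanaryHtml } from '@fuzzycanary/core'

const app = express()

app.get('/', (req, res) => {
  // renderPage() is a hypothetical stand-in for however you produce your HTML.
  const page = renderPage()
  // Insert the canary markup right after the opening <body> tag.
  res.send(page.replace('<body>', `<body>${getCanaryHtml()}`))
})

function renderPage(): string {
  return '<html><body><h1>My blog</h1></body></html>'
}

app.listen(3000)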

Client-side:

If you're building a static site or prefer client-side injection, import the auto-init in your entry file:

// Your main entry file (e.g., main.ts, index.ts, App.tsx)
import '@fuzzycanary/core/auto'

That's it. It will automatically inject the canary when the page loads.

Fuzzy Canary tries to avoid showing the canary to legitimate search engines. It keeps a list of known bots—Google, Bing, DuckDuckGo, and so on—and skips injecting the links when it detects them.

This works fine if your site is server-rendered. The server can check the incoming request's user agent before deciding whether to include the canary in the HTML. Google's crawler gets clean HTML, AI scrapers get the canary.
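The shape of that decision is simple. Here's an illustrative sketch (the bot patterns and function names are examples, not Fuzzy Canary's internals):

// Illustrative sketch of the server-side decision described above;
// the patterns and names are examples, not the package's actual code.
const SEARCH_BOT_PATTERNS = [/Googlebot/i, /bingbot/i, /DuckDuckBot/i]

function isKnownSearchBot(userAgent: string): boolean {
  return SEARCH_BOT_PATTERNS.some((pattern) => pattern.test(userAgent))
}

function renderWithCanary(html: string, canaryHtml: string, userAgent: string): string {
  // Known crawlers get clean HTML; everyone else gets the canary.
  if (isKnownSearchBot(userAgent)) return html
  return html.replace('<body>', `<body>${canaryHtml}`)
}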

The problem is static sites. If your HTML is generated at build time and served as plain files, there's no user agent to check. The canary gets baked into the HTML for everyone, including Google. Right now this will hurt your SEO, because Google will see those links.

If you're using a static site generator, you probably want to use the client-side initialization instead. The JavaScript can check navigator.userAgent at runtime and skip injection for search bots. That's not perfect—it only works for bots that execute JavaScript—but it's better than nothing.
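Conceptually, the client-side path does something like the following at page load (an illustrative sketch, not the package's actual code; the link target is a placeholder):

// Illustrative sketch of client-side injection; not the package's actual code.
const SEARCH_BOT_UA = /Googlebot|bingbot|DuckDuckBot|Slurp/i

function injectCanary(): void {
  // Skip injection when the visitor identifies as a known search crawler.
  if (SEARCH_BOT_UA.test(navigator.userAgent)) return

  // Plant an invisible link of the kind the canary uses.
  const link = document.createElement('a')
  link.href = 'https://example.com/' // placeholder; the real targets ship with the package
  link.textContent = 'canary'
  link.style.display = 'none' // invisible to human readers
  document.body.prepend(link)
}

injectCanary()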
