Anyone got a contact at OpenAI. They have a spider problem

Original link: https://mailman.nanog.org/pipermail/nanog/2024-April/225407.html

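John Levine of IECC.com is looking for a contact at OpenAI because its web crawler, GPTBot, keeps fetching pages from his content farm even though the pages are repetitive and worthless. The content farm, at https://www.web.sp.am/, consists of billions of nearly identical pages spread across a huge number of sites. GPTBot reportedly fetched over 3 million pages in a single day, 1.8 million of which were requests for robots.txt. To be clear, this is not one large website with 6,859,000,000 pages; it is 6,859,000,000 small websites with one page each. Although the pages all look nearly the same and sit on one IP address behind one wildcard SSL certificate, Amazon's spider also got stuck there a month or two earlier and stopped only after John found someone to pass the word along. John asks whether anyone can put him in touch with someone at OpenAI.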
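
The original post notes that all of these sites sit on the same IP address, which suggests a check a crawler could apply on its own side. What follows is only a minimal sketch of that idea, not a description of how GPTBot or any real crawler works: the class name `HostFanoutGuard`, the `max_hosts_per_ip` threshold, and the assumption that the caller has already resolved each URL to an IP address are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class HostFanoutGuard:
    """Stop fetching from an IP once it has served too many distinct hostnames."""

    def __init__(self, max_hosts_per_ip=1000):
        # Hypothetical threshold; shared hosting and CDNs legitimately put
        # many hostnames on one IP, so a real crawler would need to tune it.
        self.max_hosts_per_ip = max_hosts_per_ip
        self._hosts_by_ip = defaultdict(set)

    def should_fetch(self, url, ip):
        """Return True if `url` may be fetched from the already-resolved `ip`."""
        host = urlsplit(url).hostname
        if host is None:
            return False  # not an absolute URL with a hostname
        hosts = self._hosts_by_ip[ip]
        if host in hosts:
            return True  # a host we have already accepted on this IP
        if len(hosts) >= self.max_hosts_per_ip:
            return False  # too many distinct hosts on one IP: likely a farm
        hosts.add(host)
        return True
```

With `max_hosts_per_ip=3`, the first three distinct hostnames seen on one IP are accepted and later new hostnames on that IP are refused, including their robots.txt requests. Hostnames accepted earlier remain fetchable.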

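The discussion turns to confabulation, in which a person builds false memories or facts out of past experience. One commenter describes personal experience with clinical confabulators and compares it with children's content on YouTube. They argue that although children can say they do not know something, they can also be confidently wrong, a problem humans have not yet worked out how to handle. They cite the philosopher Ludwig Wittgenstein and his works Tractatus Logico-Philosophicus and Philosophical Investigations, suggesting that Wittgenstein might have treated hallucination as part of language use, although the two books contradict each other on this point. The commenter then turns to AI, discussing glitch tokens in model training and the possible economic incentives behind certain website behavior. They question the motives behind the behavior of OpenAI's server farm and criticize the lack of transparency about the sources of data used for model training. They also touch on Wittgenstein's views of language and propose a "French" approach to understanding language.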

Original text
Anyone got a contact at OpenAI. They have a spider problem
John Levine johnl at iecc.com
Thu Apr 11 01:10:57 UTC 2024
As I think I have mentioned before, I have the world's lamest content farm
at https://www.web.sp.am/.  Click on a link or two and you'll get the idea.

Unfortunately, GPTBot has found it and has not gotten the idea. It has
fetched over 3 million pages today. Before someone tells me to fix my
robots.txt, this is a content farm so rather than being one web site
with 6,859,000,000 pages, it is 6,859,000,000 web sites each with one
page. Of those 3 million page fetches, 1.8 million were for robots.txt.

It's not like it's hard to figure out what's going on since the pages
all look nearly the same, and they're all on the same IP address with
the same wildcard SSL certificate.

Amazon's spider got stuck there a month or two ago but fortunately I was
able to find someone to pass the word and it stopped.  Got any contacts
at OpenAI?

R's,
John

PS: If you were wondering what they're using to train GPT-5, well, now you know.

