Big LLMs weights are a piece of history

Original link: https://antirez.com/news/147

Salvatore Sanfilippo (antirez) laments the ongoing disappearance of web content and stresses the Internet Archive's critical role in preserving modern history. He points to the loss of all kinds of material: early programming discussions, 90s subcultures, personal blogs, scientific papers, digital art, and climate data. While advocating continued support for the Internet Archive and similar initiatives, he acknowledges that preserving everything is impossible given funding constraints. He proposes the data-compression ability of large language models (LLMs) as a viable, if imperfect, remedy, regarding models like DeepSeek V3 as a lossy but valuable "compressed view of the Internet." His main recommendation is to ensure the continued public availability of LLM weights and to include the Internet Archive's contents in their pre-training datasets, which would preserve at least a distilled representation of online information that would otherwise be lost.

This Hacker News thread discusses the idea of large language model (LLM) weights as historical artifacts, prompted by an article arguing that LLMs capture a snapshot of human knowledge. Commenters debated naming conventions for LLMs of different sizes, with suggestions ranging from Starbucks cup sizes to clothing sizes; a recurring observation was that even though LLMs are already "large language models," they still need further size distinctions, and many users agreed that "Big LLMs" could be a workable term. Some comments dug into LLMs as lossy compression of Internet data and their potential value to future historians. The thread also covered the Internet Archive and its preservation mission, with users debating the value of that data and whether all of it should be preserved, and whether these models primarily embody artificial intelligence or artificial memory. Overall the discussion was wide-ranging, spanning naming, classification, and the broader implications of LLMs for the preservation of and access to knowledge.
Related articles
  • (Comments) 2024-05-28
  • (Comments) 2024-02-16
  • (Comments) 2023-11-16
  • (Comments) 2024-04-19
  • If it is worth keeping, save it in Markdown 2025-02-27

  • Original article
    antirez 14 hours ago. 49396 views.
    By multiple accounts, the web is losing pieces: every year a fraction of old web pages disappears, lost forever. We should regard the Internet Archive as one of the most valuable pieces of modern history; instead, many companies and entities make it harder and harder for the Archive to survive and to accumulate what would otherwise be lost. I understand that the Archive's headquarters are located in what used to be a church: well, there is no better way to think of it than as a sacred place.
    
    Imagine the long hours spent by old programmers hacking in Z80 assembly on their Spectrums. All the discussions about the first generation of the Internet. The subcultures that appeared during the 90s. All of it is getting lost, piece by piece.
    
    And what about the personal blogs? Pieces of the lives of single individuals who dumped part of their consciousness on the Internet. Scientific papers and processes lost forever as publishers fail and their websites shut down. Early digital art, video games, climate data once published on the Internet and now gone, and many news sources as well.
    
    This is a known issue, and I believe the obvious approach of trying to preserve everything is going to fail for practical reasons: it is a lot of effort for zero economic gain, and the current version of the world is not exactly the best place for efforts that cost a lot of money and pay nothing back. This is why I believe that the LLMs' ability to compress information, even if imprecise, hallucinated, and lacking, is better than nothing. DeepSeek V3 is already an available, public, lossy compressed view of the Internet, as are other very large state-of-the-art models.
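    To make the "lossy compressed view" concrete: prompting an open-weights model asks it to reconstruct an approximate rendition of material it absorbed during pre-training. A minimal sketch, assuming the Hugging Face transformers library and using GPT-2 purely as a small illustrative stand-in for a much larger open model:

    ```python
    # Minimal sketch: an open-weights LLM as a lossy "decompressor" of its
    # training data. GPT-2 is an illustrative stand-in; any locally
    # available open-weights checkpoint would do.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    # "Querying the archive": the continuation is a plausible
    # reconstruction, not a faithful copy -- imprecise, possibly
    # hallucinated, but better than nothing once the originals are gone.
    prompt = "In the 90s, hobbyist programmers on the ZX Spectrum would"
    result = generator(prompt, max_new_tokens=60, do_sample=True,
                       temperature=0.8)
    print(result[0]["generated_text"])
    ```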
    
    This will not bring back all the things we are losing, and we should keep trying hard to support the Internet Archive and other similar institutions and efforts. But, at the same time, we should focus on a much simpler effort: making sure that the weights of publicly released LLMs do not get lost, and that the Archive's contents are part of the pre-training sets as well.
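    In practice, "not getting lost" could mean mirroring released checkpoints and recording integrity hashes so future copies can be verified. A sketch of that idea, assuming the huggingface_hub library; the repo id is illustrative, and a full DeepSeek V3 snapshot runs to hundreds of gigabytes:

    ```python
    # Sketch: archiving a publicly released model's weights with a
    # SHA-256 manifest so mirrored copies can be verified later.
    import hashlib
    from pathlib import Path

    from huggingface_hub import snapshot_download

    # Illustrative repo id; a real archival effort would loop over many
    # models (and needs serious storage for state-of-the-art checkpoints).
    local_dir = snapshot_download(repo_id="deepseek-ai/DeepSeek-V3")

    # Hash each file in 1 MiB chunks to keep memory use flat even for
    # multi-gigabyte weight shards.
    for path in sorted(Path(local_dir).rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            print(h.hexdigest(), path.relative_to(local_dir))
    ```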