Computer scientists invent an efficient new way to count

Original link: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

In the Hamlet experiment, your whiteboard has room for only 100 words. You begin by writing down the first 100 distinct words, skipping any repeats, and then flip a coin for each word to thin the list, leaving about 50. As the play continues, you add new words while further coin flips decide whether repeated words stay. Whenever the board fills up, you again delete about half the words based on coin tosses and move on to a stricter round. These random selections ensure that every word ends up with the same probability of remaining: 1/2^k, where k is the number of rounds completed. The method produces sound estimates regardless of the text's length or complexity, and its accuracy improves as memory grows. For example, if 61 words remain when Hamlet ends after six rounds, dividing 61 by 1/2^6 gives an estimate of 3,904 distinct words, close to the play's actual count of 3,967.

In this process, dubbed the "Hamlet algorithm," you gradually build a list of the unique words in Shakespeare's play Hamlet. When a repeated word comes up, a coin flip (in effect) randomly decides whether it is deleted or kept. A simple sketch of how it works:

1. Every occurrence of a word in Hamlet is assigned a random number between 0 and 1.
2. Only words whose most recent number falls below a certain threshold are stored. At first, while memory has room, every word is stored; once the limit is reached, some words must be evicted according to where their numbers sit relative to the threshold.
3. Whether a word is kept or evicted thus depends on how its most recent number compares with the threshold: the lower the threshold, the fewer words remain.
4. Reasoning about these random numbers rather than the actual words changes nothing, since whether a word is in memory depends only on its most recent occurrence.
5. By adjusting the threshold, you control how many distinct elements stay in memory without having to remember the individual words themselves.

In essence, the Hamlet algorithm estimates the total number of unique words by dividing the number of occupied memory slots by the sampling probability, i.e., the current threshold. Earlier methods typically relied on hash functions; by contrast, the "simplified version" of the Hamlet algorithm presented above takes a simpler route that avoids hash functions entirely. Innovative and appealing as they are, neither the hash-based nor the hash-free version is uniquely groundbreaking; they are alternative ways of achieving the same goal.
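As a concrete illustration of this threshold formulation, here is a minimal Python sketch; the function name, the threshold-halving rule, and the dictionary layout are assumptions chosen for clarity, not details taken from the paper.

    import random

    def cvm_estimate(stream, capacity=100):
        # Sketch of the threshold formulation above. Names and the
        # threshold-halving rule are illustrative assumptions.
        p = 1.0      # sampling threshold: keep a word only if its label < p
        memory = {}  # word -> random label from its most recent occurrence
        for word in stream:
            memory.pop(word, None)  # only the latest occurrence counts
            u = random.random()     # fresh label in [0, 1) for this occurrence
            if u < p:
                memory[word] = u
            if len(memory) >= capacity:
                # Memory is full: halve the threshold and evict labels at or
                # above it, removing about half of the stored words.
                p /= 2
                memory = {w: v for w, v in memory.items() if v < p}
        # Each surviving word was retained with probability p, so scale up.
        return round(len(memory) / p)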

Original text

Let’s return to Hamlet, but this time your working memory — consisting of a whiteboard — has room for just 100 words. Once the play starts, you write down the first 100 words you hear, again skipping any repeats. When the space is full, press pause and flip a coin for each word. Heads, and the word stays on the list; tails, and you delete it. After this preliminary round, you’ll have about 50 distinct words left.

Now you move forward with what the team calls Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list. Proceed in this fashion until you have 100 words on the whiteboard. Then randomly delete about half again, based on the outcome of 100 coin tosses. That concludes Round 1.

Next, move to Round 2. Continue as in Round 1, only now we’ll make it harder to keep a word. When you come to a repeated word, flip the coin again. Tails, and you delete it, as before. But if it comes up heads, you’ll flip the coin a second time. Only keep the word if you get a second heads. Once you fill up the board, the round ends with another purge of about half the words, based on 100 coin tosses.

In the third round, you’ll need three heads in a row to keep a word. In the fourth round you’ll need four heads in a row. And so on.

Eventually, in the kth round, you’ll reach the end of Hamlet. The point of the exercise has been to ensure that every word, by virtue of the random selections you’ve made, has the same probability of being there: 1/2^k. If, for instance, you have 61 words on your list at the conclusion of Hamlet, and the process took six rounds, you can divide 61 by the probability, 1/2^6, to estimate the number of distinct words — which comes out to 3,904 in this case. (It’s easy to see how this procedure works: Suppose you start with 100 coins and flip each one individually, keeping only those that come up heads. You’ll end up with close to 50 coins, and if someone divides that number by the probability, ½, they can guess that there were about 100 coins originally.)
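For readers who want to see the procedure end to end, here is a minimal Python sketch of the round-based coin-flip scheme just described; the function name, the set-based "whiteboard," and the purge details are our own illustrative choices, not code from the paper.

    import random

    def hamlet_estimate(words, capacity=100):
        # Sketch of the round-based procedure described in the article.
        board = set()
        k = 0  # round number; a word survives to the end with probability 1/2**k
        for word in words:
            if word in board:
                # A repeated word needs k consecutive heads to stay on the board.
                if not all(random.random() < 0.5 for _ in range(k)):
                    board.remove(word)
            else:
                board.add(word)
            if len(board) >= capacity:
                # Board is full: toss a coin for every word, keep the heads,
                # then move on to the next, stricter round.
                board = {w for w in board if random.random() < 0.5}
                k += 1
        # Every surviving word is present with probability 1/2**k; scale up.
        return len(board) * 2**k

Run on a token list of Hamlet with capacity=100, a call like hamlet_estimate(tokens) would return a value near 3,967, varying with the coin flips.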

Chakraborty, Variyam and Meel mathematically proved that the accuracy of this technique scales with the size of the memory. Hamlet has exactly 3,967 unique words. (They counted.) In experiments using a memory of 100 words, the average estimate after five runs was 3,955 words. With a memory of 1,000 words, the average improved to 3,964. “Of course,” Variyam said, “if the [memory] is so big that it fits all the words, then we can get 100% accuracy.”
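To reproduce that scaling informally, one could rerun the hamlet_estimate sketch above on a synthetic stream; the vocabulary size of 3,967 matches the article, but the stream construction itself is an invented stand-in for the Hamlet text.

    import random

    # Illustrative only: a synthetic vocabulary of 3,967 "words" stands in
    # for the Hamlet text; hamlet_estimate is the sketch given earlier.
    vocab = [f"w{i}" for i in range(3967)]
    stream = vocab + [random.choice(vocab) for _ in range(26000)]
    random.shuffle(stream)

    for cap in (100, 1000):
        runs = [hamlet_estimate(stream, capacity=cap) for _ in range(5)]
        print(f"memory={cap}: average of 5 runs = {sum(runs) / len(runs):.0f}")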

“This is a great example of how, even for very basic and well-studied problems, there are sometimes very simple but non-obvious solutions still waiting to be discovered,” said William Kuszmaul of Harvard University.
