I fed 24 years of my blog posts to a Markov model

Original link: https://susam.net/fed-24-years-of-posts-to-markov-model.html

Susam Pal recently shared 'Mark V. Shaney Junior', a minimal Markov text generator inspired by a program from the 1980s, available on GitHub. Pal enjoys 'exploratory programming': writing small programs for fun and learning, often iterating on earlier experiments. This project is one such refinement, polished up and shared with a README. The program generates text by analysing its input and predicting the next word from the preceding sequence (trigrams by default). Pal tested it on A Christmas Carol and on 24 years of his blog posts (about 200,000 words), with rather amusing results. The blog-post input produced 'gibberish' that is often incoherent but sometimes surprisingly relevant, reflecting recurring themes in his writing such as Lisp, self-esteem, and Emacs. Raising the model's 'order' (the number of preceding words considered) improves coherence but can also lead to verbatim quotation. Pal describes this simple Markov model as the 'hello, world' of language modelling, emphasising its simplicity and its effectiveness as a learning tool.

## Markov Models vs LLMs: A Summary of the Hacker News Discussion

A Hacker News thread, sparked by a user feeding 24 years of blog posts into a Markov model, turned into an interesting exploration of text-generation techniques. The original author experimented with Markov models of different 'orders' (characters, pairs, trigrams) and with byte pair encoding (BPE) tokenisation, finding that higher orders produced more coherent but also more deterministic output. Many commenters shared similar experiences, recalling old projects that used Markov chains for creative writing or chatbots. The discussion quickly turned to comparing Markov models with modern large language models (LLMs). While an LLM can technically be defined as a Markov chain with an enormous number of states, many felt that this definition is too broad and loses the core principles of traditional Markov models. The debate centred on whether LLMs genuinely satisfy the Markov property (the future state depends only on the current state) or maintain some form of 'memory' or mutable context. Several users pointed to the ability of LLMs to handle long-range dependencies, something simple Markov chains struggle with. Community members also shared practical advice on training LLMs on personal writing using OpenAI's API, along with resources for experimenting with Markov models.

Original article

By Susam Pal on 13 Dec 2025

Yesterday I shared a little program called Mark V. Shaney Junior at github.com/susam/mvs. It is a minimal implementation of a Markov text generator inspired by the legendary Mark V. Shaney program from the 1980s. If you don't know about Mark V. Shaney, you can read more about it in the Wikipedia article Mark V. Shaney.

It is a very small program that favours simplicity over efficiency. As a hobby, I often engage in exploratory programming where I write computer programs not to solve a specific problem but simply to explore a particular idea or topic for the sole purpose of recreation. I must have written small programs to explore Markov chains for various kinds of state spaces over a dozen times by now. Every time, I just pick my last experimental code and edit it to encode the new state space I am exploring. That's usually my general approach to such one-off programs. I have hundreds of tiny little experimental programs lying on my disk at any given time.

Once in a while, I get the itch to take one of those exploratory programs, give it some finishing touches, wrap it up in a nice Git repo along with a README.md, CHANGES.md and the whole shebang and share it on github.com/susam and codeberg.org/susam. The Mark V. Shaney Junior program that I shared yesterday happened to be one such exercise.

If you scroll down the README of this project, you'll find some nice examples of the gibberish produced by this program. The first few examples there are the result of training the model on A Christmas Carol by Charles Dickens, one of my favourite authors. It is often said that Dickens never used fewer words when more would suffice. So I thought there couldn't be a better piece of text for testing out my tiny Markov model. I'll not reproduce the generated text examples here for the sake of brevity. If you are interested in taking a look, just head over to the Gibberish Examples section of the README.

Soon after sharing the project, I wondered what kind of gibberish it would produce if I fed all 24 years of my blog posts and pages into the program. Well, here's one of the results:

$ python3 mvs.py < susam.txt
while a query replace operation is approved by the user. The above
variable defines the build job. It can be incredibly useful while
working on assembly language and machine code. In fact, all internal
resources like the result to refine the search prompt changes from
bck-i-search: to fwd-i-search:. Now type C-SPC (i.e. ctrl+space) to
set a mark causes Emacs to use 32-bit registers like EBP, ESP,
etc. Thus the behaviour is undefined. Such code may behave differently
when compiled with the readily available GNU tools like the shape
of 8. Flipping "P" horizontally makes it a proper quine: cat $0

This is the text that comes out after the program consumes over 200 posts consisting of about 200,000 words. Here is another example:

$ python3 mvs.py < susam.txt
enjoy asking "what happens if" and then type M-x zap-up-to-char RET
b. The buffer for this specific video, the actual fare for 8.3 km and
11 are all written from scratch. No prior knowledge is expected to
slow down in future. For now, I will add a statement like x =
0.999... , the inner corner square as discussed in the code segment
into the REPL window. Unlike Slimv, Vlime can work with and debug
executable files, it can be more convenient. M-x: Execute Extended
Command The key sequence M-q invokes the command cat and type TAB to
indent the

Here is a particularly incoherent but amusing one:

$ python3 mvs.py < susam.txt
Then open a new Lisp source file and the exact answer could harm
students' self-esteem. Scientists have arbitrarily assumed that an
integral domain. However, the string and comment text. To demonstrate
how a build job can trigger itself, pass input to standard output or
standard error), Eshell automatically runs the following command in
Vim and Emacs will copy the message length limit of 512 characters,
etc. For example, while learning to play the game between normal mode
to move the point is on an old dictionary lying around our house and
that is moving to the small and supportive community

No, I have never written anywhere that opening a Lisp source file could harm anyone's self-esteem. The text generator has picked up the 'Lisp source file' phrase from my Lisp in Vim post and the 'self-esteem' bit from the From Perl to Pi post.

By default, this program looks at trigrams (all sequences of three adjacent words) and creates a map where the first two words of each trigram form the key and the third word is appended to that key's list value. This map is the model. In this way, the model captures each pair of adjacent words along with the words that immediately follow that pair. The text generator then chooses a key (a pair of words) at random and looks for a word which follows. If there are multiple followers, it picks one at random. The last two words of the output then become the next key, and the process repeats. That is pretty much the whole algorithm. There isn't much more to it. It is as simple as it gets. For that reason, I often describe a simple Markov model like this as the 'hello, world' of language modelling.
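
To make this concrete, here is a minimal sketch of the same idea in Python. This is not the actual code in mvs.py; the names build_model and generate are mine, but the logic follows the description above:

import random
from collections import defaultdict

def build_model(words, order=2):
    # Map each tuple of `order` adjacent words to the list of
    # words seen immediately after that tuple in the input.
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model

def generate(model, count=80):
    # Pick a random key to start, then repeatedly emit a random
    # follower and slide the key window forward by one word.
    key = random.choice(list(model))
    out = list(key)
    for _ in range(count):
        followers = model.get(key)
        if not followers:
            break  # dead end: nothing ever followed this key
        out.append(random.choice(followers))
        key = tuple(out[-len(key):])
    return ' '.join(out)

words = open('susam.txt').read().split()
print(generate(build_model(words)))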

The number of words in the key of the map can be set via a command line argument. This value is also known as the order of the model, so by default the order is 2, as described above. If we increase it to, say, 3 or 4, the generated text becomes a little more coherent. Here is one such example:

$ python3 mvs.py 4 < susam.txt
It is also possible to search for channels by channel names. For
example, on Libera Chat, to search for all channels with "python" in
its name, enter the IRC command: /msg alis list python. Although I
have used Libera Chat in the examples above, there are plenty of
infinite fields, so they must all be integral domains too. Consider
the field of rational numbers Q . Another quick way to arrive at this
fact is to observe that when one knight is placed on a type D square,
only two positions for the second knight such that the two knights

Except for a couple of abrupt transitions, the text is mostly coherent. We need to be careful not to increase the order too much, though. In fact, if we increase the order of the model to 5, the generated text becomes very dry and factual because it begins to quote large portions of the blog posts verbatim. Not much fun can be had like that.
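
The reason for this degeneration is easy to check with the model itself: as the keys grow longer, more and more of them have exactly one recorded follower, leaving the generator no choice but to reproduce the source text. Using the hypothetical build_model sketch from earlier:

# Average number of distinct followers per key at each order.
# As this ratio approaches 1, generation degenerates into
# verbatim quotation of the training text.
for order in (1, 2, 3, 4, 5):
    model = build_model(words, order)
    avg = sum(len(set(v)) for v in model.values()) / len(model)
    print(f'order {order}: {avg:.2f} distinct followers per key')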

Before I end this post, let me present one final example where I ask it to generate text from an initial prompt:

$ python3 mvs.py 2 100 'Finally we'
Finally we divide this number by a feed aggregrator for Emacs-related
blogs. The following complete key sequences describe the effects of
previous evaluations shall have taken a simple and small to contain
bad content. This provides an interactive byte-compiled Lisp function
in MATLAB and GNU bash 5.1.4 on Debian is easily reproducible in
Windows XP. Older versions might be able to run that server for me it
played a significant burden on me as soon as possible. C-u F: Visit
the marked files or directories in the sense that it was already
initiated and we were to complete the proof.

Apparently, this is how I would sound if I ever took up speaking gibberish!
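
For completeness, seeding the generator with a prompt like this is only a small change to the earlier sketch: instead of choosing a random starting key, start from the last few words of the prompt. The generate_from helper below is hypothetical, not the code in mvs.py, and it assumes that the prompt's final words actually occur somewhere in the training text:

def generate_from(model, prompt, count=100):
    # Derive the order from the key length, seed the key with the
    # final words of the prompt, then generate as before.
    out = prompt.split()
    order = len(next(iter(model)))
    key = tuple(out[-order:])
    for _ in range(count):
        followers = model.get(key)
        if not followers:
            break
        out.append(random.choice(followers))
        key = tuple(out[-order:])
    return ' '.join(out)

print(generate_from(build_model(words), 'Finally we'))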
