Waveloop: What Fable left me

Original link: https://neynt.ca/writing/waveloop/

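Waveloop is a music visualization tool designed to reveal the harmonic and melodic structure of music through a color wheel. It uses twelve-tone equal temperament, maps pitch classes onto a circular interface, and uses the Oklch color space to present octaves as a stacked histogram. This lets users identify intervals by angle and recognize chord qualities by their distinctive geometric shapes. Waveloop was developed with the help of the Fable 5 AI and has an offline mode for precomputed tracks as well as a live mode capable of real-time chord detection. The author emphasizes the efficiency and density of the AI-generated code, comparing its style to the precise, information-dense character of "pure" programming. The author also describes the iterative process of using AI to make an accompanying explainer video, noting how specific prompts turned a mediocre first draft into polished, engaging teaching material. By combining esoteric technical concepts such as the constant-Q transform (CQT) and alpha premultiplication with an intuitive, attractive interface, Waveloop turns complex digital signal processing into an intuitive visual experience and pays tribute to the mathematical foundations of music theory.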

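This Hacker News thread discusses the recent project "Waveloop", a spectrum analyzer for music playback software. The main points of discussion include:

* **Fable's impact:** Users deeply miss the discontinued AI tool "Fable", crediting it with fixing long-standing, complex bugs and bringing "mathematical specificity" to thorny technical problems.
* **"Strange times":** Commenters reflect on the surreal nature of modern AI, noting that AI tools can generate highly specific content (from 3Blue1Brown-style math explainers to quirky narration), which prompts a broader discussion about the rise of AI-generated content.
* **Technical debate:** The discussion touches on music theory (explaining the ¹²√2 ratio of twelve-tone equal temperament) and the technical implementation of spectrum visualizers. Some users question the novelty of the Waveloop visualizer, comparing it with classic tools such as Milkdrop and suggesting pre-analysis of files to get better long-form visual structure.
* **Other:** The thread briefly discusses career progression in the tech industry, with users joking about the significance of the "L5" senior engineer milestone.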

Original text

Over the two days we had Fable 5, it made me a music visualizer. This is the realization of something I have daydreamed about for as long as I can remember.

You can see it here: Waveloop


The idea is that a music visualizer should viscerally reveal the harmonic and melodic structure of the music. Most visualizers fail to do this — you get a vague sense of loudness, and maybe the bass/treble split, but that's it.

How can we do better? As we all know, the foundation of Western diatonic music theory is ¹²√2, the ratio between the frequencies of successive semitones. (I ignore other temperaments; they are all close enough to 12-TET.) Twelve of these take you to the next octave, and notes that are a whole number of octaves apart are considered to be in the same pitch class.
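To make the arithmetic concrete, here is a minimal sketch of 12-TET in JavaScript. The reference pitch A4 = 440 Hz and the helper names `noteFrequency` and `pitchClass` are assumptions for illustration, not taken from Waveloop.

```javascript
// Equal-temperament arithmetic. A4 = 440 Hz is an assumed reference pitch.
const SEMITONE = Math.pow(2, 1 / 12); // the twelfth root of two
const A4 = 440;

// Frequency of the note n semitones above A4 (negative n goes below).
function noteFrequency(n) {
  return A4 * Math.pow(2, n / 12);
}

// Pitch class 0..11 of a frequency, counted in semitones up from A.
// Notes a whole number of octaves apart get the same result.
function pitchClass(freq) {
  const n = Math.round(12 * Math.log2(freq / A4));
  return ((n % 12) + 12) % 12;
}

console.log(SEMITONE.toFixed(6));               // the semitone ratio
console.log(noteFrequency(12));                 // twelve semitones up: one octave
console.log(pitchClass(220), pitchClass(880));  // both are an A
```

Stacking twelve semitone ratios multiplies the frequency by exactly two, which is why the pitch classes close into a circle.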

Waveloop captures this cyclic structure in a chromatic circle, 30° per semitone, one revolution per octave. Any instant in the music is captured as a spiral stacked histogram, showing you how much of each pitch class is present. The layers of the histogram are different colors capturing different octaves: muted blues and greens for the bass, fiery orange and red and violet for mid-tones, and sparkly gold and sky for treble, tracing a spiral through oklch.
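A rough sketch of that mapping, assuming the convention from the file's header comment quoted later in this post (angle = fract(log2(f / 440)), so A sits at 0°). The function names are hypothetical.

```javascript
// Position of a frequency on the chromatic circle, in degrees.
// Assumption: A is at 0 degrees and the angle grows with pitch.
function wheelAngleDegrees(freq) {
  const octaves = Math.log2(freq / 440);
  const turn = octaves - Math.floor(octaves); // fract(): 0 <= turn < 1
  return turn * 360;
}

// Angle swept by an interval of the given size, at 30 degrees per semitone.
function intervalAngleDegrees(semitones) {
  return (((semitones % 12) + 12) % 12) * 30;
}

console.log(wheelAngleDegrees(440), wheelAngleDegrees(880)); // same spoke
console.log(intervalAngleDegrees(7));                        // a perfect fifth
```

Because only the fractional part of the octave count survives, every octave of a note lands on the same spoke.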

This representation has some nice properties:

You can read intervals simply as angles. Here are the intervals:

m2    30°
M2    60°
m3    90°
M3   120°
P4   150°
TT   180°
P5   210°
m6   240°
M6   270°
m7   300°
M7   330°

You can tell the quality of a chord from its shape. Transposing rotates the shape; inversion leaves it unchanged. Here are some common chord qualities:

maj    0 · 4 · 7
min    0 · 3 · 7
dim    0 · 3 · 6
aug    0 · 4 · 8
sus4   0 · 5 · 7
sus2   0 · 2 · 7
dom7   0 · 4 · 7 · 10
maj7   0 · 4 · 7 · 11
min7   0 · 3 · 7 · 10
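One way to check the "transposing rotates, inversion leaves unchanged" claim is to reduce a chord to the gaps between its neighbouring pitch classes around the circle. This is a sketch with hypothetical helpers (`chordPitchClasses`, `shape`), not code from Waveloop.

```javascript
// Pitch classes (0..11) of a chord built on `root` from semitone offsets.
function chordPitchClasses(root, intervals) {
  return intervals.map((iv) => (root + iv) % 12);
}

// Shape of a set of notes on the circle: the semitone gaps between
// neighbouring pitch classes, rotated to a canonical starting point
// (the rotation that is smallest in string order).
function shape(notes) {
  const pcs = [...new Set(notes.map((n) => ((n % 12) + 12) % 12))].sort((a, b) => a - b);
  const gaps = pcs.map((p, i) => (pcs[(i + 1) % pcs.length] - p + 12) % 12);
  let best = null;
  for (let r = 0; r < gaps.length; r++) {
    const rotated = gaps.slice(r).concat(gaps.slice(0, r)).join(',');
    if (best === null || rotated < best) best = rotated;
  }
  return best;
}

const MAJ = [0, 4, 7];
const MIN = [0, 3, 7];
console.log(shape(chordPitchClasses(3, MAJ)));  // major triad on one root
console.log(shape(chordPitchClasses(8, MAJ)));  // transposed: same shape
console.log(shape([7, 10, 15]));                // first inversion: same shape
console.log(shape(chordPitchClasses(3, MIN)));  // minor: the mirror image
```

One caveat: shape alone does not fix the root. The sus2 and sus4 rows above produce the same shape under this reduction, since a sus2 chord has the same pitch classes as the sus4 chord a fifth higher.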

Waveloop primarily works offline, precomputing a constant-Q transform (CQT) for a particular track, but Fable also gave me a live mic mode. When I turn it on, I find that it's able to identify ukulele chords I play pretty quickly and reliably.
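For readers who haven't met the CQT: its bins are spaced geometrically rather than linearly, so each semitone gets its own bin. The sketch below lays out bin frequencies using the textbook definition; the parameters (`fMin` of 27.5 Hz, 12 bins per octave, 8 octaves, 44.1 kHz) are assumptions, and the post does not say what Waveloop actually uses.

```javascript
// Bin layout of a constant-Q transform, textbook formulation:
//   f_k = fMin * 2^(k / B),  Q = 1 / (2^(1/B) - 1),  N_k = ceil(Q * fs / f_k)
// All parameter values passed below are illustrative assumptions.
function cqtBins({ fMin, binsPerOctave, octaves, sampleRate }) {
  const q = 1 / (Math.pow(2, 1 / binsPerOctave) - 1);
  const bins = [];
  for (let k = 0; k < binsPerOctave * octaves; k++) {
    const freq = fMin * Math.pow(2, k / binsPerOctave);
    bins.push({
      k,
      freq,
      positionInOctave: k % binsPerOctave,               // spoke on the wheel
      windowLength: Math.ceil((q * sampleRate) / freq),  // samples analysed
    });
  }
  return { q, bins };
}

const { q, bins } = cqtBins({ fMin: 27.5, binsPerOctave: 12, octaves: 8, sampleRate: 44100 });
console.log(q.toFixed(3), bins.length);
console.log(bins[0].windowLength, bins[bins.length - 1].windowLength);
```

Low bins need much longer analysis windows than high ones, which is one plausible reason to precompute the transform for a whole track rather than do everything live.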


We've been without Fable for about a week now, and to remind myself of what once was, I took a look at some of the waveloop code.

The thing that struck me first is that it is dense. While previous models wrote code like a perfectly reasonable upwardly mobile engineer at a FAANG who is on their way to receiving a steady stream of promotions until they cap out at L5, this model writes more like how I'd imagine Terry Davis would have written code alone in his room.

Take a look at this comment at the top of the waveloop file. It wastes no words describing in obvious terms the code it just wrote. The comments seem more like maximally information dense recordings of intent, lockfiles from which something resembling the rest of the code could in principle be derived.

/* The visualizer is a pitch-class wheel: angle = fract(log2(f / 440)), so
   every octave of a note lands on the same spoke (A at 12 o'clock, ascending
   clockwise). The CPU keeps ~5 seconds of per-register-band emission history
   and rasterizes it every frame into the RGBA radial trail map sampled here
   (REGS vertically stacked blocks, T axis = radius; rgb = premultiplied
   register color with fade baked in, a = faded energy): each history row
   sits at the radius its own stored amplitude has carried it to, so motion
   is amplitude-driven - the loudest components shoot across the whole window
   while quiet accompaniment and noise linger near the ring, and the main
   line visually outruns everything else (see rasterTrails below).

   Color is continuous Oklch, computed CPU-side per FFT bin (hue encodes
   absolute frequency on a log scale, red at 20 Hz to violet at 20 kHz;
   lightness climbs the register axis - dark bass, fully saturated mids around
   common fundamentals, pale sparkly treble).

   Display energies live in 0..EMAX (loud fundamentals overshoot 1 instead
   of clipping at the old AGC ceiling); the trail map stores sqrt(v / EMAX)
   in alpha (and rgb premultiplied by that encoded alpha) so the u8 texture
   keeps low-end precision while carrying the extra headroom.

   Because the bands stay separate all the way to the screen, a pitch class
   sounding in several octaves renders as a stacked histogram on the rim
   (low register innermost), with color gliding continuously through the
   register ramp up the stack (u_rim carries the inverse CDF of each angle's
   register distribution) instead of cutting between a few band colors;
   register lives in the stack position and hue, never in the speed.

   The field extends past the farthest screen corner, and radius is a concave
   function of age, so material surges off the rim and decelerates as it drifts
   outward. */

The writing is deeply technical. This model doesn't shy away from drawing upon all its knowledge. It casually refers to alpha premultiplication and fundamental frequencies in the same breath. It is fond of acronyms. CDF, FFT, AGC. I can barely keep up.
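To unpack one of those sentences, here is my reading of the "sqrt(v / EMAX) in alpha" paragraph as a small sketch. The value of `EMAX` and the function names are assumptions; the actual shader and rasterizer are not shown in this post.

```javascript
// EMAX is a hypothetical headroom value; the post does not state the real one.
const EMAX = 4;

// Encode a display energy v (0..EMAX) and a straight RGB color (each 0..1)
// into one u8 RGBA texel: alpha = sqrt(v / EMAX), rgb premultiplied by alpha.
function encodeTexel(v, [r, g, b]) {
  const a = Math.sqrt(Math.min(Math.max(v / EMAX, 0), 1));
  const toU8 = (x) => Math.round(x * 255);
  return [toU8(r * a), toU8(g * a), toU8(b * a), toU8(a)];
}

// Recover the display energy from the stored alpha byte.
function decodeEnergy(alphaU8) {
  const a = alphaU8 / 255;
  return EMAX * a * a;
}

// For comparison: the same energy stored linearly in a u8.
function linearRoundTrip(v) {
  return (Math.round((v / EMAX) * 255) / 255) * EMAX;
}

const quiet = 0.001;
console.log(decodeEnergy(encodeTexel(quiet, [1, 0.5, 0])[3])); // small but nonzero
console.log(linearRoundTrip(quiet));                            // rounds away to 0
```

The square root spends more of the 256 available levels on quiet values, which is what "keeps low-end precision while carrying the extra headroom" appears to mean.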

The writing is also literary. It draws an analogy between the 12 musical pitch classes and the 12 markings on a clock. Noise lingers. Material surges off the rim. Fable doesn't shy away from using its entire vocabulary to tightly and vividly capture whatever it is it is trying to say.

Here is its function for chord detection. It seems thoroughly solid, and it's kind of surprising how little code it is.

const NOTE_NAMES = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#'];

const QUALITIES = [
  { name: '',     ivs: [0, 4, 7] },
  { name: 'm',    ivs: [0, 3, 7] },
  { name: 'dim',  ivs: [0, 3, 6] },
  { name: 'aug',  ivs: [0, 4, 8] },
  { name: 'sus4', ivs: [0, 5, 7] },
  { name: 'sus2', ivs: [0, 2, 7] },
  { name: '7',    ivs: [0, 4, 7, 10] },
  { name: 'maj7', ivs: [0, 4, 7, 11] },
  { name: 'm7',   ivs: [0, 3, 7, 10] },
];

function detectChord() {
  let total = 0;
  for (let i = 0; i < 12; i++) total += chroma[i];
  chromaAgc = Math.max(chromaAgc * 0.995, total, 1e-6);
  if (total < 0.15 * chromaAgc || chromaAgc < 1e-3) return null;

  const c = new Array(12);
  for (let i = 0; i < 12; i++) c[i] = chroma[i] / total;

  let best = null, bestScore = 0;
  for (let root = 0; root < 12; root++) {
    for (const q of QUALITIES) {
      let inS = 0;
      for (let k = 0; k < q.ivs.length; k++) {
        inS += c[(root + q.ivs[k]) % 12] * (k === 0 ? 1.15 : 1);
      }
      const score = inS / Math.pow(q.ivs.length, 0.55);
      if (score > bestScore) { bestScore = score; best = { root, q }; }
    }
  }
  if (!best) return null;
  let frac = 0;
  for (const iv of best.q.ivs) frac += c[(best.root + iv) % 12];
  if (frac < 0.5) return null;   // too much energy outside the chord tones
  return {
    name: NOTE_NAMES[best.root] + best.q.name,
    root: best.root,
    pcs: best.q.ivs.map((iv) => (best.root + iv) % 12),
  };
}
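Note that `detectChord` reads two pieces of state that are not shown here, `chroma` and `chromaAgc`. To try the scoring on its own, the sketch below restates it as a pure function over a 12-element chroma array. The name `scoreChord` is hypothetical, and the AGC loudness gate is left out, so this is an approximation of the function above rather than a drop-in replacement.

```javascript
const NOTE_NAMES = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#'];
const QUALITIES = [
  { name: '',     ivs: [0, 4, 7] },
  { name: 'm',    ivs: [0, 3, 7] },
  { name: 'dim',  ivs: [0, 3, 6] },
  { name: 'aug',  ivs: [0, 4, 8] },
  { name: 'sus4', ivs: [0, 5, 7] },
  { name: 'sus2', ivs: [0, 2, 7] },
  { name: '7',    ivs: [0, 4, 7, 10] },
  { name: 'maj7', ivs: [0, 4, 7, 11] },
  { name: 'm7',   ivs: [0, 3, 7, 10] },
];

// Same scoring as detectChord, but chroma is passed in and there is no AGC gate.
function scoreChord(chroma) {
  const total = chroma.reduce((sum, x) => sum + x, 0);
  if (total <= 0) return null;
  const c = chroma.map((x) => x / total); // normalise to a distribution

  let best = null, bestScore = 0;
  for (let root = 0; root < 12; root++) {
    for (const q of QUALITIES) {
      let inS = 0;
      q.ivs.forEach((iv, k) => {
        inS += c[(root + iv) % 12] * (k === 0 ? 1.15 : 1); // slight root bonus
      });
      const score = inS / Math.pow(q.ivs.length, 0.55);    // penalise bigger chords
      if (score > bestScore) { bestScore = score; best = { root, q }; }
    }
  }
  if (!best) return null;
  const frac = best.q.ivs.reduce((sum, iv) => sum + c[(best.root + iv) % 12], 0);
  return frac < 0.5 ? null : NOTE_NAMES[best.root] + best.q.name;
}

// Synthetic chroma vectors, indexed from A = 0 as in NOTE_NAMES.
const chromaOf = (pcs) => Array.from({ length: 12 }, (_, i) => (pcs.includes(i) ? 1 : 0));
console.log(scoreChord(chromaOf([3, 7, 10]))); // C, E, G
console.log(scoreChord(chromaOf([0, 3, 7])));  // A, C, E
console.log(scoreChord(new Array(12).fill(1))); // flat spectrum: rejected
```

The division by length^0.55 is what stops four-note chords from winning just because they cover more pitch classes, and the final `frac` check rejects input whose energy is mostly outside the best-matching chord.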

I also had Fable make an explainer video.

This was three prompts. My first prompt was this:

ok fuck it let's ball.

let's also make a manim-based video explaining the mathematical principles
behind waveloop, building up from basic "music theory from first principles"
all the way to fft, cqt, all that dsp, the circular stacked histogram, oklch...
i think we should have a tts plugin that lets you voice it over.

And it was, of course, hot garbage. But after providing this feedback:

ok let's iterate on that video.
- one: the voiceover is atrocious. toebeans has a tts server making use of
  qwen3-tts-voicedesign -- please use a similar sorta thing to narrate the
  video in the configured voice.
- there's a lot of very loud noise that punctuates the narration. not sure why.
- let's make far more use of generated sounds that correspond with the visuals
  on screen.
- let's spend far less time on the very basics and dig a bit more into detail
  about the particulars of the more sophisticated math.
- make the script more conversational. make it feel like you're talking to a
  friend, or watching a 3blue1brown or 2swap video.
- the key, and admittedly difficult: don't belabor any individual point to try
  to cram facts into the watcher's head, but make it feel like the user could
  have discovered this all themselves.
- use far less text in the video. make very interesting and illustrative
  visuals to make up for the lack of text. this isn't a slideshow. text should
  only ever be used as part of a diagram; not to reexplain things that the
  narration already explains.

We had, substantially, the video you see above.

I followed up with one more cleanup request:

that is a LOT better. let’s use proper typesetting for the math, keep a
consistent speaker voice by generating one VoiceDesign sample to condition off
of (or however you do this with qwen tts), and make sure our diagrams aren’t
overlapping.

And that's all it took.

And yeah, it's still not a fantastic video. But it was engaging enough to capture my attention for all ten minutes the first time I saw it.


AI usage disclaimer: I used Claude to generate svgs for the diagrams. But all prose is mine.
