For the past couple of weeks, I couldn’t shake off an intrusive thought: raster graphics and audio files are awfully similar — both are sequences of digitized measurements of an analog signal — so what would happen if we applied the same transformations to both?…
Let’s start with downsampling: what if we divide the data stream into buckets of n samples each, and then map the entire bucket to a single, averaged value?
/* Replace every window of win_size samples with its average. */
for (int pos = 0; pos + win_size <= len; pos += win_size) {
  float sum = 0;
  for (int i = 0; i < win_size; i++) sum += buf[pos + i];
  for (int i = 0; i < win_size; i++) buf[pos + i] = sum / win_size;
}
For images, the result is aesthetically pleasing pixel art. But if we do the same to audio… well, put your headphones on, you’re in for a treat:
The model for the images is our dog, Skye. The song fragment is a cover of “It Must Have Been Love” performed by Effie Passero.
If you’re familiar with audio formats, you might’ve expected a muffled but otherwise neutral rendition of the sort associated with low sample rates. Yet, the “audio pixelation” filter does something else: it adds unpleasant, metallic-sounding overtones. The culprit is the stairstep pattern in the resulting waveform:
Our eyes don’t mind the pattern on the computer screen, but the cochlea is a complex mechanical structure that doesn’t measure sound pressure levels per se; instead, it has clusters of different nerve cells sensitive to different sine-wave frequencies. Abrupt jumps in the waveform are perceived as wideband noise that wasn’t present in the original audio stream.
The problem is easy to solve: we can run the jagged waveform through a rolling-average filter, the equivalent of blurring the pixelated image to remove the artifacts:
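For reference, here’s a minimal sketch of such a rolling-average pass; the function name, the symmetric window, and the edge handling are arbitrary choices of mine rather than anything dictated by the technique:

#include <stdlib.h>
#include <string.h>

/* Replace each sample with the average of the surrounding 'win' samples. */
void rolling_average(float* buf, int len, int win) {
  float* out = malloc(len * sizeof(float));
  for (int i = 0; i < len; i++) {
    float sum = 0;
    int n = 0;
    for (int j = i - win / 2; j <= i + win / 2; j++) {
      if (j < 0 || j >= len) continue;
      sum += buf[j];
      n++;
    }
    out[i] = sum / n;
  }
  memcpy(buf, out, len * sizeof(float));
  free(out);
}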
But this brings up another question: is the effect similar if we keep the original 44.1 kHz sample rate but reduce the bit depth of each sample in the file?
/* Assumes a signed int16_t buffer; produces levels + 1 output steps for even values of levels. */
int div = 32767 / (levels / 2);
for (int i = 0; i < len; i++)
  buf[i] = round((float)buf[i] / div) * div;
The answer is yes and no: because the frequency of the injected errors will be on average much higher, we get hiss instead of squeals:
Also note that the loss of fidelity is far more rapid for audio than for quantized images!
As for the hiss itself, it’s inherent to any attempt to play back quantized audio; it’s why digital-to-analog converters in your computer and audio gear typically need to incorporate some form of lowpass filtering. Your sound card has that, but we injected errors greater than what the circuitry was designed to mask.
But enough with image filters that ruin audio: we can also try some audio filters that ruin images! Let’s start by adding a slightly delayed and attenuated copy of the data stream to itself:
/* Mix each sample with an attenuated copy of the signal 'shift' samples earlier. */
for (int i = shift; i < len; i++)
  buf[i] = (5 * buf[i] + 4 * buf[i - shift]) / 9;
Check it out:
For photos, small offsets result in an unappealing blur, while large offsets produce a weird “double exposure” look. For audio, the approach gives rise to a large and important family of filters. Small delays give the impression of a live performance in a small room; large delays sound like an echo in a large hall. Phase-shifted signals create effects such as “flanger” or “phaser”, a pitch-shifted echo sounds like a chorus, and so on.
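To make one member of that family concrete, a flanger can be approximated by sweeping the delay length with a slow sine wave. Below is a toy sketch; the function name and all parameter values are made up, and the dry copy exists only to keep the example simple:

#include <math.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy flanger: the delay sweeps between 1 and max_delay samples, driven by
   a low-frequency sine. Mixes the dry signal with its swept-delay copy. */
void flanger(int16_t* buf, int len, int max_delay, float lfo_hz, float sample_rate) {
  int16_t* dry = malloc(len * sizeof(int16_t));
  memcpy(dry, buf, len * sizeof(int16_t));
  for (int i = max_delay; i < len; i++) {
    float lfo = (sinf(2 * M_PI * lfo_hz * i / sample_rate) + 1) / 2;  /* 0..1 */
    int d = 1 + (int)(lfo * (max_delay - 1));
    buf[i] = (dry[i] + dry[i - d]) / 2;
  }
  free(dry);
}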
So far, we’ve been working in the time domain, but we can also analyze data in the frequency domain; any finite signal can be deconstructed into a sum of sine waves with different amplitudes, phases, and frequencies. The two most common conversion methods are the discrete Fourier transform and the discrete cosine transform, but there are wackier options to choose from if you’re so inclined.
For images, the frequency-domain view is rarely used for editing, because almost any change tends to produce visual artifacts. The technique shows up in compression, feature detection, and noise removal, but not much else; it can also be used for sharpening or blurring, but there are easier ways of doing that without the FFT.
For audio, the story is different. For example, the approach makes it fairly easy to build vocoders that modulate the output from other instruments to resemble human speech, or to develop systems such as Auto-Tune, which make out-of-tune singing sound passable.
In the earlier article, I shared a simple implementation of the fast Fourier transform (FFT) in C:
#include <complex.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Radix-2 Cooley-Tukey FFT: recursively transform the even- and odd-indexed
   halves, then combine them with the twiddle factor cexp(-i * pi * pos / len). */
void __fft_int(double complex* buf, double complex* tmp, const uint32_t len,
               const uint32_t step) {
  if (step >= len) return;
  __fft_int(tmp, buf, len, step * 2);
  __fft_int(tmp + step, buf + step, len, step * 2);
  for (uint32_t pos = 0; pos < len; pos += 2 * step) {
    double complex t = cexp(-I * M_PI * pos / len) * tmp[pos + step];
    buf[pos / 2] = tmp[pos] + t;
    buf[(pos + len) / 2] = tmp[pos] - t;
  }
}

void in_place_fft(double complex* buf, const uint32_t len) {
  double complex tmp[len];
  memcpy(tmp, buf, sizeof(tmp));
  __fft_int(buf, tmp, len, 1);
}
Unfortunately, the transform gives us decent output only if the input buffer contains nearly-steady signals; the more change there is in the analysis window, the more smeared and unintelligible the frequency-domain image. This means we can’t just take the entire song, run it through the aforementioned C function, and expect useful results.
Instead, we need to chop up the track into small slices, typically somewhere around 20-100 ms. This is long enough for each slice to contain a reasonable number of samples, but short enough to more or less represent a momentary “steady state” of the underlying waveform.
If we run the FFT function on each of these windows separately, each output will tell us about the distribution of frequencies in that time slice; we can also string these outputs together into a spectrogram, plotting how frequencies (vertical axis) change over time (horizontal axis):
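In code, the slicing boils down to a simple loop. The sketch below assumes the headers and the in_place_fft() routine from the earlier snippet; the emit_column() callback and the variable names are placeholders of mine:

#include <stdlib.h>

/* Run the FFT on consecutive, non-overlapping slices of win_size samples
   (win_size must be a power of two) and hand off per-bin magnitudes. */
void spectrogram(const int16_t* samples, uint32_t len, uint32_t win_size,
                 void (*emit_column)(const double* mags, uint32_t bins)) {
  double complex* slice = malloc(win_size * sizeof(double complex));
  double* mags = malloc((win_size / 2) * sizeof(double));
  for (uint32_t pos = 0; pos + win_size <= len; pos += win_size) {
    for (uint32_t i = 0; i < win_size; i++)
      slice[i] = samples[pos + i];
    in_place_fft(slice, win_size);
    for (uint32_t i = 0; i < win_size / 2; i++)
      mags[i] = cabs(slice[i]);                /* magnitude of each bin */
    emit_column(mags, win_size / 2);
  }
  free(slice);
  free(mags);
}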
Alas, the method isn’t conducive to audio editing: if we make separate frequency-domain changes to each window and then convert the data back to the time domain, there’s no guarantee that the tail end of the reconstituted waveform for window n will still line up perfectly with the front of the waveform for window n + 1. We’re likely to end up with clicks and other audible artifacts where the FFT windows meet.
A clever solution to the problem is to use the Hann function for windowing. In essence, we multiply the waveform in every time slice by the value of y = sin²(t), where t is scaled so that each window begins at zero and ends at t = π. This yields a sinusoidal shape that has a value of zero near the edges of the buffer and peaks at 1 in the middle:
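Applying the window is a one-liner; in the sketch below, slice and win_size are stand-ins for whatever buffer holds the current time slice:

/* Multiply the slice by the Hann window, sin^2(pi * i / win_size):
   zero at the edges, peaking at 1 in the middle. */
for (uint32_t i = 0; i < win_size; i++) {
  double w = sin(M_PI * i / win_size);
  slice[i] *= w * w;
}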
It’s hard to see how this would help: the consequence of the operation is that the input waveform is attenuated by a repeating sinusoidal pattern, and the same attenuation will carry over to the reconstituted waveform after FFT.
The trick is to also calculate another sequence of “halfway” FFT windows of the same size that overlap the existing ones (second row below):
This leaves us with one output waveform that’s attenuated in accordance with the repeating sin² pattern that starts at the beginning of the clip, and another waveform that’s attenuated by an identical sin² pattern shifted by one-half of the cycle. The second pattern can also be written as cos².
With this in mind, we can write the equations for the reconstituted waveforms as:
\(\begin{array}{c} out_1(t) = in(t) \cdot \sin^2(t) \\ out_2(t) = in(t) \cdot \cos^2(t) \end{array}\)
If we sum these waveforms, we get:
\(out(t) = in(t) \cdot [\sin^2(t) + \cos^2(t)]\)
This is where we wheel out the Pythagorean identity, an easily derived rule that tells us that the following must hold for any x:
\(\sin^2(x) + \cos^2(x) = 1\)
In effect, the multiplier in the earlier equation for the summed waveform is always 1; the Hann-induced attenuation cancels out.
At the same time, because the signal at the edges of each FFT bucket is attenuated to zero, we get rid of the waveform-merging discontinuities. Instead, the transitions between buckets are gradual and mask any editing artifacts.
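Putting the pieces together, the editing pipeline can be sketched as follows. This is only an outline built on the snippets above: in_place_ifft() is a hypothetical inverse transform that isn’t shown in this article, and process_window() is a placeholder for whatever frequency-domain edit we want to make:

/* Overlap-add sketch: two sequences of Hann-windowed slices, offset by
   win_size / 2, are analyzed, edited, resynthesized, and summed. Because
   sin^2 + cos^2 = 1, the window attenuation cancels out in the sum. */
void process_overlap_add(const float* in, float* out, uint32_t len, uint32_t win_size,
                         void (*process_window)(double complex* slice, uint32_t n)) {
  double complex* slice = malloc(win_size * sizeof(double complex));
  memset(out, 0, len * sizeof(float));
  for (uint32_t start = 0; start + win_size <= len; start += win_size / 2) {
    for (uint32_t i = 0; i < win_size; i++) {
      double w = sin(M_PI * i / win_size);
      slice[i] = in[start + i] * w * w;         /* Hann (sin^2) window */
    }
    in_place_fft(slice, win_size);
    process_window(slice, win_size);            /* frequency-domain edits */
    in_place_ifft(slice, win_size);             /* hypothetical inverse FFT */
    for (uint32_t i = 0; i < win_size; i++)
      out[start + i] += creal(slice[i]);
  }
  free(slice);
}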
Where was I going with this? Ah, right! With this trick up our sleeve, we can goof around in the frequency domain to — for example — selectively shift the pitch of the vocals in our clip:
Source code for the effect is available here. It’s short and easy to experiment with.
I also spent some time approximating the transform for the dog image. In the first instance, some low-frequency components are shifted to higher FFT buckets, causing spurious additional edges to crop up and making Skye look jittery. In the second instance, the buckets are moved in the other direction, producing a distinctive type of blur.
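For the curious, the crudest form of this bucket-shifting step boils down to something like the sketch below. This is not the linked implementation; the function name and the conjugate-mirroring detail are mine, and the approach ignores phase coherence between windows, which is one reason real pitch shifters are more involved:

/* Naive bin shifting: move the contents of each frequency bucket k to
   round(k * ratio), then mirror the result so the spectrum stays
   conjugate-symmetric and the inverse FFT yields a (nearly) real signal. */
void shift_bins(double complex* slice, uint32_t n, double ratio) {
  double complex* out = calloc(n, sizeof(double complex));
  for (uint32_t k = 0; k < n / 2; k++) {
    uint32_t dst = (uint32_t)(k * ratio + 0.5);
    if (dst < n / 2) out[dst] += slice[k];
  }
  for (uint32_t k = 1; k < n / 2; k++)
    out[n - k] = conj(out[k]);
  memcpy(slice, out, n * sizeof(double complex));
  free(out);
}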
PS. Before I get hate mail from DSP folks, I should note that high-quality pitch shifting is usually done in a more complex way. For example, many systems actively track the dominant frequency of the vocal track and add correction for voiceless consonants such as “s”. If you want to go down a massive rabbit hole, this text is a pretty accessible summary.
As for the 20 minutes spent reading this article, you’re not getting that back.