Meta's Un-Stable Signature

Original link: https://hackerfactor.com/blog/index.php?/archives/1098-Metas-Un-Stable-Signature.html

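The author's investigation of several mainstream AI watermarking algorithms (Google's SynthID, Adobe's TrustMark, and Meta's Stable Signature) shows that these systems are far less reliable than their developers claim. The core problem stems from a fundamental statistical error: all three companies assume that the bits produced by their neural networks are independent and uniformly distributed. In practice, the networks produce biased, non-random output clusters and structural "attractors", which lead to high false positive rates. Although the developers advertise "1 in a million" accuracy, empirical testing puts the actual error rates closer to 1 in 4 for Meta, 1 in 5 for Adobe, and 1 in 20 for Google. The author argues that these systems commonly suffer from "representation collapse": the neural networks cannot produce the random distribution that valid statistical identification requires. Because regulators and legal systems increasingly rely on these tools to verify content, this unreliability raises serious concerns. Ultimately, the author concludes that modern AI-based watermarking is fundamentally flawed and not yet suitable for high-stakes uses such as legal evidence, insurance, or mandatory regulatory compliance.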


Original text
I'm wrapping up my investigation into invisible watermark algorithms and I am extremely disappointed. Not only do none of the modern AI-based algorithms work as they claim, it turns out that they are all making the same fundamental mistake.

I previously evaluated Google's SynthID and Adobe's TrustMark algorithms. Both of them claim to have incredibly accurate results.

  • According to Google's peer-reviewed and published paper, SynthID has a true positive rate (TPR) above 99.97% -- meaning that it will miss its own watermarks fewer than 3 in 10,000 times. However, my own empirical testing found that it is much closer to 1 in 20. Moreover, SynthID is proprietary and only accessible through Google's "Gemini" AI system. Gemini has been observed hallucinating results and providing contradictory conclusions depending on how the question is phrased.

  • According to Adobe's Content Authenticity Initiative, their TrustMark "can exceed 96% bit accuracy at around 42-45dB PSNR quality under severe noise degradations". However, that statistic focuses on resilience and not accuracy. In my empirical tests, I found that TrustMark has a 10%-20% false positive rate, effectively making it useless. (If you see a TrustMark signature, then it is very likely random noise and not an actual signature.)

This time, I evaluated Meta's "Stable Signature" algorithm. (Their paper and code are on GitHub.) This system encodes a 48-bit sequence into the picture's visual content. The idea is that you can encode a unique 48-bit sequence as your watermark. If your decoder finds the same 48-bit sequence, then it can identify your own watermark.

WARNING: This blog entry leans heavily into math and statistics to prove that Stable Signature, TrustMark, and SynthID are nowhere near as reliable as their developers claim.

The Basic Algorithm

Traditional (non-AI) invisible watermarks typically hide in subtle locations, such as the least significant bits, changes in brightness (e.g., Digimarc) or the frequency spectrum (DCT or FFT). There is always the risk that image encoding could corrupt the hidden data, so these algorithms typically rely on repetition over the image to help identify the true signal. In addition, they may include error correction code (extra bits in the data) to fix any minor data errors.

However, there is a problem with the traditional approaches: injecting hidden data in the image could create visible distortions. The modern approach uses an AI system to better hide the data with less added distortion.

As with SynthID and TrustMark, Stable Signature encodes binary data and uses an AI model to decide where to hide it in the image. The AI is tuned to minimize visible distortions when embedding the data. Later, an AI-based decoder looks at the image, identifies the likely locations where bits are stored, and extracts the data.

There is always the chance that the data is mixed with noise. Different AI-based watermarking systems rely on different techniques for reducing the noise. For example:

  • Google's SynthID only stores a few bits of data (effectively a flag or version number). This allows them to use a lot of data as repetition and to increase the accuracy rate.

  • Adobe's TrustMark uses the Bose-Chaudhuri-Hocquenghem (BCH) algorithm. This acts as a combination of checksum and error correcting code that should reduce the number of errors.

Meta's Stable Signature uses a simple Hamming distance.

The Hamming distance counts the number of bit positions in which two equal-length codes differ, that is, how many bits would have to be flipped to turn one code into the other. In effect, the scheme defines a set of stable states (e.g., 10110 and 11000) and places a ring around each state that represents the single-bit changes. If you change enough bits, then you will reach a different stable state.
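As a minimal sketch of the idea (the function names `hamming_distance` and `matches` are illustrative and not taken from Meta's code), the distance between two bit strings and a threshold test can be written as:

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def matches(candidate: str, signature: str, threshold: int) -> bool:
    """Treat the candidate as the signature if it is within the threshold."""
    return hamming_distance(candidate, signature) <= threshold


# The two example states from the text differ in three positions.
print(hamming_distance("10110", "11000"))  # 3
```

With a threshold of 1, the code 10111 would still be read as 10110, while 11000 would not.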

According to Meta's Stable Signature research paper, the 48 bits should be uniformly distributed, and the paper cites a "false positive rate below 10^-6", or 1 in one million. This means you can choose a 48-bit sequence to use as your signature. Every picture will generate a 48-bit sequence, and the sequence can vary a little based on noise in the picture. However, if you find a code that is within a short Hamming distance of your code (e.g., within 6 bits difference), then you can determine that it is the same code with a high reliability.

At least, that's the theory.

Empirical Testing

I went into this experiment assuming that everything works like they claim. I want to be able to reliably identify invisible watermarks associated with Meta. What I don't know is what sequence they use, or whether they use multiple codes depending on whether it comes from Meta's AI system, Facebook, Instagram, WhatsApp, etc.

Fortunately, this is something I can test! I grabbed an uncurated sample of pictures from FotoForensics: the first 10,000 unique images uploaded last month (May 2026). If the bit sequences are uniformly distributed with a "1 in 1 million" collision rate, then I should see a huge number of unique bit sequences and a few small clusters around pictures from Meta (Meta AI, Facebook, Instagram, etc.). Those clusters will represent the invisible watermarks used by Meta.

The results from my empirical test were definitely not what I expected. I found:

It's not just one random cluster that is massively large (450 pictures out of 10,000). There's a cluster of 184 pictures at 110101001011001011001011111000100111001000011101, 58 pictures at 110100000011111010001001111000100111011000011101, etc. I found over 60 clusters with more than 10 pictures each at a Hamming distance of 6. That should not happen with a "1 in 1 million" collision rate.

Independent Analysis

I went back to Meta's research paper to see if I could find the discrepancy. And there it was, in section 3.1: They tested their system against the hypothesis that the 48 bits are each independent and uniformly distributed. The problem is, they use one neural network to generate the bits. That explicitly means that the bits are dependent, not independent.

Their paper assumes a binomial distribution. That is, given an arbitrary image, each of the 48 bits is treated as an independent, fair coin flip. The math becomes:

$$P(X \le T) = \sum_{k=0}^{T} \binom{48}{k} (0.5)^{k} (0.5)^{48-k}$$

This computes the probability that 48 random bits land within a Hamming distance (T) of a given signature. The probabilities table becomes:

| Hamming Distance Threshold (T) | Bit Error Rate (BER) | Probability of a Random Image Matching by Chance |
| --- | --- | --- |
| 14 bits or fewer | ≤ 29.17% | 1 in 362.63 |
| 13 bits or fewer | ≤ 27.08% | 1 in 957.81 |
| 12 bits or fewer | ≤ 25.00% | 1 in 2,788.35 |
| 11 bits or fewer | ≤ 22.92% | 1 in 8,999.08 |
| 10 bits or fewer | ≤ 20.83% | 1 in 32,416.80 |
| 9 bits or fewer | ≤ 18.75% | 1 in 131,390.28 |
| 8 bits or fewer | ≤ 16.67% | 1 in 605,094.89 |
| 7 bits or fewer | ≤ 14.58% | 1 in 3.20 Million |
| 6 bits or fewer | ≤ 12.50% | 1 in 19.83 Million |
| 5 bits or fewer | ≤ 10.42% | 1 in 146.19 Million |
| 4 bits or fewer | ≤ 8.33% | 1 in 1.32 Billion |
| 3 bits or fewer | ≤ 6.25% | 1 in 15.24 Billion |
| 2 bits or fewer | ≤ 4.17% | 1 in 239.15 Billion |
| 1 bit or fewer | ≤ 2.08% | 1 in 5.74 Trillion |
| 0 bits (perfect match) | = 0.00% | 1 in 281.47 Trillion |
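The table can be reproduced with a few lines of Python. This is a sketch of the binomial tail under the paper's independence assumption; the function name `chance_match_probability` is illustrative.

```python
from math import comb

BITS = 48  # signature length used by Stable Signature


def chance_match_probability(threshold: int, bits: int = BITS) -> float:
    """P(X <= threshold) for X ~ Binomial(bits, 0.5)."""
    favorable = sum(comb(bits, k) for k in range(threshold + 1))
    return favorable / 2**bits


# Print one row per threshold, from 14 bits down to a perfect match.
for t in range(14, -1, -1):
    p = chance_match_probability(t)
    print(f"T={t:2d}  BER<={t / BITS:7.2%}  1 in {1 / p:,.2f}")
```

Because the sum uses exact integer binomial coefficients, the only rounding happens in the final division.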

Meta's paper says that they use a Hamming distance of 7 bits (requiring 41 of 48 bits), which matches their claim of a "false positive rate below 10^-6". However, I'm seeing problems at a Hamming distance of 6 (should be 1 in 20 million) and even collisions at 0 (1 in 281 trillion)!

The Core Problem

There is clearly a discrepancy between the theoretical probabilities and the empirical testing. When I looked back over Meta's research paper, I saw the problem:

According to Meta's paper, each of the 48 bits is independent. In a perfectly independent 48-bit hypercube, un-watermarked images should scatter uniformly across all 2^48 possible values. However, neural networks map a non-linear manifold (a multi-dimensional wavy surface) through this hypercube. This mathematical landscape is warped with its own peaks, ravines, and valleys. It has attractors that form clusters, and repulsers that form voids where stable values can never exist; this is a feature of a neural network. And most importantly, the output bits are explicitly not independent.

The left diagram illustrates the expected uniform distribution if all of the bits were independent. The right diagram shows the types of theoretical clusters that form when the bits are dependent. There should be clusters around attractors and voids (areas with no dots) from the repelling regions.

Moving from theoretical to empirical, I graphed the data. The 48 bits can be represented as bytes. I took the first 24 bits and converted them into 8-bit red, green, and blue pixel colors. If the data is truly random, then the colored dots should be distributed across the RGB cube. However, if the bits are dependent, then there should be very clear clusters, structures, and voids. Here's the graph:

Yes, there are very clear structures that look like planes and lines. Within the planes are clusters, and outside the planes are very large voids -- areas where there are no dots at all. The data generated by Meta's Stable Signature implementation fails this basic test for independence.
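The bit-to-color mapping described above can be sketched as follows. The function name `signature_to_rgb` is illustrative, and the plotting step is left out; this only shows how the first 24 bits become one point in the RGB cube.

```python
def signature_to_rgb(signature: str) -> tuple[int, int, int]:
    """Map the first 24 bits of a bit string to an (R, G, B) triple."""
    if len(signature) < 24 or set(signature) - {"0", "1"}:
        raise ValueError("expected a bit string of at least 24 bits")
    # Each 8-bit slice becomes one color channel in the range 0..255.
    r, g, b = (int(signature[i:i + 8], 2) for i in (0, 8, 16))
    return r, g, b


# One of the cluster centers quoted earlier in the post.
sig = "110101001011001011001011111000100111001000011101"
print(signature_to_rgb(sig))  # (212, 178, 203)
```

Every picture whose first 24 bits agree lands on the same point, so dense clusters show up as stacked dots.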

The biggest cluster that I found represents a Zero Signal Bias (ZSB). When their neural network doesn't find a watermark, it moves the 48 bits toward a strong attractor, like a massive gravitational well. At a threshold of 6 bit errors, a chance collision should happen around 1 in 20 Million times. But in reality, my 10,000 pictures had a cluster of 450 images within 6 bits due to the ZSB. That's an error rate of around 1 in 22 with the ZSB alone. If we add in all of the other clusters that contain at least 10 pictures, then 2,327 pictures are in various clusters; we're looking at an error rate around 1 in 4 -- and that's at a Hamming distance of 6, which is more conservative than their paper's Hamming distance of 7. (In AI terms, this is a representation collapse or structural bias that is typical for deep neural networks.)
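The post does not describe how the clusters were computed, so the following is only one plausible sketch: collect every signature within a Hamming distance of a chosen center, then express the hit count as a "1 in N" rate. The names `cluster_members` and `one_in` are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_members(signatures, center, threshold=6):
    """Return the signatures that lie within the threshold of the center."""
    return [s for s in signatures if hamming_distance(s, center) <= threshold]


def one_in(hits: int, total: int) -> float:
    """Express hits out of total as the N in '1 in N'."""
    return total / hits


# Toy data: two signatures sit near the all-zero center, one does not.
toy = ["00000000", "00000011", "11111111"]
print(len(cluster_members(toy, "00000000", threshold=2)))  # 2

# The counts reported in the post.
print(round(one_in(450, 10_000), 1))   # 22.2
print(round(one_in(2327, 10_000), 1))  # 4.3
```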

(As an aside: Given their "1 in 1 million" claim, I could look for any clusters of 2 or more pictures. At clusters of 2 or larger, 5,237 of the 10,000 test images were in clusters, or 52%. If you show their algorithm 10,000 pictures, then there is a better-than 50% chance of a false positive match.)
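For comparison, here is a rough estimate of how many close pairs the independence model itself would predict. It assumes uniformly random 48-bit signatures and uses linearity of expectation over all pairs of images; the function name `expected_close_pairs` is illustrative.

```python
from math import comb


def expected_close_pairs(n_images: int, threshold: int, bits: int = 48) -> float:
    """Expected number of image pairs within the Hamming threshold,
    assuming every signature is uniformly random and independent."""
    p = sum(comb(bits, k) for k in range(threshold + 1)) / 2**bits
    return comb(n_images, 2) * p


print(round(expected_close_pairs(10_000, 6), 2))  # 2.52
```

Under that model, 10,000 pictures should yield only two or three pairs within 6 bits of each other, which is far from thousands of pictures sitting in clusters.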

Less Than Random

It's one thing for me to claim that there are visible clusters and to show pictures of clusters, but another to prove it mathematically. (Time to dust off my college textbooks from "Introduction to Statistics"...)

I fed Meta's code the first 10,000 images from May 2026. A few of the images were in unsupported formats (HEIC, WebP, and a few corrupted JPEG files), resulting in 9,847 viable pictures. I evaluated this data with elements from the NIST Statistical Test Suite (SP 800-22) for randomness, including a monobit test and Chi-Squared (χ²) test for independence.

The monobit test checks whether ones and zeros occur about equally often, as they would in a uniformly random bit stream.

  • Total Bits Processed: 9,847 pictures × 48 bits per signature = 472,656 bits
  • Observed Count of Ones ('1'): 266,419
  • Observed Count of Zeros ('0'): 206,237
  • Expected Count (E): 236,328 for each.

Running a standard Chi-Square Goodness-of-Fit test for this bit balance:

$$\chi^2 = \frac{(266419 - 236328)^2}{236328} + \frac{(206237 - 236328)^2}{236328} = 3831.40 + 3831.40 = 7662.81$$

  • In mathemat-ese: with 1 degree of freedom, a χ² statistic of 7,662.81 yields a p-value indistinguishable from 0 (p ≪ 10^-100). (As an aside, most Chi-square tables only evaluate 1 degree of freedom up to around χ² = 10. This χ² value is so astronomically high that the probability p effectively becomes zero.)

  • In English: That's definitely not random or independent.

The watermark extraction is strongly biased toward producing 1s over 0s across global arbitrary images (roughly 56% ones to 44% zeros). This immediately violates the uniform distribution assumption.
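The bit-balance test can be recomputed from the two counts listed above. This is a sketch using only the standard library; it relies on the fact that, for 1 degree of freedom, the upper-tail probability of a χ² value x equals erfc(sqrt(x / 2)). The function name `monobit_chi_square` is illustrative.

```python
from math import erfc, sqrt


def monobit_chi_square(ones: int, zeros: int) -> tuple[float, float]:
    """Chi-square goodness-of-fit against a 50/50 split (1 degree of freedom)."""
    expected = (ones + zeros) / 2
    chi2 = ((ones - expected) ** 2 + (zeros - expected) ** 2) / expected
    # Upper-tail probability for 1 degree of freedom; underflows to 0.0 for huge chi2.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


chi2, p = monobit_chi_square(ones=266_419, zeros=206_237)
print(f"chi2 = {chi2:.2f}, p = {p:.3g}, ones = {266_419 / 472_656:.1%}")
```

In double precision the p-value prints as 0 because the true value is far smaller than the smallest representable float.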

The second test is the Chi-Square (χ²) Test for Serial Independence. If the bits were independent, the transition probability between adjacent bits would just be the product of their individual probabilities. This table shows the occurrence rate of the transition pairs across all of the observed 10,000 (well, 9,847) pictures:

| Transition Pair | Observed Count (O) | Expected Count under Independence (E) |
| --- | --- | --- |
| 0 to 0 | 106,750 | 90,051 |
| 0 to 1 | 95,296 | 116,186 |
| 1 to 0 | 95,302 | 116,186 |
| 1 to 1 | 165,461 | 149,976 |

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

$$\chi^2 = \frac{16699^2}{90051} + \frac{(-20890)^2}{116186} + \frac{(-20884)^2}{116186} + \frac{15485^2}{149976} = 3096.7 + 3756.2 + 3754.0 + 1599.0 = 12{,}205.9$$

  • In mathemat-ese: With 1 degree of freedom for the transition contingency table (accounting for fixed margins), a χ² value of 12,205.9 gives a p-value of 0.0.

  • In English: Ain't no way this is random or independent.
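A serial-independence check of this kind can be sketched as below. It counts adjacent-bit transitions inside each signature and derives the expected counts from the table's row and column totals, which is the textbook approach for a 2x2 contingency table and may differ from how the "Expected" column above was produced. The names `transition_counts` and `serial_chi_square` are illustrative.

```python
from collections import Counter


def transition_counts(signatures):
    """Count (previous_bit, next_bit) pairs over adjacent positions in each string."""
    counts = Counter()
    for s in signatures:
        counts.update(zip(s, s[1:]))
    return counts


def serial_chi_square(counts) -> float:
    """Pearson chi-square for the 2x2 transition table, expected counts from the margins."""
    total = sum(counts.values())
    chi2 = 0.0
    for a in "01":
        for b in "01":
            row = counts[(a, "0")] + counts[(a, "1")]
            col = counts[("0", b)] + counts[("1", b)]
            expected = row * col / total
            if expected:
                chi2 += (counts[(a, b)] - expected) ** 2 / expected
    return chi2


# Toy input: three 4-bit strings give 9 adjacent pairs in total.
toy = transition_counts(["0011", "1100", "0101"])
print(dict(toy), serial_chi_square(toy))
```

A 48-bit signature contributes 47 adjacent pairs, so 9,847 signatures produce 462,809 transitions, matching the sum of the observed column.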
And as if this wasn't conclusive enough, there are other tests we could apply:
  • Static Tail Patterns: Looking closely at the end of the 48-bit sequences, a large share of the strings end in ...111101 or ...00111101. Additionally, bit position 46 is nearly always "1" (228 zeros vs 9,619 ones, so it is "1" 97.7% of the time), position 47 is usually "0" (8,958 of 9,847 images, or 90.97%), and position 48 is nearly always "1" (9,696 images, or 98.5%) across thousands of uncurated, real-world images.

  • Structural Clustering: Certain bit columns share an extraordinarily high Mutual Information score (I(X;Y)). For example, knowing the output of bit position 12 gives you better than an 80% accuracy in predicting bit position 28.
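Both checks in the list above can be sketched in a few lines. The code assumes signatures are stored as bit strings and that positions are 1-based, as in the post (position 48 is the last bit); the function names are illustrative.

```python
from collections import Counter
from math import log2


def ones_fraction(signatures, position: int) -> float:
    """Fraction of signatures with a '1' at the 1-based bit position."""
    return sum(s[position - 1] == "1" for s in signatures) / len(signatures)


def mutual_information(signatures, pos_x: int, pos_y: int) -> float:
    """I(X;Y) in bits between two 1-based bit positions."""
    n = len(signatures)
    joint = Counter((s[pos_x - 1], s[pos_y - 1]) for s in signatures)
    px = Counter(s[pos_x - 1] for s in signatures)
    py = Counter(s[pos_y - 1] for s in signatures)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# Toy data: the two bits always agree, so one bit fully predicts the other.
print(mutual_information(["00", "11", "00", "11"], 1, 2))  # 1.0
# Toy data: all four combinations are equally likely, so the bits share nothing.
print(mutual_information(["00", "01", "10", "11"], 1, 2))  # 0.0
```

For independent fair bits, every position would have a ones fraction near 0.5 and every pair of positions a mutual information near 0.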
The assumption of a "uniform distribution over arbitrary pictures" relies on the idealistic premise that random natural image features project uniformly across the decision boundaries of a network. However, because the extraction network maps inputs to a constrained, highly continuous hyper-dimensional manifold, the network's latent layers natively enforce structural smoothness.

For the TL;DR crowd:
Meta's researchers made a fundamental mistake when computing their accuracy rates. It's not a "1 in 1 million" chance of a false match, it's closer to 1 in 4 -- because the 48 bit values per signature are not independent.

As I re-read Meta's research paper, I realized that the statistical error wasn't an oversight; Meta's researchers explicitly acknowledged the problem. In their paper (Section 4.1), they wrote:

"Second, we observed that W’s output bits for vanilla images are correlated and highly biased, which violates the assumptions of Sec. 3.1 [the section about independent statistical test methods]."

In other words, they recognized that the extracted bits are not independent. Despite this, their published false-positive analysis still relies on the assumption that the bits are independent.

Widespread Problems

Knowing that Meta's accuracy rate is grossly inflated due to assuming bit-wise independence when there is none, I looked back over Google's and Adobe's papers for their own watermarks. Did Google's and Adobe's researchers make this same mistake?
  • Google's SynthID research paper talks in terms of True Positive Rates (TPR). They do make this same "bit-wise independent" mistake, but it's obfuscated in the paper. You can see the error in their Equation 3 (PDF page 8), where they assume there is a uniform (independent) distribution. Their paper hyperfocuses on the true positive rate and never addresses the false positive distribution. (Either they didn't know to look, or they knew and decided to not report it because it would expose a serious weakness in their solution.)

  • Adobe's TrustMark research paper also makes assumptions of independence. You can see this in their PDF with the binary cross-entropy loss in Section 3.1.4. This mathematically treats each bit position as an independent Bernoulli trial. (By definition, a Bernoulli process strictly requires independence.) In their experiments (Section 4.1), they wrote "At test time, every image is associated with a random watermark", but they never tested if the random watermarks were similar to each other.
This introduction-to-statistics mistake is found in all three of these invisible watermarking technologies. The detections produced by these systems are so unreliable that an analyst cannot determine whether a reported detection is real or a false positive, or whether a reported non-detection is genuine or a false negative.

It's also worth noting that, shortly after releasing Stable Signature, Meta developed another algorithm: Pixel Seal. (Not to be confused with my own Secure Evidence Attribution Label / SEAL technology.) Pixel Seal moves to a 256-bit payload to increase the capacity, and their related model, Chunky Seal, pushes up to 1024 bits. While Meta's approach focuses heavily on addressing the invisibility side using an adversarial-only discriminator, the underlying approach still uses a neural network mapping. Using more bits only exacerbates this flaw.

Potential Uses

Algorithms can have uses. For example, Meta, Google, and Adobe are training their own AI models on images that they encounter. To prevent poisoning their training sets, they want to exclude images generated by their own systems. In this regard, watermarking does help them. For example, if Meta excludes an extra 25% of images (from false positives), then they still have a lot of images that they can train on.

However, that same usage does not work with legal cases. For example, consider an insurance company. Most insurance claims today include photographic evidence. The company wants camera-original photos, but has to use whatever the customer submits. The problem is that there is a lot of insurance fraud. In theory, seeing a watermark from an AI system like Meta, Google, or Adobe, should be great for identifying and ruling out fraud. Unfortunately, Stable Signature, SynthID, and TrustMark are so inaccurate that none of them can be trusted; it's not even worth testing to see if customer photos contain these invisible watermarks.

For these watermarking systems, I'm talking about very high error rates: roughly 1-in-4 for Meta, 1-in-5 for Adobe, and 1-in-20 for Google. But let's pretend that they work much better, like a 1-in-20,000 false positive rate. An insurer processing 100,000 claims per month would expect to accuse around 5 completely honest customers of fraud each month. Falsely denying 5 out of 100,000 claims? That creates a toxic customer service nightmare, severe legal liability, and fines from regulatory bodies for bad-faith claim denials. This could even become a class-action lawsuit that they couldn't win.
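The arithmetic behind the hypothetical insurer can be sketched as follows. It assumes every claim is honest and that each check is an independent trial with the stated false positive rate; the function names are illustrative.

```python
def expected_false_accusations(claims: int, one_in: int) -> float:
    """Expected number of honest claims flagged at a 1-in-N false positive rate."""
    return claims / one_in


def chance_of_at_least_one(claims: int, one_in: int) -> float:
    """Probability that at least one honest claim is flagged."""
    return 1 - (1 - 1 / one_in) ** claims


print(expected_false_accusations(100_000, 20_000))       # 5.0
print(f"{chance_of_at_least_one(100_000, 20_000):.1%}")  # 99.3%
```

Even at this optimistic rate, a month with no false accusation at all would be the exception.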

As bad as it is for insurance and financial institutions, there are much higher stakes at play. The EU AI Act (Article 50(2)), China's GB 45438-2025, California SB 942, and similar legislation are moving toward mandating AI content watermarking.

The failure of these three leading systems, from three Fortune-500 companies, to meet their own claimed accuracy rates is not just an academic curiosity. Regulators and courts will employ these systems for attribution and fraud detection. Reliable AI-based watermarking technology is not ready.

Three companies. Three algorithms. Three different research teams. The same fundamental error. The false positives won't go on trial. People will.
