The 90-year-old idea behind JEPA models: Canonical Correlation Analysis

Original link: https://shonczinner.github.io/posts/embedding-prediction/

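This article argues that Joint Embedding Predictive Architectures (JEPA) are essentially a non-linear architectural evolution of Canonical Correlation Analysis (CCA), which Harold Hotelling introduced in 1936. Modern JEPA models aim to learn the common signal between different views of the data, and they share CCA's core objective: maximizing the correlation between multidimensional sets. The author shows that, under whitening constraints, CCA is mathematically equivalent to minimizing the mean squared error between embedding representations, which is exactly the JEPA objective. The main technical differences are that CCA is linear and includes whitening constraints that prevent dimensional collapse, whereas JEPA relies on neural networks and, in its basic form, risks trivial solutions. Recent techniques such as SIGReg address this by enforcing the isotropic distribution constraint inherent to CCA. The author also comments on the debate over JEPA's origins (in particular between Yann LeCun and Jürgen Schmidhuber), arguing that while "making ideas work at scale" is a valid basis for credit, the historical lineage of these concepts remains essential for progress. Ultimately, the author proposes viewing JEPA as an extension of CCA, which offers a useful unifying framework for understanding self-supervised learning architectures.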

Original text

Concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions.

This is the first sentence from the paper “Relations Between Two Sets of Variates” (Hotelling 1936) by statistician and economist Harold Hotelling. This paper introduced Canonical Correlation Analysis (CCA). In modern terminology, “CCA is used to find a common signal among two large matrices” (Bykhovskaya and Gorin 2025).

In JEPA, the objective is the same, except that the second data matrix is simply a different view of the data in the first (e.g. obtained via data augmentation, or via spatial or temporal proximity). One of the recent papers to acknowledge the connection states, “JEPA-based models implicitly perform a non-linear generalization of Canonical Correlation Analysis” (Huang 2026).

CCA’s connection to JEPA is relevant to Schmidhuber’s debate on who invented JEPA, which is directed at Yann LeCun. Personally, I think Hotelling deserves the credit for the idea of maximizing correlation in embedding space.

Of course, the CCA model has many differences from JEPA.

For one, CCA does not enforce a shared encoder. But the biggest difference is that CCA is linear. Non-linear neural variants of CCA have been studied, with the earliest use of the term “Deep CCA” appearing in Andrew et al. (2013).

Connecting JEPA models back to their CCA roots is genuinely useful. For example, another Deep CCA paper (Benton et al. 2017) relaxed the assumption of two sets of variables to an arbitrary number, based on a generalization of CCA proposed in 1961 (Horst 1961). Conceivably, JEPAs could be expanded to handle more than two views as well.

CCA

Suppose we have zero-mean matrices \(X=(x_1,...,x_n)^T\in \mathbb R^{n\times d_x}\) and \(Y=(y_1,...,y_n)^T\in\mathbb R^{n\times d_y}\).

Let \(k\leq \min(d_x,d_y, n)\) and \(A\in \mathbb R^{d_x\times k}\) and \(B\in \mathbb R^{d_y\times k}\) so that \(XA=z_x\in\mathbb R^{n \times k}\) and \(YB=z_y\in\mathbb R^{n \times k}\).

CCA solves the following maximization problem,

\[\max_{A,B} \text{tr}\left(\frac{1}{n}z_x^Tz_y\right) \] \[\text{s.t.}\] \[\frac{1}{n}z_x^Tz_x=\frac{1}{n}z_y^Tz_y=I\]

This maximizes the trace of the cross-correlation matrix between the two embeddings, while constraining the dimensions of each embedding to have unit variance and zero covariance with one another.
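One standard way to solve this constrained problem is to whiten each view and take an SVD of the whitened cross-covariance. The NumPy sketch below is illustrative only: the function names `inv_sqrt` and `linear_cca`, the synthetic data, and the assumption that both covariance matrices are full rank are mine, not part of the original post.

```python
import numpy as np

def inv_sqrt(C, eps=1e-10):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def linear_cca(X, Y, k):
    """Return A, B and the top-k canonical correlations for views X, Y."""
    n = X.shape[0]
    X = X - X.mean(axis=0)           # enforce the zero-mean assumption
    Y = Y - Y.mean(axis=0)
    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # SVD of the whitened cross-covariance; singular values are the
    # canonical correlations, sorted in decreasing order.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]
    B = Wy @ Vt.T[:, :k]
    return A, B, s[:k]

# Synthetic two-view data sharing a 2-dimensional latent signal (assumed setup).
rng = np.random.default_rng(0)
n, d_x, d_y, k = 500, 5, 4, 2
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, d_x)) + 0.5 * rng.normal(size=(n, d_x))
Y = latent @ rng.normal(size=(2, d_y)) + 0.5 * rng.normal(size=(n, d_y))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

A, B, rho = linear_cca(X, Y, k)
z_x, z_y = X @ A, Y @ B
print("canonical correlations:", rho)
print("whitening holds for z_x:", np.allclose(z_x.T @ z_x / n, np.eye(k)))
```

With this construction the objective value \(\text{tr}(\frac{1}{n}z_x^Tz_y)\) equals the sum of the top \(k\) singular values, and both whitening constraints hold by design.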

Similar to the equivalence between maximizing variance and minimizing prediction error in solving PCA, we have a relationship between the trace of the cross-correlation matrix and embedding prediction error. Writing \(z_x^{(i)}\) and \(z_y^{(i)}\) for the \(i\)-th rows of \(z_x\) and \(z_y\),

\[\frac{1}{n}\sum_{i=1}^n ||z_x^{(i)}-z_y^{(i)}||^2=\frac{1}{n}||z_x-z_y||_F^2= \frac{1}{n}\text{tr}(z_x^Tz_x) + \frac{1}{n}\text{tr}(z_y^Tz_y) - \frac{2}{n}\text{tr}(z_x^Tz_y)\]

And due to the whitening constraints, each of the first two terms equals \(\text{tr}(I)=k\), so

\[\frac{1}{n}||z_x-z_y||_F^2=2k- \frac{2}{n}\text{tr}(z_x^Tz_y)\]

So maximizing the trace of the cross-correlation under the whitening constraints is equivalent to minimizing the MSE of the embedding representations. Therefore we can write CCA as,
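The identity can be checked numerically. The following is a minimal sketch under my own assumptions: the helper `whiten` and the random embeddings are hypothetical and stand in for any pair of embeddings that satisfy the whitening constraints.

```python
import numpy as np

def whiten(Z):
    """Center Z and rescale it so that (1/n) Z^T Z = I."""
    Z = Z - Z.mean(axis=0)
    C = Z.T @ Z / Z.shape[0]
    w, V = np.linalg.eigh(C)
    return Z @ V @ np.diag(w ** -0.5) @ V.T

rng = np.random.default_rng(1)
n, k = 200, 3
shared = rng.normal(size=(n, k))
z_x = whiten(shared + 0.3 * rng.normal(size=(n, k)))
z_y = whiten(shared + 0.3 * rng.normal(size=(n, k)))

# Left-hand side: mean squared distance between paired embeddings.
mse = np.mean(np.sum((z_x - z_y) ** 2, axis=1))
# Right-hand side: 2k minus twice the trace of the cross-correlation.
trace_term = np.trace(z_x.T @ z_y) / n
print(mse, 2 * k - 2 * trace_term)
```

The two printed numbers should agree up to floating-point error, so any change that raises the trace lowers the MSE by exactly twice that amount.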

\[\min_{A,B} \frac{1}{n}\sum_{i=1}^n ||z_x^{(i)}-z_y^{(i)}||^2\] \[\text{s.t.}\] \[\frac{1}{n}z_x^Tz_x=\frac{1}{n}z_y^Tz_y=I\]

JEPA

Adopting the previous notation, JEPA is constrained to \(d_x=d_y=d\) because both views pass through the same encoder (the joint embedding). In JEPA, we have the encoder \(f_\theta:\mathbb R^{d}\rightarrow \mathbb R^k\), and predictor \(g_\varphi:\mathbb R^{k}\rightarrow \mathbb R^k\).

Let \(z_x^{(i)}=g_\varphi(f_\theta(x_i))\), \(z_y^{(i)}=f_\theta(y_i)\).

Then we solve,

\[\min_{\theta,\varphi}\frac{1}{n} \sum_{i=1}^n ||z_x^{(i)}-z_y^{(i)}||^2\]

Note the similarity in the objective function but the lack of whitening constraints. Without them, the model is prone to representational and dimensional collapse. For example, a trivial solution to the above problem is \(z_x^{(i)}=z_y^{(i)}=c\) for every \(i\), where \(c\) is a constant vector.
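The collapse can be made concrete with a toy example. In this sketch the "encoder" is a hypothetical linear map with zero weights and a constant bias, and the predictor is taken to be the identity; both choices are assumptions made purely for illustration.

```python
import numpy as np

def jepa_loss(z_x, z_y):
    """Mean squared distance between predicted and target embeddings."""
    return np.mean(np.sum((z_x - z_y) ** 2, axis=1))

rng = np.random.default_rng(2)
n, d, k = 100, 8, 4
x = rng.normal(size=(n, d))
y = x + 0.1 * rng.normal(size=(n, d))   # a second, perturbed view of x

# Hypothetical collapsed encoder: ignores its input and outputs the constant c.
W = np.zeros((d, k))
c = np.ones(k)

def collapsed_encoder(v):
    return v @ W + c

z_x = collapsed_encoder(x)   # predictor assumed to be the identity
z_y = collapsed_encoder(y)

loss = jepa_loss(z_x, z_y)
cov = np.cov(z_x, rowvar=False, bias=True)
print("loss:", loss)
print("embedding covariance is all zeros:", np.allclose(cov, 0.0))
```

The loss is minimized even though the embeddings carry no information about the input. The embedding covariance is the zero matrix, not the identity, so this solution would be ruled out by the whitening constraints in CCA.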

As discussed in my previous blog post, SIGReg (Balestriero and LeCun 2025) fixes this problem. What does it do? It encourages the embeddings \(z_x\) and \(z_y\) to follow an isotropic (i.e. unit variance, uncorrelated) Gaussian distribution. As a result it encourages,

\[\frac{1}{n}z_x^Tz_x=\frac{1}{n}z_y^Tz_y=I\]
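To illustrate what this second-moment condition rules out, here is a simple covariance-to-identity penalty. This is not SIGReg, which targets the full isotropic Gaussian distribution; it is a simpler stand-in that only measures how far \(\frac{1}{n}z^Tz\) is from \(I\). The function name `whitening_penalty` and the test embeddings are hypothetical.

```python
import numpy as np

def whitening_penalty(Z):
    """Squared Frobenius distance between the embedding covariance and I."""
    Z = Z - Z.mean(axis=0)
    C = Z.T @ Z / Z.shape[0]
    return np.sum((C - np.eye(Z.shape[1])) ** 2)

rng = np.random.default_rng(3)
n, k = 5000, 4

isotropic = rng.normal(size=(n, k))                  # roughly satisfies the constraint
collapsed = np.ones((n, k))                          # representational collapse
rank_one = np.outer(rng.normal(size=n), np.ones(k))  # dimensional collapse

p_iso = whitening_penalty(isotropic)
p_col = whitening_penalty(collapsed)
p_rank = whitening_penalty(rank_one)
print(p_iso, p_col, p_rank)
```

Adding a term like this to the JEPA loss would penalize both the constant solution and embeddings whose dimensions are copies of one another, which is the role the whitening constraints play in CCA.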

As I mentioned in the introduction, Schmidhuber has debated who invented JEPA and said this about LeCun,

Dr. LeCun’s heavily promoted Joint Embedding Predictive Architecture (JEPA) is the heart of his new company. However, the core ideas are not original to LeCun. Instead, JEPA is essentially identical to our 1992 Predictability Maximization system.

Schmidhuber references Yann LeCun’s response,

JEPA is merely a name for a general concept. The question is, and has always been, how do you make it work (particularly how do you prevent it from collapsing), and how do you make it work at scale with SOTA results on non-toy problems. That’s the hard part. Ideas are a dime a dozen. Making them work is what the community will give you credit for.

Do I agree with LeCun? Yes and no.

Yes, because of course you will get credit for making things work, and ideas are indeed arguably “a dime a dozen”.

No, because the thread of citations is important for progress. If important citations are missed, whether intentionally or not, the correct thing to do is just add them. We’re all only the better for doing so. The connection that JEPA models have to CCA is informative.

My opinion is that JEPA/Predictability Maximization models are architectural enhancements layered on top of CCA. Non-linearity is an enhancement.

Ultimately, these models all have the same objective function introduced by CCA: find the transformations that result in maximal correlation between sets of multidimensional data.

References

Andrew, Galen, Raman Arora, Jeff Bilmes, and Karen Livescu. 2013. “Deep Canonical Correlation Analysis.” International Conference on Machine Learning, 1247–55. https://proceedings.mlr.press/v28/andrew13.html.

Balestriero, Randall, and Yann LeCun. 2025. LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics. https://arxiv.org/abs/2511.08544.

Benton, Adrian, Huda Khayrallah, Biman Gujral, Dee Ann Reisinger, Sheng Zhang, and Raman Arora. 2017. Deep Generalized Canonical Correlation Analysis. https://arxiv.org/abs/1702.02519.

Bykhovskaya, Anna, and Vadim Gorin. 2025. Canonical Correlation Analysis: Review. https://arxiv.org/abs/2411.15625.

Horst, Paul. 1961. “Generalized Canonical Correlations and Their Application to Experimental Data.” Journal of Clinical Psychology.

Hotelling, Harold. 1936. “Relations Between Two Sets of Variates.” Biometrika 28 (3/4): 321–77. http://www.jstor.org/stable/2333955.

Huang, Yongchao. 2026. VJEPA: Variational Joint Embedding Predictive Architectures as Probabilistic World Models. https://arxiv.org/abs/2601.14354.