LoGeR – 3D reconstruction from extremely long videos (DeepMind, UC Berkeley)

Original link: https://loger-project.github.io

## LoGeR: Scaling 3D Reconstruction to Long Videos

LoGeR is a new method for 3D reconstruction from video, designed to overcome the limitations of current approaches on extremely long sequences (up to 19,000 frames). Conventional feedforward reconstruction methods struggle with both computational cost (the "context wall") and generalization to large environments (the "data wall").

LoGeR addresses this with a **hybrid memory architecture** that combines **Sliding Window Attention (SWA)** for precise local alignment with **Test-Time Training (TTT)** for long-range global consistency. It processes the video in chunks, using SWA to preserve high-fidelity geometry *across* chunks and TTT to prevent scale drift over the full sequence.

This chunk-based design achieves sub-quadratic scaling without sacrificing accuracy. LoGeR shows substantial improvements over existing methods on both long-horizon settings (kilometer-scale trajectories) and standard short-sequence benchmarks such as KITTI and 7-Scenes, achieving state-of-the-art reconstruction and pose accuracy while running faster. The code and paper are publicly released.


Original article
LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

🚧 Under construction

Junyi Zhang1,2 Charles Herrmann1,* Junhwa Hur1,* Chen Sun1 Ming-Hsuan Yang1
Forrester Cole1 Trevor Darrell2 Deqing Sun1,†
1 Google DeepMind      2 UC Berkeley (*: Project leads, †: Direction lead)

LoGeR scales feedforward dense 3D reconstruction to extremely long videos. By processing video streams in chunks and bridging them with a novel hybrid memory module, LoGeR alleviates quadratic complexity bottlenecks. It combines Sliding Window Attention (SWA) for precise local alignment with Test-Time Training (TTT) for long-range global consistency, reducing drift over massive sequences up to 19,000 frames without any post-hoc optimization.

[Paper]      [ArXiv]      [Code]      [BibTeX]

Visual Results

Scaling to unprecedented horizons. Even without backend optimization, LoGeR maintains strong geometric coherence and reduces scale drift over kilometer-scale trajectories.

Visual gallery. Qualitative results on expansive in-the-wild and VBR sequences. Our fully feedforward approach accurately preserves large-scale structures and loop closures over thousands of frames.

Why Is Long-Context Reconstruction Hard?

Scaling feedforward 3D reconstruction to minutes-long videos is blocked by two fundamental barriers: an architectural "context wall" that restricts sequence length, and a training "data wall" that limits generalization to expansive environments.

Context Wall

While full bidirectional models (e.g., VGGT, π3) excel at local reasoning, their quadratic cost prohibits long-context scaling. Linear-memory alternatives (e.g., CUT3R, TTT3R) solve the computation bottleneck, but introduce lossy compression that degrades fine-grained geometric alignment.

Architectural trade-off. LoGeR bypasses this trade-off with a hybrid memory architecture that scales sub-quadratically while preserving high-fidelity local geometry (via SWA) and ensuring global structural consistency (via TTT).
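To make the cost gap concrete, here is a back-of-the-envelope comparison of attention-pair counts for full bidirectional attention versus chunked attention with a fixed boundary window. The chunk and window sizes are illustrative assumptions, not the paper's actual hyperparameters:

```python
def full_attention_cost(n):
    """Pairwise attention over the whole sequence: O(n^2) token pairs."""
    return n * n

def chunked_cost(n, chunk=64, window=8):
    """Dense attention inside each chunk plus a fixed-width sliding
    window across chunk boundaries: O(n * chunk) token pairs."""
    n_chunks = -(-n // chunk)  # ceil division
    return n_chunks * (chunk * chunk + window * chunk)

# At 19,000 frames, the quadratic cost is roughly 260x the chunked cost
# with these toy sizes, and the gap widens linearly with sequence length.
ratio = full_attention_cost(19_000) / chunked_cost(19_000)
```

The key property is that the chunked cost grows linearly in the number of frames for a fixed chunk size, which is what makes 19k-frame sequences feasible at all.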

Data Wall

Simply engineering efficient attention (e.g., FastVGGT) isn't enough. Models trained solely on short-context "bubbles" inevitably fail to generalize to expansive, large-scale scenes.

While efficient variants like FastVGGT alleviate memory bottlenecks, they still collapse on large-scale VBR trajectories.

How LoGeR Works

Scaling 3D reconstruction to minutes-long videos requires rethinking how we process and store geometric context. LoGeR introduces a chunk-based hybrid architecture that decouples short-range alignment from long-range global anchoring.

Method: Causal Chunk-wise Processing with Hybrid Memory Module

High-Level Abstraction. Instead of processing the entire video at once, LoGeR partitions the stream into manageable chunks. To maintain coherence across chunks, it employs a dual-pathway hybrid memory: Local Memory (SWA) ensures uncompressed, high-precision alignment between adjacent boundaries, while Global Memory (TTT) continually updates a compressed state to prevent scale drift over long sequences.
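The dual-pathway idea above can be sketched as a simple loop over chunks. This is a hypothetical toy with scalar "frames": the state sizes, mixing coefficients, and update rules are assumptions for illustration, not LoGeR's actual operators:

```python
def reconstruct(frames, chunk_size=32, window=4):
    """Toy sketch of chunked processing with dual memory pathways.

    Local memory keeps the last `window` frames uncompressed (SWA-style);
    global memory folds every chunk into one compressed scalar (TTT-style).
    """
    global_state = 0.0  # compressed global memory, updated every chunk
    local_tail = []     # uncompressed tail of the previous chunk
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        # Local pathway: the new chunk attends to the previous chunk's tail.
        tail_ctx = sum(local_tail) / len(local_tail) if local_tail else 0.0
        # Toy per-frame "reconstruction": frame + local + global context.
        outputs.extend(f + 0.1 * tail_ctx + global_state for f in chunk)
        # Global pathway: fold the chunk into the compressed state.
        global_state = 0.9 * global_state + 0.1 * (sum(chunk) / len(chunk))
        local_tail = chunk[-window:]
    return outputs
```

The point of the structure is that per-chunk work is constant-size: the local tail never grows past `window` frames, and the global state stays a fixed-size summary regardless of how many chunks have been seen.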

Detailed block structure of the hybrid memory module. Seamlessly integrating SWA and TTT to bridge consecutive video chunks.

Inside the Hybrid Memory Block

The internal data flow of a single residual block consists of four sequential operations:

1. Per-Frame Attention Extracts spatial features independently for each frame to establish the 2D visual foundation.

2. Sparse SWA (Local Memory) Establishes a lossless information path across adjacent chunks to preserve high-precision geometric alignment.

3. Chunk-Wise TTT (Global Memory) Integrates long-range context by maintaining fast weights via an efficient apply-then-update procedure.

4. Chunk-Wise Bi-Attention Performs powerful, dense geometric reasoning across all frames of the current chunk.
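The four stages above can be laid out as a single function. Every operator here is a toy numeric stand-in (features are plain floats); only the ordering, including the apply-then-update pattern of the TTT stage, mirrors the description in the text:

```python
def hybrid_memory_block(chunk, prev_tail, fast_weight):
    """Illustrative data flow of one hybrid memory residual block.

    `chunk` and `prev_tail` are lists of scalar "features"; `fast_weight`
    is a scalar stand-in for the TTT fast-weight state.
    """
    # 1. Per-frame attention: per-frame spatial features (toy: identity).
    x = list(chunk)
    # 2. Sparse SWA (local memory): mix in the previous chunk's tail.
    tail_ctx = sum(prev_tail) / len(prev_tail) if prev_tail else 0.0
    x = [v + 0.1 * tail_ctx for v in x]
    # 3. Chunk-wise TTT (global memory): apply the fast weight first...
    x = [v + fast_weight for v in x]
    # ...then update it from the current chunk (apply-then-update).
    fast_weight = 0.9 * fast_weight + 0.1 * (sum(x) / len(x))
    # 4. Chunk-wise bi-attention: dense mixing across the whole chunk
    #    (toy: pull each feature halfway toward the chunk mean).
    mean = sum(x) / len(x)
    x = [v + (mean - v) * 0.5 for v in x]
    return x, fast_weight
```

Note the ordering in stage 3: the chunk is processed with the *current* fast weights before those weights are updated, so each chunk sees a global summary of everything before it but never of itself.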

Results on Long Sequences

Strong performance on long horizons. On standard KITTI benchmarks, LoGeR reduces the average ATE to 18.65. On the 19k-frame VBR dataset, it delivers a 30.8% relative improvement over prior feedforward approaches.

KITTI. Both Pi3-Chunk and LoGeR strongly outperform prior feedforward baselines, and LoGeR* achieves the best average ATE at 18.65.

VBR quantitative. LoGeR increasingly pulls away as sequence length grows from 1k to 19k on extremely long trajectories.

VBR qualitative. LoGeR accurately preserves global scale and trajectory over very long sequences, closely matching ground truth where prior methods suffer from severe drift.

Results on Short Sequences

LoGeR also remains highly competitive on short-sequence benchmarks. It achieves state-of-the-art reconstruction and pose accuracy while running significantly faster than full-attention baselines like VGGT.

3D reconstruction of 7-Scenes (under TTT3R protocol). LoGeR and the Pi3-Chunk baseline both outperform prior work on 7-Scenes reconstruction, and LoGeR shows a 69.2% relative gain in our evaluation.

3D reconstruction of 7-Scenes (under VGG-T3 protocol). LoGeR maintains clear gains as we scale from 100 to 1k frames. At 1k frames, it reports 90.3% and 72.1% error reduction against TTT3R and VGG-T3, respectively.

Pose evaluation on ScanNet and TUM-Dynamics (under TTT3R protocol). LoGeR substantially outperforms prior work on ScanNet and TUM-Dynamics, achieving 80.0% and 66.1% relative gains in our evaluation.

BibTeX

@article{zhang2026loger,
  title={LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory},
  author={Zhang, Junyi and Herrmann, Charles and Hur, Junhwa and Sun, Chen and Yang, Ming-Hsuan and Cole, Forrester and Darrell, Trevor and Sun, Deqing},
  journal={arXiv preprint arXiv:2603.03269},
  year={2026}
}

Acknowledgements: We borrow this webpage template from SD+DINO, which is originally adapted from DreamBooth.
