Abstract

Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications.

To address these limitations, we propose 4DLangVGGT, the first Transformer-based, feed-forward, unified framework for 4D language grounding that jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity.

Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding.

Experiments on the HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with up to 2% gains under per-scene training and 1% improvements under multi-scene training.

Framework

The framework integrates a geometry encoder, a semantic bridging decoder, and a multi-objective training strategy to achieve language-aware 4D fields with geometric fidelity and semantic alignment.
Input video frames are processed by StreamVGGT to obtain geometry tokens. The Semantic Bridging Decoder predicts both RGB reconstructions and semantic embeddings, while the geometry decoder estimates depth maps and camera poses. Inverse projection lifts the depth maps into a 3D point cloud, onto which the predicted RGB and semantic embeddings are projected, yielding 3D frames and 3D semantic maps.
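
To make the lifting step concrete, below is a minimal per-frame sketch of this pipeline. The module names (`geometry_encoder`, `semantic_decoder`, `geometry_decoder`) and tensor shapes are illustrative assumptions, not the released implementation; only the inverse-projection arithmetic follows the standard pinhole-camera formulation.

```python
import torch

def unproject_depth(depth, K, cam_to_world):
    """Lift a predicted depth map into a world-space 3D point cloud.

    depth:        (H, W) per-pixel depth
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera pose
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()    # (H, W, 3) homogeneous pixels
    rays = pix @ torch.inverse(K).T                                  # back-project through intrinsics
    pts_cam = rays * depth.unsqueeze(-1)                             # scale rays by predicted depth
    pts_h = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1)        # homogeneous camera-space points
    return (pts_h.reshape(-1, 4) @ cam_to_world.T)[:, :3]            # (H*W, 3) world-space points

def forward_frame(frame, geometry_encoder, semantic_decoder, geometry_decoder):
    tokens = geometry_encoder(frame)             # StreamVGGT geometry tokens
    rgb, sem = semantic_decoder(tokens)          # SBD: RGB reconstruction + semantic embeddings
    depth, K, pose = geometry_decoder(tokens)    # depth map and camera pose
    points = unproject_depth(depth, K, pose)     # lift to a 3D point cloud
    return points, rgb.reshape(-1, 3), sem.reshape(-1, sem.shape[-1])
```

The returned per-point RGB and semantic embeddings are what the figure refers to as colorized 3D frames and 3D semantic maps.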

Results

We evaluate 4DLangVGGT under two training regimes to examine both cross-scene applicability and per-scene performance. The first regime trains a single model on multiple videos and applies this shared model for inference across different scenes (“multi-video single model”). The second regime adopts the per-scene protocol used in 4DLangSplat, i.e., training one model per scene. This per-scene setting is included to align with existing Gaussian splatting methods and to provide a fair comparison of our method’s performance.

Visualizations

We compare time-agnostic query masks. The results demonstrate that our method consistently extracts accurate object masks in both intact and fragmented cookie scenarios, whereas 4DLangSplat exhibits degraded performance when handling fragmented cases.
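
For reference, here is a hedged sketch of how an open-vocabulary query mask can be obtained from the predicted semantic field. The thresholded cosine-similarity rule and the `threshold` value are assumptions for illustration, not necessarily the exact relevancy scoring used in the paper.

```python
import torch
import torch.nn.functional as F

def query_mask(point_semantics, text_embedding, threshold=0.5):
    """Select 3D points whose semantic embedding matches a text query.

    point_semantics: (N, D) language-aligned embeddings predicted by the SBD
    text_embedding:  (D,)   embedding of the open-vocabulary text query
    """
    sims = F.cosine_similarity(point_semantics, text_embedding.unsqueeze(0), dim=-1)
    return sims > threshold    # boolean mask over the point cloud
```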

BibTeX

@inproceedings{4DLangVGGT,
  title     = {4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer},
  author    = {Xianfeng Wu and Yajing Bai and Minghan Li and Xianzu Wu and Xueqi Zhao and Zhongyuan Lai and Wenyu Liu and Xinggang Wang},
  year      = {2025},
  booktitle = {ArXiv}
}