Abstract

Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications.

To address these limitations, we propose 4DLangVGGT, the first Transformer-based, feed-forward, unified framework for 4D language grounding that jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity.

Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding.

Experiments on the HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with up to 2% gains under per-scene training and 1% improvements under multi-scene training.

Framework

The framework integrates a geometry encoder, a semantic bridging decoder, and a multi-objective training strategy to achieve language-aware 4D fields with geometric fidelity and semantic alignment.
Input video frames are processed by StreamVGGT to obtain geometry tokens. The Semantic Bridging Decoder predicts both RGB reconstructions and semantic embeddings, while the geometry decoder estimates depth maps and camera poses. Inverse projection lifts the depth maps into a 3D point cloud, onto which the predicted RGB and semantic embeddings are projected, yielding 3D frames and 3D semantic maps.
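
To make the lifting step concrete, below is a minimal per-frame sketch of this pipeline. The module names (`geometry_encoder`, `semantic_decoder`, `geometry_decoder`) and tensor shapes are illustrative assumptions, not the released implementation; only the inverse-projection arithmetic follows the standard pinhole-camera formulation.

```python
import torch

def unproject_depth(depth, K, cam_to_world):
    """Lift a predicted depth map into a world-space 3D point cloud.

    depth:        (H, W) per-pixel depth
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera pose
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()    # (H, W, 3) homogeneous pixels
    rays = pix @ torch.inverse(K).T                                  # back-project through intrinsics
    pts_cam = rays * depth.unsqueeze(-1)                             # scale rays by predicted depth
    pts_h = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1)        # homogeneous camera-space points
    return (pts_h.reshape(-1, 4) @ cam_to_world.T)[:, :3]            # (H*W, 3) world-space points

def forward_frame(frame, geometry_encoder, semantic_decoder, geometry_decoder):
    tokens = geometry_encoder(frame)             # StreamVGGT geometry tokens
    rgb, sem = semantic_decoder(tokens)          # SBD: RGB reconstruction + semantic embeddings
    depth, K, pose = geometry_decoder(tokens)    # depth map and camera pose
    points = unproject_depth(depth, K, pose)     # lift to a 3D point cloud
    return points, rgb.reshape(-1, 3), sem.reshape(-1, sem.shape[-1])
```

The returned per-point RGB and semantic embeddings are what the figure refers to as colorized 3D frames and 3D semantic maps.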

Results

We evaluate 4DLangVGGT under two training regimes to examine both cross-scene applicability and per-scene performance. The first regime trains a single model on multiple videos and applies this shared model for inference across different scenes (“multi-video single model”). The second regime adopts the per-scene protocol used in 4DLangSplat, i.e., training one model per scene. This per-scene setting is included to align with existing Gaussian splatting methods and to provide a fair comparison of our method’s performance.

Visualizations

We compare time-agnostic query masks. The results demonstrate that our method consistently extracts accurate object masks in both intact and fragmented cookie scenarios, whereas 4DLangSplat exhibits degraded performance when handling fragmented cases.
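
For reference, here is a hedged sketch of how an open-vocabulary query mask can be obtained from the predicted semantic field. The thresholded cosine-similarity rule and the `threshold` value are assumptions for illustration, not necessarily the exact relevancy scoring used in the paper.

```python
import torch
import torch.nn.functional as F

def query_mask(point_semantics, text_embedding, threshold=0.5):
    """Select 3D points whose semantic embedding matches a text query.

    point_semantics: (N, D) language-aligned embeddings predicted by the SBD
    text_embedding:  (D,)   embedding of the open-vocabulary text query
    """
    sims = F.cosine_similarity(point_semantics, text_embedding.unsqueeze(0), dim=-1)
    return sims > threshold    # boolean mask over the point cloud
```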

BibTeX

@inproceedings{4DLangVGGT,
  title     = {4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer},
  author    = {Xianfeng Wu and Yajing Bai and Minghan Li and Xianzu Wu and Xueqi Zhao and Zhongyuan Lai and Wenyu Liu and Xinggang Wang},
  year      = {2025},
  booktitle = {ArXiv}
}