3D Semantic Occupancy Prediction is fundamental to spatial understanding, as it provides comprehensive
semantic awareness of the surrounding environment. However, prevalent approaches primarily rely on extensive
labeled data and computationally intensive voxel-based modeling, restricting the scalability and
generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian
Transformer that leverages alignment with foundation models to advance self-supervised 3D spatial
understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that
represent scenes in a feed-forward manner. By aligning rendered Gaussian features with the diverse
knowledge of pre-trained foundation models, GaussTR facilitates the learning of versatile 3D
representations and enables open-vocabulary occupancy prediction without explicit annotations. Empirical
evaluations on the Occ3D-nuScenes dataset showcase GaussTR's state-of-the-art zero-shot performance,
achieving 11.70 mIoU while reducing training duration by approximately 50%. These experimental results
highlight the significant potential of GaussTR for scalable and holistic 3D spatial understanding, with
promising implications for autonomous driving and embodied agents.
@article{GaussTR,
  title   = {GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding},
  author  = {Haoyi Jiang and Liu Liu and Tianheng Cheng and Xinjie Wang and Tianwei Lin and Zhizhong Su and Wenyu Liu and Xinggang Wang},
  year    = {2024},
  journal = {arXiv preprint arXiv:2412.13193}
}