3D Semantic Occupancy Prediction is fundamental for spatial understanding, yet existing approaches face
challenges in scalability and generalization due to their reliance on extensive labeled data and
computationally intensive voxel-wise representations.
In this paper, we introduce GaussTR, a novel Gaussian-based Transformer framework that unifies sparse 3D
modeling with foundation model alignment to advance 3D spatial understanding. GaussTR predicts sparse
sets of Gaussians in a feed-forward manner to represent 3D scenes.
By splatting the Gaussians into 2D views and aligning the rendered features with foundation models,
GaussTR facilitates self-supervised 3D representation learning and enables open-vocabulary semantic
occupancy prediction without requiring explicit annotations.
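To make the pipeline concrete, below is a minimal PyTorch sketch of one self-supervised training step. All module and tensor names (`GaussianHead`, `splat_features`, `teacher`) are illustrative assumptions, not the released code, and the actual method uses a proper differentiable Gaussian splatting renderer; it is approximated here by a crude nearest-pixel projection for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Predicts a sparse set of 3D Gaussians from image tokens in one forward pass."""
    def __init__(self, dim=256, num_gaussians=512, feat_dim=512):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_gaussians, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        # Per-Gaussian parameters: mean (3) + scale (3) + rotation quaternion (4) + opacity (1).
        self.to_params = nn.Linear(dim, 11)
        self.to_feat = nn.Linear(dim, feat_dim)  # feature vector carried by each Gaussian

    def forward(self, img_tokens):  # img_tokens: (B, N, dim)
        q = self.queries.expand(img_tokens.size(0), -1, -1)
        x = self.decoder(q, img_tokens)
        mean, scale, rot, opacity = self.to_params(x).split([3, 3, 4, 1], dim=-1)
        return mean, scale.exp(), F.normalize(rot, dim=-1), opacity.sigmoid(), self.to_feat(x)

def splat_features(mean, feat, opacity, cam, H=32, W=32):
    """Crude stand-in for differentiable Gaussian splatting: project each Gaussian
    center with a 3x4 camera matrix and accumulate its opacity-weighted feature
    at the nearest pixel (real splatting rasterizes the full 2D covariance)."""
    B, G, C = feat.shape
    pts = torch.einsum('ij,bgj->bgi', cam, F.pad(mean, (0, 1), value=1.0))  # (B, G, 3)
    uv = pts[..., :2] / pts[..., 2:].clamp(min=1e-3)  # assumes normalized image coords
    u = (uv[..., 0].clamp(0, 1) * (W - 1)).long()
    v = (uv[..., 1].clamp(0, 1) * (H - 1)).long()
    idx = (v * W + u).unsqueeze(-1).expand(-1, -1, C)
    canvas = feat.new_zeros(B, H * W, C)
    canvas.scatter_add_(1, idx, feat * opacity)
    return canvas  # (B, H*W, C) rendered feature map

# One training step: align rendered features with a frozen foundation-model teacher.
head = GaussianHead()
img_tokens = torch.randn(2, 100, 256)                        # backbone tokens (placeholder)
teacher = F.normalize(torch.randn(2, 32 * 32, 512), dim=-1)  # frozen teacher features
cam = torch.eye(3, 4)                                        # placeholder projection matrix

mean, scale, rot, opacity, g_feat = head(img_tokens)
rendered = splat_features(mean, g_feat, opacity, cam)
loss = (1 - F.cosine_similarity(rendered, teacher, dim=-1)).mean()
loss.backward()
```

The teacher provides the only supervision signal: no voxel labels appear anywhere in the loss.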
Experiments on the Occ3D-nuScenes dataset demonstrate GaussTR's state-of-the-art zero-shot
performance of 12.27 mIoU, along with a 40% reduction in training time. These results highlight the
efficacy of GaussTR for scalable and holistic 3D spatial understanding, with promising implications for
autonomous driving and embodied agents.
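For intuition about the zero-shot readout behind these numbers, here is a hedged sketch: per-Gaussian features are matched against text embeddings of class names, so no occupancy annotations enter the loop. The class list, embedding dimension, and random tensors below are illustrative placeholders standing in for CLIP text features and the model's predicted features.

```python
import torch
import torch.nn.functional as F

# Hypothetical class vocabulary; text_emb stands in for CLIP text embeddings
# of prompts such as "a photo of a {class}".
classes = ["car", "pedestrian", "drivable surface", "vegetation", "building"]
text_emb = F.normalize(torch.randn(len(classes), 512), dim=-1)

g_feat = F.normalize(torch.randn(512, 512), dim=-1)  # features of 512 predicted Gaussians

logits = g_feat @ text_emb.t()   # cosine similarity between Gaussians and class prompts
labels = logits.argmax(dim=-1)   # open-vocabulary class per Gaussian, no 3D labels used
```

The labeled Gaussians can then be voxelized into an occupancy grid for evaluation on benchmarks such as Occ3D-nuScenes.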
@inproceedings{GaussTR,
  title     = {GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding},
  author    = {Haoyi Jiang and Liu Liu and Tianheng Cheng and Xinjie Wang and Tianwei Lin and Zhizhong Su and Wenyu Liu and Xinggang Wang},
  year      = {2025},
  booktitle = {CVPR}
}