3D Semantic Occupancy Prediction is fundamental to spatial understanding, as it provides comprehensive
semantic awareness of the surrounding environment. However, prevalent approaches primarily rely on extensive
labeled data and computationally intensive voxel-based modeling, restricting the scalability and
generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian
Transformer that leverages alignment with foundation models to advance self-supervised 3D spatial
understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that
represent scenes in a feed-forward manner. By aligning rendered Gaussian features with the diverse
knowledge of pre-trained foundation models, GaussTR facilitates the learning of versatile 3D
representations and enables open-vocabulary occupancy prediction without explicit annotations. Empirical
evaluations on the Occ3D-nuScenes dataset showcase GaussTR's state-of-the-art zero-shot performance,
achieving 11.70 mIoU while reducing training duration by approximately 50%. These experimental results
highlight the significant potential of GaussTR for scalable and holistic 3D spatial understanding, with
promising implications for autonomous driving and embodied agents.
@article{GaussTR,
  title   = {GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding},
  author  = {Haoyi Jiang and Liu Liu and Tianheng Cheng and Xinjie Wang and Tianwei Lin and Zhizhong Su and Wenyu Liu and Xinggang Wang},
  year    = {2024},
  journal = {arXiv preprint arXiv:2412.13193}
}