GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

Haoyi Jiang1 Liu Liu2 Tianheng Cheng1 Xinjie Wang2 Tianwei Lin2
Zhizhong Su2 Wenyu Liu1 Xinggang Wang1
1Huazhong University of Science & Technology,  2Horizon Robotics

Abstract

3D Semantic Occupancy Prediction is fundamental for spatial understanding as it provides a comprehensive semantic cognition of surrounding environments. However, prevalent approaches primarily rely on extensive labeled data and computationally intensive voxel-based modeling, restricting the scalability and generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian Transformer that leverages alignment with foundation models to advance self-supervised 3D spatial understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that represent scenes in a feed-forward manner. By aligning rendered Gaussian features with diverse knowledge from pre-trained foundation models, GaussTR facilitates the learning of versatile 3D representations and enables open-vocabulary occupancy prediction without explicit annotations. Empirical evaluations on the Occ3D-nuScenes dataset showcase GaussTR's state-of-the-art zero-shot performance, achieving 11.70 mIoU while reducing training duration by approximately 50%. These experimental results highlight the significant potential of GaussTR for scalable and holistic 3D spatial understanding, with promising implications for autonomous driving and embodied agents.


Framework

The GaussTR framework begins by extracting image features and estimating depth with foundation models such as CLIP and Metric3D. GaussTR then predicts a sparse set of Gaussian queries that represent the scene through a series of Transformer layers. During training, the predicted Gaussians are splatted onto the source views and aligned with the original 2D features for supervision. At inference, the Gaussians are converted into semantic logits by measuring their similarity with category text embeddings and then voxelized to produce the final volumetric prediction.
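
To make this pipeline concrete, the following is a minimal PyTorch-style sketch of the feed-forward Gaussian prediction and the open-vocabulary scoring step. All names (GaussTRSketch, open_vocab_logits) and hyperparameters are illustrative assumptions, and the frozen CLIP and Metric3D backbones are replaced by random stand-in tensors; this is a sketch of the general idea, not the official implementation.

# Minimal, hypothetical sketch of the GaussTR pipeline (not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussTRSketch(nn.Module):
    def __init__(self, num_queries=300, feat_dim=512, num_layers=3):
        super().__init__()
        # Learnable sparse Gaussian queries: one query decodes into one 3D Gaussian.
        self.queries = nn.Embedding(num_queries, feat_dim)
        layer = nn.TransformerDecoderLayer(feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Heads predicting Gaussian parameters and a CLIP-aligned feature per query.
        self.mean_head = nn.Linear(feat_dim, 3)      # xyz center
        self.scale_head = nn.Linear(feat_dim, 3)     # per-axis scale
        self.opacity_head = nn.Linear(feat_dim, 1)
        self.feat_head = nn.Linear(feat_dim, feat_dim)

    def forward(self, image_feats):
        # image_feats: (B, N_tokens, feat_dim) 2D tokens from a frozen foundation
        # model (e.g. CLIP patch features), used as decoder memory.
        B = image_feats.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(q, image_feats)
        return {
            "means": self.mean_head(q),                   # (B, Q, 3)
            "scales": self.scale_head(q).exp(),           # (B, Q, 3)
            "opacities": self.opacity_head(q).sigmoid(),  # (B, Q, 1)
            "features": F.normalize(self.feat_head(q), dim=-1),
        }

def open_vocab_logits(gaussian_feats, text_embeds, temperature=0.07):
    # Per-Gaussian semantic logits via similarity with category text embeddings,
    # enabling open-vocabulary prediction; voxelization would follow at inference.
    text_embeds = F.normalize(text_embeds, dim=-1)
    return gaussian_feats @ text_embeds.T / temperature

# Toy usage with random tensors standing in for CLIP image/text features.
model = GaussTRSketch()
image_feats = torch.randn(1, 1024, 512)
text_embeds = torch.randn(18, 512)       # one embedding per category name
gaussians = model(image_feats)
logits = open_vocab_logits(gaussians["features"], text_embeds)   # (1, 300, 18)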

Results

GaussTR demonstrates significant improvements over existing methods, outperforming prior methods by 1.76 mIoU while reducing training time by 50%. In addition, our approach enables zero-shot inference without requiring explicit label assignments during training, distinguishing it from methods that rely on pre-generated segmentation labels for predefined categories. These findings underscore the efficacy of the proposed Gaussian-based sparse scene modeling and knowledge alignment with foundation models, which together foster generalizable 3D representations while reducing computational demands compared to previous methods.

Furthermore, GaussTR particularly excels with object-centric classes, e.g., cars, trucks, manmade structures, and vegetation. Notably, introducing auxiliary segmentation supervision as an augmentation further improves performance by 0.94 mIoU, especially for small object categories such as motorcycles and pedestrians, suggesting that detailed segmentation helps compensate for the finer granularity that CLIP features alone may lack.
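
As a rough illustration of how such auxiliary supervision could be combined with feature alignment, the sketch below adds an optional per-pixel cross-entropy term on top of a cosine alignment loss. The function names, tensor shapes, and loss weighting are assumptions for illustration, not the paper's exact training recipe.

# Illustrative combination of feature alignment with an optional auxiliary
# segmentation loss (shapes and weighting are assumptions, not the paper's recipe).
import torch
import torch.nn.functional as F

def alignment_loss(rendered_feats, target_feats):
    # Per-pixel cosine alignment between features rendered from the Gaussians
    # and frozen 2D foundation-model features; both are (B, C, H, W).
    rendered = F.normalize(rendered_feats, dim=1)
    target = F.normalize(target_feats, dim=1)
    return (1 - (rendered * target).sum(dim=1)).mean()

def total_loss(rendered_feats, target_feats, seg_logits=None, seg_labels=None,
               seg_weight=1.0):
    loss = alignment_loss(rendered_feats, target_feats)
    if seg_logits is not None:
        # Optional auxiliary supervision from pre-generated 2D segmentation maps,
        # e.g. produced by an open-vocabulary segmenter.
        loss = loss + seg_weight * F.cross_entropy(seg_logits, seg_labels)
    return loss

# Toy shapes: (B, C, H, W) features, (B, K, H, W) logits, (B, H, W) labels.
rendered = torch.randn(2, 512, 56, 100)
target = torch.randn(2, 512, 56, 100)
seg_logits = torch.randn(2, 18, 56, 100)
seg_labels = torch.randint(0, 18, (2, 56, 100))
print(total_loss(rendered, target, seg_logits, seg_labels).item())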

Visualizations

We further present visualizations of the rendered depth and semantic maps of the occupancy predictions in camera views. To improve interpretability, we apply color perturbations to the semantic maps, highlighting the distribution of individual Gaussians and revealing how they collectively reconstruct the scene layout. Additionally, GaussTR exhibits impressive generalization to novel and scarce categories, such as traffic lights and street signs. Owing to its vision-language alignment, GaussTR adapts seamlessly to these categories, producing prominent activations in the corresponding regions and further highlighting its versatility.
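
The following sketch illustrates how a rendered, CLIP-aligned feature map could be probed with an arbitrary prompt such as "traffic light" to produce an activation heatmap. The text embedding is a random stand-in for an actual CLIP text encoding, so only the scoring logic is shown.

# Hypothetical open-vocabulary probing of a rendered feature map with a text prompt.
import torch
import torch.nn.functional as F

def activation_map(rendered_feats, text_embed):
    # rendered_feats: (C, H, W) CLIP-aligned features rendered from the Gaussians.
    # text_embed: (C,) embedding of the query prompt, e.g. "traffic light".
    feats = F.normalize(rendered_feats, dim=0)
    text = F.normalize(text_embed, dim=0)
    return torch.einsum("chw,c->hw", feats, text)   # cosine-similarity heatmap

rendered_feats = torch.randn(512, 56, 100)   # stand-in for rendered features
prompt_embed = torch.randn(512)              # stand-in for CLIP("a traffic light")
heatmap = activation_map(rendered_feats, prompt_embed)   # (56, 100)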

BibTeX

@article{GaussTR,
  title = {GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding}, 
  author = {Haoyi Jiang and Liu Liu and Tianheng Cheng and Xinjie Wang and Tianwei Lin and Zhizhong Su and Wenyu Liu and Xinggang Wang},
  year = 2024,
  journal = {arXiv preprint arXiv:2412.13193}
}