- the first object-centric representation for 3D semantic occupancy prediction
- supervised
- comparable performance but drastically reduced memory consumption
- Grid-based methods inevitably suffer from the redundancy of empty grids, adding complexity to downstream tasks.
- It is also harder to capture scene dynamics with grid-based representations, since it is objects, not grids, that move through 3D space.
- Dense voxel representations neglect the varying complexity of different regions and process every 3D location with equal storage and computation, which often leads to intractable overhead from unreasonable resource allocation.
- Although planar representations are resource-friendly, they can lose fine-grained details. Grid-based methods can hardly adapt to the regions of interest of different scenes and thus incur representation and computation redundancy.
- Object-centric 3D representation for 3D semantic occupancy prediction: each unit describes a region of interest instead of a fixed grid cell.
- Construct semantic Gaussians from four properties: a mean, a scale, a rotation vector, and semantic logits.
- Generating semantics from a Gaussian:
$$g(p; m, s, r, c) = \exp\left(-\frac{1}{2}(p - m)^\top \Sigma^{-1} (p - m)\right) c$$
  where the covariance $\Sigma = R S S^\top R^\top$ is built from the scale matrix $S = \mathrm{diag}(s)$ and the rotation matrix $R$ constructed from $r$
- Generating occupancy from a set of Gaussians:
$$\hat{o}(p; \mathcal{G}) = \sum_{i} g_i(p; m_i, s_i, r_i, c_i)$$
$\mathcal{G}$ : a set of 3D Gaussians
$p$ : coordinates of a 3D point
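As a concrete illustration, here is a minimal PyTorch sketch of the two formulas above (not the paper's implementation); the tensor shapes, the (w, x, y, z) quaternion convention, and the function names are assumptions for this sketch:

```python
import torch
import torch.nn.functional as F

def quaternion_to_rotation(q: torch.Tensor) -> torch.Tensor:
    """Convert quaternions (N, 4), ordered (w, x, y, z), to rotation matrices (N, 3, 3)."""
    w, x, y, z = F.normalize(q, dim=-1).unbind(-1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)

def occupancy_from_gaussians(p, m, s, r, c):
    """
    p: (Q, 3) query points; m: (N, 3) means; s: (N, 3) scales;
    r: (N, 4) rotation quaternions; c: (N, C) semantic logits.
    Returns (Q, C): o_hat(p) = sum_i g_i(p; m_i, s_i, r_i, c_i).
    """
    R = quaternion_to_rotation(r)                             # (N, 3, 3)
    S = torch.diag_embed(s)                                   # (N, 3, 3)
    cov = R @ S @ S.transpose(-1, -2) @ R.transpose(-1, -2)   # Sigma = R S S^T R^T
    prec = torch.linalg.inv(cov)                              # Sigma^{-1}
    d = p[:, None, :] - m[None, :, :]                         # (Q, N, 3): p - m
    maha = torch.einsum('qni,nij,qnj->qn', d, prec, d)        # (p-m)^T Sigma^{-1} (p-m)
    w = torch.exp(-0.5 * maha)                                # (Q, N) Gaussian weights
    return w @ c                                              # (Q, C) summed semantic logits
```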
Iteratively refine the Gaussian properties across the B blocks of GaussianFormer. Each block consists of (see the sketch after this list):
- self-encoding module: enables interactions among the 3D Gaussians; implemented with sparse convolution
- image cross-attention module: aggregates visual information from the image features
- refinement module: rectifies the properties of the 3D Gaussians
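A rough PyTorch sketch of one such block, with hypothetical dimensions and standard multi-head attention substituted for the paper's sparse convolution and deformable attention; only the three-module structure (self-encode, cross-attend, refine) follows the description above:

```python
import torch
import torch.nn as nn

class GaussianFormerBlock(nn.Module):
    def __init__(self, dim: int = 128, prop_dim: int = 28):
        super().__init__()
        # Self-encoding: interactions among Gaussian queries
        # (the paper uses sparse 3D convolution; plain attention here for the sketch).
        self.self_enc = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Image cross-attention: aggregate visual information from image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Refinement head: residual updates to the Gaussian properties
        # (3 mean + 3 scale + 4 rotation + C logits; prop_dim=28 is illustrative).
        self.refine = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, prop_dim)
        )

    def forward(self, queries, props, img_feats):
        # queries:   (B, N, dim) Gaussian query features
        # props:     (B, N, prop_dim) current Gaussian properties
        # img_feats: (B, HW, dim) flattened image features
        q, _ = self.self_enc(queries, queries, queries)  # Gaussian-Gaussian interaction
        q, _ = self.cross_attn(q, img_feats, img_feats)  # aggregate 2D visual info
        props = props + self.refine(q)                   # rectify Gaussian properties
        return q, props
```

Stacking B of these blocks and decoding occupancy from the final Gaussian properties mirrors the iterative refinement described above.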
nuScenes validation set:
- IoU: 29.83
- mIoU: 19.10
SSCBench-KITTI360 validation set:
- IoU: 35.38
- mIoU: 12.92