Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction
- URL: http://arxiv.org/abs/2508.10936v1
- Date: Tue, 12 Aug 2025 19:50:34 GMT
- Title: Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction
- Authors: Cheng Chen, Hao Huang, Saurabh Bagchi
- Abstract summary: Collaborative perception enables connected vehicles to share information. Existing vision-only methods for 3D semantic occupancy prediction commonly rely on dense 3D voxels. We propose the first approach leveraging sparse 3D semantic Gaussian splatting for collaborative 3D semantic occupancy prediction.
- Score: 12.80732853899807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Collaborative perception enables connected vehicles to share information, overcoming occlusions and extending the limited sensing range inherent in single-agent (non-collaborative) systems. Existing vision-only methods for 3D semantic occupancy prediction commonly rely on dense 3D voxels, which incur high communication costs, or 2D planar features, which require accurate depth estimation or additional supervision, limiting their applicability to collaborative scenarios. To address these challenges, we propose the first approach leveraging sparse 3D semantic Gaussian splatting for collaborative 3D semantic occupancy prediction. By sharing and fusing intermediate Gaussian primitives, our method provides three benefits: a neighborhood-based cross-agent fusion that removes duplicates and suppresses noisy or inconsistent Gaussians; a joint encoding of geometry and semantics in each primitive, which reduces reliance on depth supervision and allows simple rigid alignment; and sparse, object-centric messages that preserve structural information while reducing communication volume. Extensive experiments demonstrate that our approach outperforms single-agent perception and baseline collaborative methods by +8.42 and +3.28 points in mIoU, and +5.11 and +22.41 points in IoU, respectively. When further reducing the number of transmitted Gaussians, our method still achieves a +1.9 improvement in mIoU using only 34.6% of the communication volume, highlighting robust performance under limited communication budgets.
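The neighborhood-based cross-agent fusion described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: Gaussians are reduced to their 3D centers, and the merge radius and the identity alignment are illustrative assumptions.

```python
import numpy as np

def rigid_align(means, R, t):
    """Transform a neighbor's Gaussian centers into the ego frame."""
    return means @ R.T + t

def fuse_gaussians(ego, neighbor, radius=0.5):
    """Neighborhood-based fusion sketch: a neighbor Gaussian that falls
    within `radius` of an ego Gaussian is treated as a duplicate and
    dropped; the rest are appended as new information."""
    kept = []
    for g in neighbor:
        d = np.linalg.norm(ego - g, axis=1)
        if d.min() > radius:  # no nearby ego primitive -> keep it
            kept.append(g)
    return np.vstack([ego, kept]) if kept else ego

# Toy example: two agents whose maps overlap on one Gaussian.
ego = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
nbr = np.array([[0.1, 0.0, 0.0], [10.0, 0.0, 0.0]])  # first is a duplicate
R, t = np.eye(3), np.zeros(3)  # already-aligned frames for simplicity
fused = fuse_gaussians(ego, rigid_align(nbr, R, t))
print(len(fused))  # prints 3: duplicate suppressed, new Gaussian kept
```

In a full system the same test would also compare covariances and semantic logits before discarding a primitive; distance between centers is only the simplest duplicate criterion.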
Related papers
- MCOP: Multi-UAV Collaborative Occupancy Prediction [40.58729551462363]
Current Bird's Eye View (BEV)-based approaches exhibit two main limitations. We propose a novel multi-UAV collaborative occupancy prediction framework. Our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods.
arXiv Detail & Related papers (2025-10-14T16:17:42Z) - Object Affordance Recognition and Grounding via Multi-scale Cross-modal Representation Learning [64.32618490065117]
A core problem of Embodied AI is to learn object manipulation from observation, as humans do. We propose a novel approach that learns an affordance-aware 3D representation and employs a stage-wise inference strategy. Experiments demonstrate the effectiveness of our method, showing improved performance in both affordance grounding and classification.
arXiv Detail & Related papers (2025-08-02T04:14:18Z) - A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X Autonomous Driving [3.6538681992157604]
3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving. We augment an existing collaborative perception dataset by replaying it in CARLA with a high-resolution semantic voxel sensor. We develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation.
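The attention-aggregation step of such a baseline can be sketched as follows. This is a hedged, minimal NumPy version: the ego feature queries each spatially aligned agent feature and a softmax over the scores weights the aggregated message. Shapes and the dot-product scoring are illustrative assumptions, not the benchmark's actual model.

```python
import numpy as np

def attention_aggregate(ego_feat, agent_feats):
    """Fuse spatially aligned per-agent features at one location:
    score each agent against the ego query, softmax-normalize,
    and return the weighted sum."""
    scores = agent_feats @ ego_feat / np.sqrt(ego_feat.size)
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w /= w.sum()
    return w @ agent_feats

ego = np.array([1.0, 0.0])
agents = np.array([[1.0, 0.0],   # agrees with ego -> higher weight
                   [0.0, 1.0]])  # orthogonal     -> lower weight
fused = attention_aggregate(ego, agents)
```

A real model would compute queries, keys, and values with learned projections per spatial cell; this shows only the aggregation arithmetic.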
arXiv Detail & Related papers (2025-06-20T13:58:10Z) - TACOcc:Target-Adaptive Cross-Modal Fusion with Volume Rendering for 3D Semantic Occupancy [14.075911467687789]
We propose a target-scale adaptive, symmetric retrieval mechanism for 3D semantic occupancy prediction. It expands the neighborhood for large targets to enhance context awareness and shrinks it for small ones to improve efficiency and suppress noise. In summary, we propose TACOcc, an adaptive multi-modal fusion framework for 3D semantic occupancy prediction, enhanced by volume rendering supervision.
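The expand-for-large, shrink-for-small neighborhood idea can be illustrated with a short sketch. This is not TACOcc's code; the scaling factor and clamp bounds are invented for illustration.

```python
import numpy as np

def adaptive_radius(target_extent, base=0.4, lo=0.2, hi=2.0):
    """Retrieval radius grows with the target's estimated extent,
    clamped so small targets stay tight (suppressing noise) and
    large targets never blow up the search window."""
    return float(np.clip(base * target_extent, lo, hi))

def retrieve_neighbors(points, center, target_extent):
    """Gather points within the target-adaptive radius of `center`."""
    r = adaptive_radius(target_extent)
    d = np.linalg.norm(points - center, axis=1)
    return points[d <= r]

pts = np.array([[0.1, 0, 0], [0.5, 0, 0], [1.5, 0, 0]], dtype=float)
small = retrieve_neighbors(pts, np.zeros(3), target_extent=0.5)  # r = 0.2
large = retrieve_neighbors(pts, np.zeros(3), target_extent=5.0)  # r = 2.0
print(len(small), len(large))  # prints 1 3
```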
arXiv Detail & Related papers (2025-05-19T04:32:36Z) - TGP: Two-modal occupancy prediction with 3D Gaussian and sparse points for 3D Environment Awareness [13.68631587423815]
3D semantic occupancy has rapidly become a research focus in the fields of robotics and autonomous driving environment perception. Existing occupancy prediction tasks are modeled using voxel or point cloud-based approaches. We propose a dual-modal prediction method based on 3D Gaussian sets and sparse points, which balances both spatial location and volumetric structural information.
arXiv Detail & Related papers (2025-03-13T01:35:04Z) - ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our architecture, named ALOcc, achieves an optimal trade-off between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles [13.167432547990487]
We introduce the first method for collaborative 3D semantic occupancy prediction.
It improves local 3D semantic occupancy predictions by hybrid fusion of semantic and occupancy task features.
Our models, anchored on semantic occupancy, outpace state-of-the-art collaborative 3D detection techniques in subsequent perception applications.
arXiv Detail & Related papers (2024-02-12T13:19:08Z) - 2D Feature Distillation for Weakly- and Semi-Supervised 3D Semantic Segmentation [92.17700318483745]
We propose an image-guidance network (IGNet) which builds upon the idea of distilling high level feature information from a domain adapted synthetically trained 2D semantic segmentation network.
IGNet achieves state-of-the-art results for weakly-supervised LiDAR semantic segmentation on ScribbleKITTI, boasting up to 98% relative performance to fully supervised training with only 8% labeled points.
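The distillation idea behind IGNet, pulling 3D point features toward 2D teacher features back-projected onto them, can be sketched with a simple cosine objective. This is an assumed formulation for illustration, not IGNet's actual loss.

```python
import numpy as np

def distill_loss(student3d, teacher2d, valid):
    """Cosine distillation sketch: 1 - cos(student, teacher) averaged
    over points with a valid image correspondence (`valid` mask)."""
    s = student3d / (np.linalg.norm(student3d, axis=1, keepdims=True) + 1e-8)
    t = teacher2d / (np.linalg.norm(teacher2d, axis=1, keepdims=True) + 1e-8)
    cos = (s * t).sum(axis=1)
    return float((1.0 - cos)[valid].mean())

feat = np.random.default_rng(0).normal(size=(16, 8))
mask = np.ones(16, dtype=bool)
print(distill_loss(feat, feat, mask))  # ~0 when teacher == student
```

Minimizing this pushes the 3D student's per-point features into alignment with the 2D network's, which is what lets sparse scribble labels go further.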
arXiv Detail & Related papers (2023-11-27T07:57:29Z) - Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed as Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves the state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z) - Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation.
We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds.
We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various levels of data integrity.
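A similarity-gated late fusion of this kind can be sketched as follows. This is a minimal stand-in, not SAFNet itself: the sigmoid-of-dot-product gate is an illustrative choice for turning 2D/3D feature similarity into per-point fusion weights.

```python
import numpy as np

def similarity_weights(f2d, f3d):
    """Per-point gate from the similarity of the back-projected 2D
    feature and the 3D point feature (sigmoid of their dot product)."""
    s = (f2d * f3d).sum(axis=1)
    return 1.0 / (1.0 + np.exp(-s))

def late_fuse(f2d, f3d):
    """Similarity-aware late fusion: lean on the image branch where the
    modalities agree, fall back toward geometry where they disagree."""
    w = similarity_weights(f2d, f3d)[:, None]
    return w * f2d + (1.0 - w) * f3d

f2d = np.array([[1.0, 0.0], [0.0, 1.0]])
f3d = np.array([[1.0, 0.0], [0.0, -1.0]])  # second point: modalities conflict
fused = late_fuse(f2d, f3d)
```

Because the gate is per point, corrupted or missing pixels only down-weight the 2D branch locally, which matches the robustness-under-degraded-data claim.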
arXiv Detail & Related papers (2021-07-04T09:28:18Z) - Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3D parameter changed in each step.
This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
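The one-parameter-per-step refinement loop can be illustrated with a greedy stand-in for the learned policy. This sketch replaces the RL policy with "adjust the axis with the largest residual"; the step size and stopping rule are invented for illustration.

```python
import numpy as np

def axial_refine(box, target, step=0.1, iters=300):
    """Greedy proxy for the refinement policy: each step changes the
    single box parameter with the largest residual, by at most `step`,
    mirroring the one-3D-parameter-per-step scheme (which the paper
    instead optimizes with reinforcement learning)."""
    box = np.asarray(box, dtype=float).copy()
    for _ in range(iters):
        err = target - box
        i = int(np.argmax(np.abs(err)))
        if abs(err[i]) < 1e-9:
            break  # converged on every axis
        box[i] += np.clip(err[i], -step, step)
    return box

init = np.zeros(3)
gt = np.array([1.0, -0.5, 0.3])
refined = axial_refine(init, gt)  # walks to gt one axis at a time
```

The RL formulation exists precisely because the true residual is unknown at test time; the agent must learn which axis to adjust from image evidence, earning a reward only after several steps.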
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.