Related papers: FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model

URL: http://arxiv.org/abs/2507.02250v1
Date: Thu, 03 Jul 2025 02:58:39 GMT
Title: FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model
Authors: Jiangxia Chen, Tongyuan Huang, Ke Song
Abstract summary: This paper proposes FMOcc, a Tri-perspective View (TPV) refinement occupancy network with a flow matching selective state space model for few-frame 3D occupancy prediction. With two-frame input, FMOcc achieves 43.1% RayIoU and 39.8% mIoU on Occ3D-nuScenes validation, and 42.6% RayIoU on OpenOcc, with 5.4 G inference memory and 330 ms inference time.
Score: 1.3220884102442592
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D semantic occupancy prediction plays a pivotal role in autonomous driving. However, inherent limitations of few-frame images and redundancy in 3D space compromise prediction accuracy for occluded and distant scenes. Existing methods enhance performance by fusing historical frame data, which requires additional data and significant computational resources. To address these issues, this paper proposes FMOcc, a Tri-perspective View (TPV) refinement occupancy network with a flow matching selective state space model for few-frame 3D occupancy prediction. First, to generate missing features, we design a feature refinement module based on a flow matching model, called the Flow Matching SSM module (FMSSM). Furthermore, by designing the TPV SSM layer and Plane Selective SSM (PS3M), we selectively filter TPV features to reduce the impact of air voxels on non-air voxels, thereby enhancing the overall efficiency of the model and its prediction capability for distant scenes. Finally, we design the Mask Training (MT) method to enhance the robustness of FMOcc and address the issue of sensor data loss. Experimental results on the Occ3D-nuScenes and OpenOcc datasets show that FMOcc outperforms existing state-of-the-art methods. With two-frame input, FMOcc achieves 43.1% RayIoU and 39.8% mIoU on Occ3D-nuScenes validation, and 42.6% RayIoU on OpenOcc, with 5.4 G inference memory and 330 ms inference time.
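For readers unfamiliar with the mIoU figure quoted in the abstract, the following is a minimal sketch of how a mean intersection-over-union score can be computed from per-voxel class labels. It is an illustration only, not the paper's evaluation code: the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made here, and benchmark implementations typically accumulate counts over the whole dataset and apply visibility masks.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes, given flat lists of integer voxel labels.

    Hypothetical helper for illustration; not taken from FMOcc or Occ3D.
    """
    inter = [0] * num_classes  # voxels where prediction and ground truth agree on class c
    union = [0] * num_classes  # voxels labeled c in prediction or ground truth
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            union[p] += 1
            union[t] += 1
    # Assumption: classes that never appear are excluded from the average.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with two classes (0 = free space, 1 = occupied).
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(f"{score:.4f}")
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.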

Related papers

DIMM: Decoupled Multi-hierarchy Kalman Filter for 3D Object Tracking [50.038098341549095]
State estimation is challenging for 3D object tracking with high maneuverability. We propose a novel framework, DIMM, to effectively combine estimates from different motion models in each direction. DIMM significantly improves the tracking accuracy of existing state estimation methods by 31.61% to 99.23%.
arXiv Detail & Related papers (2025-05-18T10:12:41Z)
MR-Occ: Efficient Camera-LiDAR 3D Semantic Occupancy Prediction Using Hierarchical Multi-Resolution Voxel Representation [8.113965240054506]
We propose MR-Occ, a novel approach for camera-LiDAR fusion-based 3D semantic occupancy prediction. HVFR improves performance by enhancing features for critical voxels, reducing computational cost. MOD introduces an 'occluded' class to better handle regions obscured from sensor view, improving accuracy. PVF-Net leverages densified LiDAR features to effectively fuse camera and LiDAR data through a deformable attention mechanism.
arXiv Detail & Related papers (2024-12-29T14:39:21Z)
ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks. We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation. Our purely convolutional architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
OccLoff: Learning Optimized Feature Fusion for 3D Occupancy Prediction [5.285847977231642]
3D semantic occupancy prediction is crucial for ensuring safety in autonomous driving. Existing fusion-based occupancy methods typically involve performing a 2D-to-3D view transformation on image features. We propose OccLoff, a framework that Learns to optimize Feature Fusion for 3D occupancy prediction.
arXiv Detail & Related papers (2024-11-06T06:34:27Z)
OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries. OPUS incorporates a suite of non-trivial strategies to enhance model performance. Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z)
Let Occ Flow: Self-Supervised 3D Occupancy Flow Prediction [14.866463843514156]
Let Occ Flow is the first self-supervised work for joint 3D occupancy and occupancy flow prediction using only camera inputs. Our approach incorporates a novel attention-based temporal fusion module to capture dynamic object dependencies. Our method extends differentiable rendering to 3D volumetric flow fields.
arXiv Detail & Related papers (2024-07-10T12:20:11Z)
SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation [74.07836010698801]
We propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods.
arXiv Detail & Related papers (2024-04-23T17:59:59Z)
Object Detection in Thermal Images Using Deep Learning for Unmanned Aerial Vehicles [0.9208007322096533]
This work presents a neural network model capable of recognizing small and tiny objects in thermal images collected by unmanned aerial vehicles. The backbone is developed based on the structure of YOLOv5 combined with the use of a transformer encoder at the end. The neck includes a BI-FPN block combined with the use of a sliding window and a transformer to increase the information fed into the prediction head.
arXiv Detail & Related papers (2024-02-13T06:40:55Z)
DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Diffusion Model [20.15214479105187]
We propose a novel uncertainty-aware scene flow estimation network (DifFlow3D) with the diffusion probabilistic model. Our method achieves an unprecedented millimeter-level accuracy (0.0078m in EPE3D) on the KITTI dataset.
arXiv Detail & Related papers (2023-11-29T08:56:24Z)
EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation for many applications, such as autonomous driving and robotics. In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF). We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling [65.47126868838836]
We propose a novel 3D object detection framework with dynamic information modeling. Coarse predictions are generated in the first stage via a voxel-based region proposal network. Experiments are conducted on the large-scale nuScenes 3D detection benchmark.
arXiv Detail & Related papers (2020-07-16T18:27:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.