GTAD: Global Temporal Aggregation Denoising Learning for 3D Semantic Occupancy Prediction
- URL: http://arxiv.org/abs/2507.20963v1
- Date: Mon, 28 Jul 2025 16:18:29 GMT
- Title: GTAD: Global Temporal Aggregation Denoising Learning for 3D Semantic Occupancy Prediction
- Authors: Tianhao Li, Yang Li, Mengtian Li, Yisheng Deng, Weifeng Ge
- Abstract summary: We propose a global temporal aggregation denoising network named GTAD for holistic 3D scene understanding. Our method employs an in-model latent denoising network to aggregate local temporal features from the current moment and global temporal features from historical sequences.
- Score: 14.549066678968368
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurately perceiving dynamic environments is a fundamental task for autonomous driving and robotic systems. Existing methods inadequately utilize temporal information, relying mainly on local temporal interactions between adjacent frames and failing to leverage global sequence information effectively. To address this limitation, we investigate how to effectively aggregate global temporal features from temporal sequences, aiming to achieve occupancy representations that efficiently utilize global temporal information from historical observations. For this purpose, we propose a global temporal aggregation denoising network named GTAD, introducing a global temporal information aggregation framework as a new paradigm for holistic 3D scene understanding. Our method employs an in-model latent denoising network to aggregate local temporal features from the current moment and global temporal features from historical sequences. This approach enables the effective perception of both fine-grained temporal information from adjacent frames and global temporal patterns from historical observations. As a result, it provides a more coherent and comprehensive understanding of the environment. Extensive experiments on the nuScenes and Occ3D-nuScenes benchmarks, together with ablation studies, demonstrate the superiority of our method.
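The abstract describes the mechanism only at a high level, so here is a minimal PyTorch sketch of the aggregation idea, assuming hypothetical module names, feature shapes, and a cross-attention-plus-residual-MLP fusion scheme; it is not the authors' implementation. Current-frame occupancy features act as queries over a flattened historical sequence (the global temporal context), and a small residual MLP stands in for the in-model latent denoising step.

```python
import torch
import torch.nn as nn

class LatentDenoisingAggregator(nn.Module):
    """Toy stand-in for GTAD-style global temporal aggregation (assumed design)."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-attention: current-frame queries attend over historical features.
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Residual MLP acting as a latent "denoising" refinement of the fused features.
        self.denoise = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_feat: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # local_feat: (B, N, C) features from the current frame
        # history:    (B, T*N, C) features flattened over T past frames
        global_ctx, _ = self.global_attn(local_feat, history, history)
        fused = self.norm(local_feat + global_ctx)  # local + global fusion
        return fused + self.denoise(fused)          # residual denoising step

# Toy usage: 2 scenes, 100 occupancy queries, 4 historical frames, 256-d features.
agg = LatentDenoisingAggregator()
out = agg(torch.randn(2, 100, 256), torch.randn(2, 4 * 100, 256))
print(out.shape)  # torch.Size([2, 100, 256])
```

Cross-attention is used here because it lets a fixed-size set of current-frame queries consume an arbitrarily long history, matching the abstract's emphasis on global sequence information; the actual GTAD fusion and denoising operators may differ.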
Related papers
- Multivariate Long-term Time Series Forecasting with Fourier Neural Filter [55.09326865401653]
We introduce the Fourier Neural Filter (FNF) as the backbone and DBD as the architecture, providing strong learning capability and effective learning pathways for spatial-temporal modeling. We show that FNF unifies local time-domain and global frequency-domain information processing within a single backbone that extends naturally to spatial modeling.
arXiv Detail & Related papers (2025-06-10T18:40:20Z) - Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction [62.69089767730514]
We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). It opens up underexplored aspects of temporal fusion within the VisionOcc framework, focusing on both temporal cues and fusion strategies.
arXiv Detail & Related papers (2025-04-17T14:05:33Z) - Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction [12.064509280163502]
3D occupancy prediction has emerged as a key perception task for autonomous driving. Recent studies focus on integrating information obtained from past observations to improve prediction accuracy. We propose StreamOcc, a framework that aggregates spatio-temporal information in a stream-based manner. Experiments on the Occ3D-nuScenes dataset show that StreamOcc achieves state-of-the-art performance in real-time settings, while reducing memory usage by more than 50% compared to previous methods.
arXiv Detail & Related papers (2025-03-28T02:05:53Z) - Dual Frequency Branch Framework with Reconstructed Sliding Windows Attention for AI-Generated Image Detection [12.523297358258345]
Generative Adversarial Networks (GANs) and diffusion models have enabled the creation of highly realistic synthetic images. Detecting AI-generated images has therefore emerged as a critical challenge.
arXiv Detail & Related papers (2025-01-25T15:53:57Z) - Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z) - A Generic Approach to Integrating Time into Spatial-Temporal Forecasting via Conditional Neural Fields [1.7661845949769062]
This paper presents a general approach to integrating the time component into forecasting models.
The main idea is to employ conditional neural fields to represent the auxiliary features extracted from the time component.
Experiments on road traffic and cellular network traffic datasets prove the effectiveness of the proposed approach.
arXiv Detail & Related papers (2023-05-11T14:20:23Z) - Self-Supervised Temporal Graph learning with Temporal and Structural Intensity Alignment [53.72873672076391]
Temporal graph learning aims to generate high-quality representations for graph-based tasks with dynamic information.
We propose a self-supervised method called S2T for temporal graph learning, which extracts both temporal and structural information.
S2T achieves up to 10.13% performance improvement over state-of-the-art competitors on several datasets.
arXiv Detail & Related papers (2023-02-15T06:36:04Z) - STJLA: A Multi-Context Aware Spatio-Temporal Joint Linear Attention Network for Traffic Forecasting [7.232141271583618]
We propose a novel deep learning model for traffic forecasting named Multi-Context Aware Spatio-Temporal Joint Linear Attention (STJLA).
STJLA applies linear attention to a spatio-temporal joint graph to capture global dependence between all spatio-temporal nodes efficiently.
Experiments on two real-world traffic datasets, England and PEMSD7, demonstrate that STJLA achieves up to 9.83% and 3.08% accuracy improvement in MAE over state-of-the-art baselines.
arXiv Detail & Related papers (2021-12-04T06:39:18Z) - Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z) - GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner; a minimal sketch of this decoupled design appears after this list.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z) - A Novel Framework for Spatio-Temporal Prediction of Environmental Data Using Deep Learning [0.0]
We introduce a framework for spatio-temporal prediction of climate and environmental data using deep learning.
Specifically, we introduce functions that can be spatially mapped on a regular grid, allowing the reconstruction of the complete spatio-temporal signal.
Applications on simulated and real-world data show the effectiveness of the proposed framework.
arXiv Detail & Related papers (2020-07-23T07:44:04Z) - A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)
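As a companion to the GTA entry above, the following is a hedged sketch of decoupled spatio-temporal attention: spatial attention within each frame, followed by global attention across all frames at each spatial location. Layer names and tensor shapes are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class DecoupledSpatioTemporalAttention(nn.Module):
    """Illustrative decoupled spatial-then-global-temporal attention (assumed design)."""
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- T frames, N spatial tokens per frame, C channels
        B, T, N, C = x.shape
        s = x.reshape(B * T, N, C)                  # spatial attention within each frame
        s = s + self.spatial(s, s, s)[0]
        t = s.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        t = t + self.temporal(t, t, t)[0]           # global attention across all frames
        return t.reshape(B, N, T, C).permute(0, 2, 1, 3)

x = torch.randn(2, 8, 49, 128)  # 2 clips, 8 frames, 7x7 spatial tokens
print(DecoupledSpatioTemporalAttention()(x).shape)  # torch.Size([2, 8, 49, 128])
```

Decoupling keeps the temporal attention global over the whole sequence while avoiding full joint spatio-temporal attention, whose cost grows quadratically with T*N.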
This list is automatically generated from the titles and abstracts of the papers on this site.