Geo-ConvGRU: Geographically Masked Convolutional Gated Recurrent Unit for Bird-Eye View Segmentation
- URL: http://arxiv.org/abs/2412.20171v1
- Date: Sat, 28 Dec 2024 14:59:48 GMT
- Title: Geo-ConvGRU: Geographically Masked Convolutional Gated Recurrent Unit for Bird-Eye View Segmentation
- Authors: Guanglei Yang, Yongqiang Zhang, Wanlong Li, Yu Tang, Weize Shang, Feng Wen, Hongbo Zhang, Mingli Ding,
- Abstract summary: Convolutional Neural Networks (CNNs) have significantly impacted various computer vision tasks.<n>CNNs struggle to model long-range dependencies explicitly due to the localized nature of convolution operations.<n>We introduce a simple yet effective module, Geographically Masked Convolutional Gated Recurrent Unit (Geo-ConvGRU), tailored for Bird's-Eye View segmentation.
- Score: 17.023625615663665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Convolutional Neural Networks (CNNs) have significantly impacted various computer vision tasks, however, they inherently struggle to model long-range dependencies explicitly due to the localized nature of convolution operations. Although Transformers have addressed limitations in long-range dependencies for the spatial dimension, the temporal dimension remains underexplored. In this paper, we first highlight that 3D CNNs exhibit limitations in capturing long-range temporal dependencies. Though Transformers mitigate spatial dimension issues, they result in a considerable increase in parameter and processing speed reduction. To overcome these challenges, we introduce a simple yet effective module, Geographically Masked Convolutional Gated Recurrent Unit (Geo-ConvGRU), tailored for Bird's-Eye View segmentation. Specifically, we substitute the 3D CNN layers with ConvGRU in the temporal module to bolster the capacity of networks for handling temporal dependencies. Additionally, we integrate a geographical mask into the Convolutional Gated Recurrent Unit to suppress noise introduced by the temporal module. Comprehensive experiments conducted on the NuScenes dataset substantiate the merits of the proposed Geo-ConvGRU, revealing that our approach attains state-of-the-art performance in Bird's-Eye View segmentation.
Related papers
- Point Cloud Denoising With Fine-Granularity Dynamic Graph Convolutional Networks [58.050130177241186]
Noise perturbations often corrupt 3-D point clouds, hindering downstream tasks such as surface reconstruction, rendering, and further processing.
This paper introduces finegranularity dynamic graph convolutional networks called GDGCN, a novel approach to denoising in 3-D point clouds.
arXiv Detail & Related papers (2024-11-21T14:19:32Z) - Multivariate Time-Series Anomaly Detection based on Enhancing Graph Attention Networks with Topological Analysis [31.43159668073136]
Unsupervised anomaly detection in time series is essential in industrial applications, as it significantly reduces the need for manual intervention.
Traditional methods use Graph Neural Networks (GNNs) or Transformers to analyze spatial while RNNs to model temporal dependencies.
This paper introduces a novel temporal model built on an enhanced Graph Attention Network (GAT) for multivariate time series anomaly detection called TopoGDN.
arXiv Detail & Related papers (2024-08-23T14:06:30Z) - Global-to-Local Modeling for Video-based 3D Human Pose and Shape
Estimation [53.04781510348416]
Video-based 3D human pose and shape estimations are evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, Global-to-Local Transformer (GLoT)
Our GLoT surpasses previous state-of-the-art methods with the lowest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z) - Temporally Consistent Transformers for Video Generation [80.45230642225913]
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world.
No established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies.
We introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time.
arXiv Detail & Related papers (2022-10-05T17:15:10Z) - Multi-Scale Spatial Temporal Graph Convolutional Network for
Skeleton-Based Action Recognition [13.15374205970988]
We present a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module.
The MS-GC and MT-GC modules decompose the corresponding local graph convolution into a set of sub-graph convolutions, forming a hierarchical residual architecture.
We propose a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition.
arXiv Detail & Related papers (2022-06-27T03:17:33Z) - Contextual Attention Network: Transformer Meets U-Net [0.0]
convolutional neural networks (CNN) have become the de facto standard and attained immense success in medical image segmentation.
However, CNN based methods fail to build long-range dependencies and global context connections.
Recent articles have exploited Transformer variants for medical image segmentation tasks.
arXiv Detail & Related papers (2022-03-02T21:10:24Z) - Deep Geospatial Interpolation Networks [15.942343748489376]
We propose a novel deep neural network called as Deep Geospatial Interpolation Network(DGIN)
DGIN incorporates both spatial and temporal relationships and has significantly lower training time.
We evaluate DGIN on the MODIS dataset from two different regions.
arXiv Detail & Related papers (2021-08-15T06:57:36Z) - Augmented Transformer with Adaptive Graph for Temporal Action Proposal
Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping a snippet actionness loss and a front block, dubbed augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel temporal-temporal convolution block that is capable of extracting at multiple resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.