Feature boosting with efficient attention for scene parsing
- URL: http://arxiv.org/abs/2402.19250v1
- Date: Thu, 29 Feb 2024 15:22:21 GMT
- Title: Feature boosting with efficient attention for scene parsing
- Authors: Vivek Singh, Shailza Sharma and Fabio Cuzzolin
- Abstract summary: This paper presents a novel feature-boosting network that gathers context from multiple levels of feature extraction.
It computes the attention weights for each level of representation to generate the final class labels.
The proposed model outperforms all state-of-the-art models on both the ADE20K and the Cityscapes datasets.
- Score: 6.752935599738123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The complexity of scene parsing grows with the number of object and scene
classes, which is higher in unrestricted open scenes. The biggest challenge is
to model the spatial relation between scene elements while succeeding in
identifying objects at smaller scales. This paper presents a novel
feature-boosting network that gathers spatial context from multiple levels of
feature extraction and computes the attention weights for each level of
representation to generate the final class labels. A novel `channel attention
module' is designed to compute the attention weights, ensuring that features
from the relevant extraction stages are boosted while the others are
attenuated. The model also learns spatial context information at low resolution
to preserve the abstract spatial relationships among scene elements and reduce
computation cost. Spatial attention is subsequently concatenated into a final
feature set before applying feature boosting. Low-resolution spatial attention
features are trained using an auxiliary task that helps learn a coarse
global scene structure. The proposed model outperforms all state-of-the-art
models on both the ADE20K and the Cityscapes datasets.
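The abstract gives no implementation details, but the central mechanism it describes, a channel attention module that produces per-level weights so that features from relevant extraction stages are boosted while the others are attenuated, can be sketched roughly as follows. This is a minimal PyTorch sketch under assumed shapes and layer choices; the class name, the squeeze-and-gate design, and the auxiliary head are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttentionBoost(nn.Module):
    """Hypothetical sketch: re-weight multi-level backbone features with learned
    channel-attention weights, fuse them, and predict class labels, with an
    auxiliary head on the coarsest level. Not the authors' implementation."""

    def __init__(self, in_channels, num_classes, mid_channels=256):
        super().__init__()
        # Project each backbone stage to a common channel width.
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, mid_channels, kernel_size=1) for c in in_channels]
        )
        # Channel attention per level: squeeze globally, predict one weight per
        # channel, so relevant stages are boosted and the others attenuated.
        self.attn = nn.ModuleList(
            [
                nn.Sequential(
                    nn.AdaptiveAvgPool2d(1),
                    nn.Conv2d(mid_channels, mid_channels, kernel_size=1),
                    nn.Sigmoid(),
                )
                for _ in in_channels
            ]
        )
        self.classifier = nn.Conv2d(mid_channels * len(in_channels), num_classes, 1)
        # Auxiliary head on the low-resolution (deepest) level, standing in for
        # the coarse "global scene structure" task described in the abstract.
        self.aux_head = nn.Conv2d(mid_channels, num_classes, 1)

    def forward(self, features):
        # features: list of stage outputs, shallow (high-res) to deep (low-res).
        target_size = features[0].shape[-2:]
        boosted, coarse = [], None
        for f, proj, attn in zip(features, self.proj, self.attn):
            f = proj(f)
            f = f * attn(f)  # boost or attenuate this extraction level
            coarse = f       # keep the last (coarsest) level for the auxiliary task
            boosted.append(
                F.interpolate(f, size=target_size, mode="bilinear", align_corners=False)
            )
        fused = torch.cat(boosted, dim=1)  # concatenate all levels before the head
        return self.classifier(fused), self.aux_head(coarse)
```

For instance, with in_channels=(256, 512, 1024, 2048), the typical ResNet stage widths assumed here, the module consumes the four stage outputs and returns class logits at the finest feature resolution together with a coarse auxiliary map that can carry the extra low-resolution supervision signal mentioned in the abstract.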
Related papers
- Core Context Aware Attention for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling.
Our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
arXiv Detail & Related papers (2024-12-17T01:54:08Z)
- S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving [12.406655155106424]
We propose S3PT, a novel scene semantics and structure guided clustering method, to provide more scene-consistent objectives for self-supervised training.
Our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals.
Second, we introduce object diversity consistent spatial clustering, to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs.
Third, we propose a depth-guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation on the feature level.
arXiv Detail & Related papers (2024-10-30T15:00:06Z)
- Object-level Scene Deocclusion [92.39886029550286]
We present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, for object-level scene deocclusion.
To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning.
Experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin.
arXiv Detail & Related papers (2024-06-11T20:34:10Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
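The summary names PCA for localization without further detail; a common realization is to project dense patch features onto their first principal component and threshold the projection into a coarse foreground mask. A minimal sketch under that assumption (the feature source, threshold, and foreground-sign heuristic below are illustrative, not this paper's procedure):

```python
import torch

def pca_localize(feats: torch.Tensor, thresh: float = 0.0) -> torch.Tensor:
    """Hypothetical sketch: localize salient regions by projecting dense
    features onto their first principal component and thresholding.

    feats: (H, W, C) feature map from a self-supervised backbone (assumed).
    Returns a boolean (H, W) foreground mask.
    """
    h, w, c = feats.shape
    x = feats.reshape(-1, c)
    x = x - x.mean(dim=0, keepdim=True)            # center the features
    _, _, vh = torch.linalg.svd(x, full_matrices=False)
    proj = x @ vh[0]                               # projection onto PC1
    mask = proj.reshape(h, w) > thresh             # coarse object mask
    if mask.float().mean() > 0.5:                  # sign of PC1 is arbitrary:
        mask = ~mask                               # keep the smaller side (assumption)
    return mask
```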
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We employ transformers in this work and incorporate them into a hierarchical framework for shape classification and for part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in terms of mean accuracy and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new family of self-attention models, LinGlos, which reach state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark.
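As described, the attention weights come from the keys alone through a saliency gate rather than from query-key dot products, which makes the cost linear in the number of tokens. A minimal single-head sketch of that idea (the gate design, the residual use of the global context vector, and the naming are assumptions, not the LinGlo implementation):

```python
import torch
import torch.nn as nn

class KeyOnlyAttention(nn.Module):
    """Hypothetical sketch of key-only attention: per-token saliency scores
    replace query-key dot products, so cost is linear in sequence length."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 1)   # "saliency gate": one score per token
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        k, v = self.to_k(x), self.to_v(x)
        attn = torch.softmax(self.gate(k), dim=1)       # (batch, tokens, 1)
        context = (attn * v).sum(dim=1, keepdim=True)   # global context vector
        return self.proj(x + context)                   # broadcast back to every token
```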
arXiv Detail & Related papers (2022-07-01T03:36:49Z)
- NEAT: Neural Attention Fields for End-to-End Autonomous Driving [59.60483620730437]
We present NEural ATtention fields (NEAT), a novel representation that enables efficient reasoning for imitation learning models.
NEAT is a continuous function which maps locations in Bird's Eye View (BEV) scene coordinates to waypoints and semantics.
In a new evaluation setting involving adverse environmental conditions and challenging scenarios, NEAT outperforms several strong baselines and achieves driving scores on par with the privileged CARLA expert.
arXiv Detail & Related papers (2021-09-09T17:55:28Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-point estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Multi-layer Feature Aggregation for Deep Scene Parsing Models [19.198074549944568]
In this paper, we explore the effective use of multi-layer feature outputs of the deep parsing networks for spatial-semantic consistency.
The proposed module can automatically select intermediate visual features to correlate spatial and semantic information.
Experiments on four public scene parsing datasets prove that the deep parsing network equipped with the proposed feature aggregation module can achieve very promising results.
arXiv Detail & Related papers (2020-11-04T23:07:07Z)
- Mutual Information Maximization for Robust Plannable Representations [82.83676853746742]
We present MIRO, an information-theoretic representation learning algorithm for model-based reinforcement learning.
We show that our approach is more robust than reconstruction objectives in the presence of distractors and cluttered scenes.
arXiv Detail & Related papers (2020-05-16T21:58:47Z)