M$^3$Net: Multilevel, Mixed and Multistage Attention Network for Salient
Object Detection
- URL: http://arxiv.org/abs/2309.08365v1
- Date: Fri, 15 Sep 2023 12:46:14 GMT
- Title: M$^3$Net: Multilevel, Mixed and Multistage Attention Network for Salient
Object Detection
- Authors: Yao Yuan, Pan Gao, XiaoYang Tan
- Abstract summary: M$^3$Net is an attention network for Salient Object Detection.
It introduces a cross-attention approach to achieve interaction between multilevel features.
A Mixed Attention Block models context at both global and local levels.
A multilevel supervision strategy optimizes the aggregated features stage by stage.
- Score: 22.60675416709486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing salient object detection methods use U-Net or feature
pyramid structures, which simply aggregate feature maps of different scales,
ignoring their uniqueness, their interdependence, and their respective
contributions to the final prediction. To overcome these limitations, we propose
M$^3$Net, i.e., the Multilevel, Mixed and Multistage attention network for
Salient Object Detection (SOD). Firstly, we propose the Multiscale Interaction
Block, which introduces a cross-attention approach to achieve interaction
between multilevel features, allowing high-level features to guide low-level
feature learning and thus enhancing salient regions. Secondly, since previous
Transformer-based SOD methods locate salient regions using only global
self-attention, inevitably overlooking the details of complex objects, we
propose the Mixed Attention Block. This block combines global self-attention
and window self-attention, modeling context at both global and local levels to
further improve the accuracy of the prediction map. Finally, we propose a
multilevel supervision strategy to optimize the aggregated features
stage by stage. Experiments on six challenging datasets demonstrate that the
proposed M$^3$Net surpasses recent CNN- and Transformer-based SOD methods on
four metrics. Code is available at
https://github.com/I2-Multimedia-Lab/M3Net.
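The two attention mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository above); it is a pure-Python illustration of scaled dot-product attention, reused as cross-attention (low-level features query high-level features, so high-level features guide low-level ones) and as non-overlapping window self-attention (local context, as in the Mixed Attention Block). All function names here are illustrative.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def cross_attention(low_feats, high_feats):
    # low-level features act as queries; high-level features supply
    # keys and values, so semantic context guides low-level learning
    return attention(low_feats, high_feats, high_feats)

def window_self_attention(feats, window):
    # attend only within non-overlapping windows: local context,
    # complementing global self-attention over the full sequence
    out = []
    for start in range(0, len(feats), window):
        block = feats[start:start + window]
        out.extend(attention(block, block, block))
    return out
```

With a single high-level feature vector, cross-attention collapses to copying that vector to every low-level position, which makes the guidance direction easy to verify; the real blocks operate on multi-head projections of dense feature maps.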
Related papers
- Multimodal Collaboration Networks for Geospatial Vehicle Detection in Dense, Occluded, and Large-Scale Events [29.86323896541765]
In large-scale disaster events, the planning of optimal rescue routes depends on the object detection ability at the disaster scene.
Existing methods, which are typically based on the RGB modality, struggle to distinguish targets with similar colors and textures in crowded environments.
We propose a multimodal collaboration network for dense and occluded vehicle detection, MuDet.
arXiv Detail & Related papers (2024-05-14T00:51:15Z) - Salient Object Detection in Optical Remote Sensing Images Driven by
Transformer [69.22039680783124]
We propose a novel Global Extraction Local Exploration Network (GeleNet) for salient object detection in optical remote sensing images (ORSI-SOD).
Specifically, GeleNet first adopts a transformer backbone to generate four-level feature embeddings with global long-range dependencies.
Extensive experiments on three public datasets demonstrate that the proposed GeleNet outperforms relevant state-of-the-art methods.
arXiv Detail & Related papers (2023-09-15T07:14:43Z) - Weakly Aligned Feature Fusion for Multimodal Object Detection [52.15436349488198]
Multimodal data often suffer from the position shift problem, i.e., the image pair is not strictly aligned.
This problem makes it difficult to fuse multimodal features and puzzles the convolutional neural network (CNN) training.
In this article, we propose a general multimodal detector named aligned region CNN (AR-CNN) to tackle the position shift problem.
arXiv Detail & Related papers (2022-04-21T02:35:23Z) - Multi-scale Interactive Network for Salient Object Detection [91.43066633305662]
We propose the aggregate interaction modules to integrate the features from adjacent levels.
To obtain more efficient multi-scale features, the self-interaction modules are embedded in each decoder unit.
Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches.
arXiv Detail & Related papers (2020-07-17T15:41:37Z) - Bifurcated backbone strategy for RGB-D salient object detection [168.19708737906618]
We leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to devise a novel cascaded refinement network.
Our architecture, named Bifurcated Backbone Strategy Network (BBS-Net), is simple, efficient, and backbone-independent.
arXiv Detail & Related papers (2020-07-06T13:01:30Z) - DFNet: Discriminative feature extraction and integration network for
salient object detection [6.959742268104327]
We focus on two aspects of challenges in saliency detection using Convolutional Neural Networks.
Firstly, since salient objects appear in various sizes, single-scale convolutions cannot capture them at the right scale.
Secondly, using multi-level features helps the model use both local and global context.
arXiv Detail & Related papers (2020-04-03T13:56:41Z) - Global Context-Aware Progressive Aggregation Network for Salient Object
Detection [117.943116761278]
We propose a novel network named GCPANet to integrate low-level appearance features, high-level semantic features, and global context features.
We show that the proposed approach outperforms the state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2020-03-02T04:26:10Z) - Cross-layer Feature Pyramid Network for Salient Object Detection [102.20031050972429]
We propose a novel Cross-layer Feature Pyramid Network to improve the progressive fusion in salient object detection.
The distributed features per layer own both semantics and salient details from all other layers simultaneously, and suffer reduced loss of important information.
arXiv Detail & Related papers (2020-02-25T14:06:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.