Boosting Salient Object Detection with Transformer-based Asymmetric
Bilateral U-Net
- URL: http://arxiv.org/abs/2108.07851v6
- Date: Mon, 21 Aug 2023 05:47:52 GMT
- Title: Boosting Salient Object Detection with Transformer-based Asymmetric
Bilateral U-Net
- Authors: Yu Qiu, Yun Liu, Le Zhang, Jing Xu
- Abstract summary: Existing salient object detection (SOD) methods mainly rely on U-shaped convolutional neural networks (CNNs) with skip connections.
We propose a transformer-based Asymmetric Bilateral U-Net (ABiU-Net) to learn both global and local representations for SOD.
ABiU-Net performs favorably against previous state-of-the-art SOD methods.
- Score: 19.21709807149165
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing salient object detection (SOD) methods mainly rely on U-shaped
convolutional neural networks (CNNs) with skip connections to combine the global
contexts and local spatial details that are crucial for locating salient
objects and refining object details, respectively. Despite great successes, the
ability of CNNs to learn global contexts is limited. Recently, the vision
transformer has achieved revolutionary progress in computer vision owing to its
powerful modeling of global dependencies. However, directly applying the
transformer to SOD is suboptimal because the transformer lacks the ability to
learn local spatial representations. To this end, this paper explores the
combination of transformers and CNNs to learn both global and local
representations for SOD. We propose a transformer-based Asymmetric Bilateral
U-Net (ABiU-Net). The asymmetric bilateral encoder has a transformer path and a
lightweight CNN path, where the two paths communicate at each encoder stage to
learn complementary global contexts and local spatial details, respectively.
The asymmetric bilateral decoder also consists of two paths to process features
from the transformer and CNN encoder paths, with communication at each decoder
stage for decoding coarse salient object locations and fine-grained object
details, respectively. Such communication between the two encoder/decoder paths
enables ABiU-Net to learn complementary global and local representations,
taking advantage of the natural merits of transformers and CNNs, respectively.
Hence, ABiU-Net provides a new perspective for transformer-based SOD. Extensive
experiments demonstrate that ABiU-Net performs favorably against previous
state-of-the-art SOD methods. The code is available at
https://github.com/yuqiuyuqiu/ABiU-Net.
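The abstract describes two cooperating encoder paths that exchange features at every stage. The PyTorch sketch below illustrates that general idea for a single encoder stage; the module names, channel sizes, and the 1x1-convolution fusion used for cross-path communication are illustrative assumptions, not the authors' implementation (see the repository linked above for the official code).

```python
# Minimal sketch of one asymmetric bilateral encoder stage, assuming a standard
# transformer encoder layer for the global path, a depth-wise separable conv
# block for the lightweight CNN path, and 1x1-conv cross-path fusion.
import torch
import torch.nn as nn


class BilateralStage(nn.Module):
    """One encoder stage: a transformer block (global context) and a light
    CNN block (local detail) that exchange features before the next stage."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Transformer path: models global dependencies over all tokens.
        self.transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=2 * dim,
            batch_first=True)
        # Lightweight CNN path: depth-wise separable conv for local details.
        self.cnn = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True))
        # Cross-path communication, fused here with 1x1 convs (an assumption).
        self.cnn_to_trans = nn.Conv2d(dim, dim, 1)
        self.trans_to_cnn = nn.Conv2d(dim, dim, 1)

    def forward(self, x_trans: torch.Tensor, x_cnn: torch.Tensor):
        b, c, h, w = x_cnn.shape
        # Transformer path works on a token sequence, CNN path on a feature map.
        tokens = self.transformer(x_trans.flatten(2).transpose(1, 2))
        t_map = tokens.transpose(1, 2).reshape(b, c, h, w)
        c_map = self.cnn(x_cnn)
        # Bidirectional exchange of complementary global and local features.
        return t_map + self.cnn_to_trans(c_map), c_map + self.trans_to_cnn(t_map)


if __name__ == "__main__":
    stage = BilateralStage(dim=64)
    feat = torch.randn(2, 64, 32, 32)
    t_out, c_out = stage(feat, feat)
    print(t_out.shape, c_out.shape)  # torch.Size([2, 64, 32, 32]) twice
```

In the full model this pattern would repeat per stage (with downsampling in between) and be mirrored in the decoder, where the two paths again communicate to recover coarse salient object locations and fine-grained details.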
Related papers
- Interaction-Guided Two-Branch Image Dehazing Network [1.26404863283601]
Image dehazing aims to restore clean images from hazy ones.
CNNs and Transformers have demonstrated exceptional performance in local and global feature extraction.
We propose a novel dual-branch image dehazing framework that guides CNN and Transformer components interactively.
arXiv Detail & Related papers (2024-10-14T03:21:56Z) - ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions.
Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks.
We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z) - CompletionFormer: Depth Completion with Convolutions and Vision
Transformers [0.0]
This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure.
Our CompletionFormer outperforms state-of-the-art CNNs-based methods on the outdoor KITTI Depth Completion benchmark and indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 FLOPs) compared to pure Transformer-based methods.
arXiv Detail & Related papers (2023-04-25T17:59:47Z) - ConvFormer: Combining CNN and Transformer for Medical Image Segmentation [17.88894109620463]
We propose a hierarchical CNN and Transformer hybrid architecture, called ConvFormer, for medical image segmentation.
Our ConvFormer, trained from scratch, outperforms various CNN- or Transformer-based architectures, achieving state-of-the-art performance.
arXiv Detail & Related papers (2022-11-15T23:11:22Z) - Transformer-Guided Convolutional Neural Network for Cross-View
Geolocalization [20.435023745201878]
We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone extracting a feature map from an input image and a Transformer head modeling global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
arXiv Detail & Related papers (2022-04-21T08:46:41Z) - SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient
object detection [12.126413875108993]
We propose SwinNet, a cross-modality fusion model for RGB-D and RGB-T salient object detection.
The proposed model outperforms the state-of-the-art models on RGB-D and RGB-T datasets.
arXiv Detail & Related papers (2022-04-12T07:37:39Z) - EDTER: Edge Detection with Transformer [71.83960813880843]
We propose a novel transformer-based edge detector, Edge Detection TransformER (EDTER), to extract clear and crisp object boundaries and meaningful edges.
EDTER exploits the full image context information and detailed local cues simultaneously.
Experiments on BSDS500, NYUDv2, and Multicue demonstrate the superiority of EDTER over state-of-the-art methods.
arXiv Detail & Related papers (2022-03-16T11:55:55Z) - Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, the vision transformer, into salient object detection.
With the global view in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods on five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z) - Container: Context Aggregation Network [83.12004501984043]
A recent finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z) - Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is an Unet-like pure Transformer for medical image segmentation.
The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z) - LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
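The LocalViT entry above names a concrete mechanism: a depth-wise convolution inserted into the transformer's feed-forward network, echoing an inverted residual block. The sketch below illustrates that idea; the class name, expansion ratio, and activation choice are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of a locality-enhanced feed-forward network, assuming an
# inverted-residual-style expand -> depth-wise conv -> project structure.
import torch
import torch.nn as nn


class LocalityFeedForward(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden, 1)         # 1x1 expand (FFN's first linear)
        self.dwconv = nn.Conv2d(hidden, hidden, 3,      # 3x3 depth-wise conv adds locality
                                padding=1, groups=hidden)
        self.project = nn.Conv2d(hidden, dim, 1)        # 1x1 project (FFN's second linear)
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h*w, dim) -> reshape to an image grid so the
        # depth-wise convolution can mix neighbouring tokens.
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.project(self.act(self.dwconv(self.act(self.expand(x)))))
        return x.flatten(2).transpose(1, 2)             # back to a token sequence


if __name__ == "__main__":
    ffn = LocalityFeedForward(dim=64)
    out = ffn(torch.randn(2, 14 * 14, 64), h=14, w=14)
    print(out.shape)  # torch.Size([2, 196, 64])
```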
This list is automatically generated from the titles and abstracts of the papers on this site.