A Unified Transformer Framework for Group-based Segmentation:
Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection
- URL: http://arxiv.org/abs/2203.04708v2
- Date: Fri, 11 Mar 2022 07:37:37 GMT
- Title: A Unified Transformer Framework for Group-based Segmentation:
Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection
- Authors: Yukun Su, Jingliang Deng, Ruizhou Sun, Guosheng Lin, Qingyao Wu
- Abstract summary: Humans tend to mine objects by learning from a group of images or several frames of video, since we live in a dynamic world.
Previous approaches design different networks for these similar tasks separately, and the networks are difficult to transfer to one another.
We introduce a unified framework, termed UFO (Unified Framework for Co-Object Segmentation), to tackle these issues.
- Score: 59.21990697929617
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Humans tend to mine objects by learning from a group of images or several
frames of video, since we live in a dynamic world. In the computer vision area,
much research focuses on co-segmentation (CoS), co-saliency detection (CoSD),
and video salient object detection (VSOD) to discover the co-occurrent objects.
However, previous approaches design different networks for these similar tasks
separately, and the networks are difficult to transfer to one another, which lowers the
upper bound of the transferability of deep learning frameworks. Besides, they
fail to take full advantage of the inter- and intra-image feature cues within a
group of images. In this paper, we introduce a unified framework, termed UFO
(Unified Framework for Co-Object Segmentation), to tackle these issues.
Specifically, we first introduce a transformer block, which views the image
features as patch tokens and then captures their long-range dependencies
through the self-attention mechanism. This helps the network excavate the
patch-structured similarities among the relevant objects. Furthermore, we
propose an intra-MLP learning module that produces self-masks to help the
network avoid partial activation. Extensive experiments on four CoS
benchmarks (PASCAL, iCoseg, Internet and MSRC), three CoSD benchmarks
(Cosal2015, CoSOD3k, and CoCA) and four VSOD benchmarks (DAVIS16, FBMS, ViSal
and SegV2) show that our method outperforms other state-of-the-art methods on three
different tasks in both accuracy and speed using the same network
architecture, which can reach 140 FPS in real time.
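The transformer block described above flattens the patch tokens of all images in a group into one sequence, so self-attention can link patches of the same object across different images. The following is a minimal single-head sketch of that idea in numpy; the function name, shapes, and learned projection matrices `Wq`, `Wk`, `Wv` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def group_self_attention(features, Wq, Wk, Wv):
    """Single-head self-attention over patch tokens from a group of images.

    features: (G, P, D) -- G images, P patch tokens each, D channels.
    All G*P tokens are treated as one sequence, so attention weights can
    connect patches of co-occurrent objects across different images.
    """
    G, P, D = features.shape
    tokens = features.reshape(G * P, D)       # flatten the group into one sequence
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))      # (G*P, G*P) cross-image affinities
    out = attn @ v                            # aggregate features by affinity
    return out.reshape(G, P, D)
```

In a full model this block would use learned projections and multiple heads; the point here is only that flattening the group dimension is what lets attention capture inter-image similarities.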
Related papers
- FCC: Fully Connected Correlation for Few-Shot Segmentation [11.277022867553658]
Few-shot segmentation (FSS) aims to segment the target object in a query image using only a small set of support images and masks.
Previous methods have tried to obtain prior information by creating correlation maps from pixel-level correlation on final-layer or same-layer features.
We introduce FCC (Fully Connected Correlation) to integrate pixel-level correlations between support and query features.
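Pixel-level correlation between support and query features, as used in few-shot segmentation methods like the one summarized above, is typically a dense cosine-similarity map. The sketch below shows the generic computation; shapes and the function name are illustrative assumptions, not FCC's exact formulation.

```python
import numpy as np

def correlation_map(query_feat, support_feat, eps=1e-8):
    """Dense pixel-level cosine correlation between two feature maps.

    query_feat:   (D, Hq, Wq) feature map of the query image.
    support_feat: (D, Hs, Ws) feature map of the support image (in practice
                  masked so only foreground pixels contribute).
    Returns a (Hq, Wq, Hs, Ws) map of cosine similarities.
    """
    D, Hq, Wq = query_feat.shape
    _, Hs, Ws = support_feat.shape
    q = query_feat.reshape(D, -1)
    s = support_feat.reshape(D, -1)
    # L2-normalize each pixel's feature vector so the dot product is cosine similarity
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)
    corr = q.T @ s                            # (Hq*Wq, Hs*Ws)
    return corr.reshape(Hq, Wq, Hs, Ws)
```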
arXiv Detail & Related papers (2024-11-18T03:32:02Z) - A Simple yet Effective Network based on Vision Transformer for
Camouflaged Object and Salient Object Detection [33.30644598646274]
We propose a simple yet effective network (SENet) based on vision Transformer (ViT)
To enhance the Transformer's ability to model local information, we propose a local information capture module (LICM)
We also propose a dynamic weighted loss (DW loss) based on Binary Cross-Entropy (BCE) and Intersection over Union (IoU) loss, which guides the network to pay more attention to those smaller and more difficult-to-find target objects.
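A loss combining Binary Cross-Entropy with an IoU term, as the DW loss above does, can be sketched as follows. The dynamic weighting here (upweighting maps with rarer foreground, so smaller targets count more) is an illustrative stand-in for the paper's scheme, not the authors' exact formulation.

```python
import numpy as np

def bce_iou_loss(pred, target, eps=1e-8):
    """BCE + IoU loss for a predicted saliency map with values in (0, 1).

    pred, target: arrays of the same shape; target is a binary mask.
    """
    pred = np.clip(pred, eps, 1 - eps)
    # illustrative dynamic weight: fewer foreground pixels -> larger weight,
    # steering the network toward smaller, harder-to-find targets
    w = 1.0 + (1.0 - target.mean())
    bce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()
    inter = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    iou = 1.0 - (inter + eps) / (union + eps)   # soft IoU loss
    return w * (bce + iou)
```

The BCE term supervises each pixel independently, while the IoU term scores the mask as a whole, which is why the two are commonly combined in salient object detection.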
arXiv Detail & Related papers (2024-02-29T07:29:28Z) - Scalable Video Object Segmentation with Identification Mechanism [125.4229430216776]
This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS).
We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST)
Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
arXiv Detail & Related papers (2022-03-22T03:33:27Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - Prototypical Cross-Attention Networks for Multiple Object Tracking and
Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on Youtube-VIS and BDD100K datasets.
arXiv Detail & Related papers (2021-06-22T17:57:24Z) - Associating Objects with Transformers for Video Object Segmentation [74.51719591192787]
We propose an Associating Objects with Transformers (AOT) approach to match and decode multiple objects uniformly.
AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space.
We ranked 1st in the 3rd Large-scale Video Object Segmentation Challenge.
arXiv Detail & Related papers (2021-06-04T17:59:57Z) - CoSformer: Detecting Co-Salient Object with Transformers [2.3148470932285665]
Co-Salient Object Detection (CoSOD) aims at simulating the human visual system to discover the common and salient objects from a group of relevant images.
We propose the Co-Salient Object Detection Transformer (CoSformer) network to capture both salient and common visual patterns from multiple images.
arXiv Detail & Related papers (2021-04-30T02:39:12Z) - Target Detection and Segmentation in Circular-Scan
Synthetic-Aperture-Sonar Images using Semi-Supervised Convolutional
Encoder-Decoders [9.713290203986478]
We propose a saliency-based, multi-target detection and segmentation framework for multi-aspect, semi-coherent imagery.
Our framework relies on a multi-branch, convolutional encoder-decoder network (MB-CEDN)
We show that our framework outperforms supervised deep networks.
arXiv Detail & Related papers (2021-01-10T18:58:45Z) - Auto-Panoptic: Cooperative Multi-Component Architecture Search for
Panoptic Segmentation [144.50154657257605]
We propose an efficient framework to simultaneously search for all main components including backbone, segmentation branches, and feature fusion module.
Our searched architecture, namely Auto-Panoptic, achieves the new state-of-the-art on the challenging COCO and ADE20K benchmarks.
arXiv Detail & Related papers (2020-10-30T08:34:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.