TokenCut: Segmenting Objects in Images and Videos with Self-supervised
Transformer and Normalized Cut
- URL: http://arxiv.org/abs/2209.00383v3
- Date: Tue, 5 Dec 2023 09:01:49 GMT
- Title: TokenCut: Segmenting Objects in Images and Videos with Self-supervised
Transformer and Normalized Cut
- Authors: Yangtao Wang (M-PSI), Xi Shen, Yuan Yuan (MIT CSAIL), Yuming Du,
Maomao Li, Shell Xu Hu, James L Crowley (M-PSI), Dominique Vaufreydaz (M-PSI)
- Abstract summary: We describe a graph-based algorithm that uses the features obtained by a self-supervised transformer to detect and segment salient objects in images and videos.
Despite the simplicity of this approach, it achieves state-of-the-art results on several common image and video detection and segmentation tasks.
- Score: 9.609330588890632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we describe a graph-based algorithm that uses the features
obtained by a self-supervised transformer to detect and segment salient objects
in images and videos. With this approach, the image patches that compose an
image or video are organised into a fully connected graph, where the edge
between each pair of patches is labeled with a similarity score between patches
using features learned by the transformer. Detection and segmentation of
salient objects is then formulated as a graph-cut problem and solved using the
classical Normalized Cut algorithm. Despite the simplicity of this approach, it
achieves state-of-the-art results on several common image and video detection
and segmentation tasks. For unsupervised object discovery, this approach
outperforms the competing approaches by a margin of 6.1%, 5.7%, and 2.6%,
respectively, when tested with the VOC07, VOC12, and COCO20K datasets. For the
unsupervised saliency detection task in images, this method improves the score
for Intersection over Union (IoU) by 4.4%, 5.6%, and 5.2% when tested with the
ECSSD, DUTS, and DUT-OMRON datasets, respectively, compared to current
state-of-the-art techniques. This method also achieves competitive results for
unsupervised video object segmentation tasks with the DAVIS, SegTrack-v2, and FBMS
datasets.
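As a concrete illustration of the pipeline described above, here is a minimal sketch of the graph construction and Normalized Cut step. It assumes per-patch features from a self-supervised ViT (e.g. DINO) have already been extracted; the similarity threshold `tau` and the small edge weight `eps` are illustrative values, not necessarily the authors' exact settings.

```python
import numpy as np
from scipy.linalg import eigh

def tokencut_bipartition(features, tau=0.2, eps=1e-5):
    """Split N patch tokens into two groups with a Normalized Cut.

    features: (N, D) array of per-patch features from a self-supervised ViT.
    Returns a boolean mask of length N marking the putative object partition.
    """
    # 1. Fully connected graph: cosine similarity between every pair of patches.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    # 2. Edge weights: keep strong similarities, give weak edges a tiny weight
    #    so the graph stays connected.
    W = np.where(sim > tau, 1.0, eps)
    # 3. Classical Normalized Cut relaxation: second smallest eigenvector of
    #    the generalized eigenproblem (D - W) x = lambda * D x.
    d = W.sum(axis=1)
    D = np.diag(d)
    _, vecs = eigh(D - W, D, subset_by_index=[0, 1])
    fiedler = vecs[:, 1]
    # 4. Bipartition at the mean of the eigenvector; keep the side containing
    #    the most salient token (largest absolute eigenvector value).
    mask = fiedler > fiedler.mean()
    if not mask[np.argmax(np.abs(fiedler))]:
        mask = ~mask
    return mask
```

The second smallest generalized eigenvector is the standard spectral relaxation of the Normalized Cut objective; thresholding it yields the foreground/background bipartition from which a detection box or segmentation mask can then be derived.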
Related papers
- UnSeGArmaNet: Unsupervised Image Segmentation using Graph Neural Networks with Convolutional ARMA Filters [10.940349832919699]
We propose an unsupervised segmentation framework with a pre-trained ViT.
By harnessing the graph structure inherent in the image, the proposed method achieves notable segmentation performance.
The proposed method provides state-of-the-art performance (even comparable to supervised methods) on benchmark image segmentation datasets.
arXiv Detail & Related papers (2024-10-08T15:10:09Z) - Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
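A minimal PyTorch sketch of the attention-reuse idea summarized above, under the assumption that a later transformer block consumes the cached attention output of an earlier block through a cheap learned function; module and parameter names are illustrative, not the SkipAt authors' implementation.

```python
import torch
import torch.nn as nn

class SkipAttentionBlock(nn.Module):
    """Transformer block that reuses attention computed by an earlier layer."""

    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Lightweight stand-in for full multi-head self-attention: it maps the
        # reused attention features into this layer's representation space.
        self.reuse_fn = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x: torch.Tensor, cached_attn_out: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens; cached_attn_out: attention output saved from an
        # earlier block, same shape. No O(N^2) attention is computed here.
        x = x + self.reuse_fn(self.norm1(cached_attn_out))
        x = x + self.mlp(self.norm2(x))
        return x
```

Because nothing quadratic in the number of tokens is computed in this block, the cost of the skipped layer reduces to the token-wise reuse function plus the MLP.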
arXiv Detail & Related papers (2023-01-05T18:59:52Z) - Video Segmentation Learning Using Cascade Residual Convolutional Neural
Network [0.0]
We propose a novel deep learning video segmentation approach that incorporates residual information into the foreground detection learning process.
Experiments conducted on the Change Detection 2014 dataset and on the private PetrobrasROUTES dataset from Petrobras support the effectiveness of the proposed approach.
arXiv Detail & Related papers (2022-12-20T16:56:54Z) - Guess What Moves: Unsupervised Video and Image Segmentation by
Anticipating Motion [92.80981308407098]
We propose an approach that combines the strengths of motion-based and appearance-based segmentation.
We propose to supervise an image segmentation network, tasking it with predicting regions that are likely to contain simple motion patterns.
In the unsupervised video segmentation mode, the network is trained on a collection of unlabelled videos, using the learning process itself as an algorithm to segment these videos.
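One plausible, simplified instantiation of that training signal, assuming a constant-flow-per-region motion model (the paper's actual motion model and loss may differ):

```python
import torch

def simple_motion_loss(masks: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """masks: (B, K, H, W) soft region assignments (softmax over K).
    flow:  (B, 2, H, W) optical flow for the same frames.
    Returns the flow-reconstruction error under a constant-flow-per-region model.
    """
    B, K, H, W = masks.shape
    m = masks.reshape(B, K, 1, H * W)                         # (B, K, 1, HW)
    f = flow.reshape(B, 1, 2, H * W)                          # (B, 1, 2, HW)
    # Weighted mean flow of each region = its "simple motion pattern".
    region_flow = (m * f).sum(-1) / m.sum(-1).clamp(min=1e-6)  # (B, K, 2)
    # Reconstruct the flow field from per-region constant flows.
    recon = (m * region_flow.unsqueeze(-1)).sum(1)            # (B, 2, HW)
    return ((recon - flow.reshape(B, 2, H * W)) ** 2).mean()
```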
arXiv Detail & Related papers (2022-05-16T17:55:34Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has minimal run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - A Unified Transformer Framework for Group-based Segmentation:
Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [59.21990697929617]
Humans tend to discover objects by learning from a group of images or several frames of video, since we live in a dynamic world.
Previous approaches design different networks on similar tasks separately, and they are difficult to apply to each other.
We introduce a unified framework to tackle these issues, termed UFO (Unified Framework for Co-Object segmentation).
arXiv Detail & Related papers (2022-03-09T13:35:19Z) - Self-Supervised Transformers for Unsupervised Object Discovery using
Normalized Cut [0.0]
We demonstrate a graph-based approach that uses the self-supervised transformer features to discover an object from an image.
Visual tokens are viewed as nodes in a weighted graph with edges representing a connectivity score based on the similarity of tokens.
For weakly supervised object detection, we achieve competitive performance on CUB and ImageNet.
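Continuing the sketch given after the TokenCut abstract above, here is a rough illustration of how the bipartition over patch tokens could be turned into a single box for object discovery; the patch grid shape, patch size, and seed heuristic are assumptions for illustration, not the exact procedure from the paper.

```python
import numpy as np
from scipy.ndimage import label

def mask_to_box(mask, fiedler, grid_hw, patch=16):
    """mask, fiedler: length H*W arrays over the patch grid.
    Returns (x0, y0, x1, y1) in pixel coordinates.
    """
    H, W = grid_hw
    grid = mask.reshape(H, W)
    comps, _ = label(grid)                        # connected components of the foreground
    # Seed: the foreground patch with the strongest eigenvector response.
    scores = np.abs(fiedler).reshape(H, W) * grid
    seed = np.unravel_index(np.argmax(scores), (H, W))
    obj = comps == comps[seed]                    # component containing the seed patch
    ys, xs = np.nonzero(obj)
    return xs.min() * patch, ys.min() * patch, (xs.max() + 1) * patch, (ys.max() + 1) * patch
```

Taking only the connected component around the most salient patch, rather than the whole foreground partition, guards against the partition spilling onto disconnected background regions.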
arXiv Detail & Related papers (2022-02-23T14:27:36Z) - Box Supervised Video Segmentation Proposal Network [3.384080569028146]
We propose a box-supervised video object segmentation proposal network, which takes advantage of intrinsic video properties.
The proposed method outperforms the state-of-the-art self-supervised benchmark by 16.4% and 6.9%.
We provide extensive tests and ablations on the datasets, demonstrating the robustness of our method.
arXiv Detail & Related papers (2022-02-14T20:38:28Z) - Deep ensembles based on Stochastic Activation Selection for Polyp
Segmentation [82.61182037130406]
This work deals with medical image segmentation and in particular with accurate polyp detection and segmentation during colonoscopy examinations.
The basic architecture for image segmentation consists of an encoder and a decoder.
We compare several variants of the DeepLab architecture obtained by varying the decoder backbone.
arXiv Detail & Related papers (2021-04-02T02:07:37Z) - Saliency Enhancement using Gradient Domain Edges Merging [65.90255950853674]
We develop a method to merge edge information with saliency maps to improve saliency detection performance.
This leads to our proposed saliency enhancement using edges (SEE), with an average improvement of at least 3.4 times on the DUT-OMRON dataset.
The SEE algorithm is split into two parts: SEE-Pre for preprocessing and SEE-Post for postprocessing.
arXiv Detail & Related papers (2020-02-11T14:04:56Z)