Revisiting Sequence-to-Sequence Video Object Segmentation with
Multi-Task Loss and Skip-Memory
- URL: http://arxiv.org/abs/2004.12170v1
- Date: Sat, 25 Apr 2020 15:38:09 GMT
- Title: Revisiting Sequence-to-Sequence Video Object Segmentation with
Multi-Task Loss and Skip-Memory
- Authors: Fatemeh Azimi, Benjamin Bischke, Sebastian Palacio, Federico Raue,
Joern Hees, Andreas Dengel
- Abstract summary: Video Object Segmentation (VOS) is an active research area of the visual domain.
Current approaches lose objects in longer sequences, especially when the object is small or briefly occluded.
We build upon a sequence-to-sequence approach that employs an encoder-decoder architecture together with a memory module for exploiting the sequential data.
- Score: 4.343892430915579
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Object Segmentation (VOS) is an active research area of the visual
domain. One of its fundamental sub-tasks is semi-supervised / one-shot
learning: given only the segmentation mask for the first frame, the task is to
provide pixel-accurate masks for the object over the rest of the sequence.
Despite much progress in recent years, we noticed that many of the existing
approaches lose objects in longer sequences, especially when the object is
small or briefly occluded. In this work, we build upon a sequence-to-sequence
approach that employs an encoder-decoder architecture together with a memory
module for exploiting the sequential data. We further improve this approach by
proposing a model that manipulates multi-scale spatio-temporal information
using memory-equipped skip connections. Furthermore, we incorporate an
auxiliary task based on distance classification which greatly enhances the
quality of edges in segmentation masks. We compare our approach to the state of
the art and show considerable improvement in the contour accuracy metric and
the overall segmentation accuracy.
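The auxiliary distance-classification task described above can be illustrated with a small sketch: each pixel of a ground-truth binary mask is labelled by its quantized distance to the object boundary, and these labels serve as targets for an extra prediction head trained alongside the segmentation head. The BFS-based distance transform, the bin edges, and the function name below are illustrative assumptions, not the paper's exact implementation.

```python
from collections import deque
import numpy as np

def distance_class_targets(mask, bins=(1, 2, 4)):
    """Build per-pixel distance-classification targets from a binary mask.

    Each pixel is labelled by its 4-connected (Manhattan) distance to the
    object boundary, quantized into len(bins) + 1 classes. Class 0 covers
    boundary pixels, so a head trained on these targets is pushed to model
    edges explicitly. The bin edges here are illustrative assumptions.
    """
    h, w = mask.shape
    dist = np.full((h, w), np.inf)
    q = deque()
    # Seed the BFS with boundary pixels: any pixel with at least one
    # 4-neighbour carrying a different label gets distance 0.
    for y in range(h):
        for x in range(w):
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] != mask[y, x]:
                    dist[y, x] = 0
                    q.append((y, x))
                    break
    # Multi-source BFS: propagate unit-cost steps outward from the boundary.
    while q:
        y, x = q.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny, nx] > dist[y, x] + 1:
                dist[ny, nx] = dist[y, x] + 1
                q.append((ny, nx))
    # Quantize distances into class indices 0..len(bins).
    return np.digitize(dist, bins)
```

In a multi-task setup, a cross-entropy loss on these targets would be added to the main segmentation loss with a weighting factor; the effect suggested by the abstract is sharper mask contours, since the network must reason about how far each pixel lies from an edge.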
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Self-supervised Object-Centric Learning for Videos [39.02148880719576]
We propose the first fully unsupervised method for segmenting multiple objects in real-world sequences.
Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames.
Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
arXiv Detail & Related papers (2023-10-10T18:03:41Z) - LOCATE: Self-supervised Object Discovery via Flow-guided Graph-cut and
Bootstrapped Self-training [13.985488693082981]
We propose a self-supervised object discovery approach that leverages motion and appearance information to produce high-quality object segmentation masks.
We demonstrate the effectiveness of our approach, named LOCATE, on multiple standard video object segmentation, image saliency detection, and object segmentation benchmarks.
arXiv Detail & Related papers (2023-08-22T07:27:09Z) - Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast
Contrastive Fusion [110.84357383258818]
We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation.
The core of our approach is a slow-fast clustering objective function, which is scalable and well-suited for scenes with a large number of objects.
Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets.
arXiv Detail & Related papers (2023-06-07T17:57:45Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has minimal run-time compared to contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Prototypical Cross-Attention Networks for Multiple Object Tracking and
Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on the YouTube-VIS and BDD100K datasets.
arXiv Detail & Related papers (2021-06-22T17:57:24Z) - Target-Aware Object Discovery and Association for Unsupervised Video
Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS 2017 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z) - Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in
Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance, leading to state-of-the-art results in both the VOS and the more challenging tracking domains.
arXiv Detail & Related papers (2021-01-06T18:56:24Z) - Multi-task deep learning for image segmentation using recursive
approximation tasks [5.735162284272276]
Deep neural networks for segmentation usually require a massive amount of pixel-level labels, which are expensive to create manually.
In this work, we develop a multi-task learning method to relax this constraint.
The network is trained on an extremely small amount of precisely segmented images and a large set of coarse labels.
arXiv Detail & Related papers (2020-05-26T21:35:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.