Motion-Attentive Transition for Zero-Shot Video Object Segmentation
- URL: http://arxiv.org/abs/2003.04253v3
- Date: Thu, 9 Jul 2020 17:34:32 GMT
- Title: Motion-Attentive Transition for Zero-Shot Video Object Segmentation
- Authors: Tianfei Zhou, Shunzhou Wang, Yi Zhou, Yazhou Yao, Jianwu Li, Ling Shao
- Abstract summary: We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
- Score: 99.44383412488703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a novel Motion-Attentive Transition Network
(MATNet) for zero-shot video object segmentation, which provides a new way of
leveraging motion information to reinforce spatio-temporal object
representation. An asymmetric attention block, called Motion-Attentive
Transition (MAT), is designed within a two-stream encoder, which transforms
appearance features into motion-attentive representations at each convolutional
stage. In this way, the encoder becomes deeply interleaved, allowing for
closely hierarchical interactions between object motion and appearance. This is
superior to the typical two-stream architecture, which treats motion and
appearance separately in each stream and often suffers from overfitting to
appearance information. Additionally, a bridge network is proposed to obtain a
compact, discriminative and scale-sensitive representation for multi-level
encoder features, which is further fed into a decoder to achieve segmentation
results. Extensive experiments on three challenging public benchmarks (i.e.,
DAVIS-16, FBMS and YouTube-Objects) show that our model achieves compelling
performance against the state of the art.
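The abstract describes the MAT block only at a high level. The following PyTorch sketch shows one way an asymmetric motion-to-appearance attention transition at a single encoder stage could look; the layer names, channel sizes and gating form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MotionAttentiveTransition(nn.Module):
    """Sketch of an asymmetric MAT-style block: motion features gate the
    appearance features, making the appearance stream motion-attentive.
    All sizes and layer choices here are illustrative assumptions."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Soft spatial attention derived from the motion stream.
        self.motion_gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),
        )
        # Transition layer mixing the gated appearance features back
        # into the motion stream.
        self.transition = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor):
        # (B, C, H, W) -> (B, 1, H, W): spatial attention from motion cues.
        attn = torch.sigmoid(self.motion_gate(motion))
        # Asymmetric direction: appearance is modulated by motion, not vice versa.
        app_attentive = appearance * attn
        # Fuse the motion-attentive appearance features into the motion stream.
        motion_out = self.transition(torch.cat([app_attentive, motion], dim=1))
        return app_attentive, motion_out
```

Applied at every convolutional stage of a two-stream (RGB plus optical flow) encoder, a block like this would interleave the streams in the way the abstract describes; the bridge network and decoder are omitted here.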
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
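The entry above names a sequence-level selection mechanism without detail. Below is a minimal sketch of selecting exemplar masks by temporal consistency, using plain IoU between neighbouring frames as the consistency score; the paper's actual criterion is not reproduced here, so treat every function as hypothetical.

```python
from typing import List

import torch

def mask_iou(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-6) -> float:
    """IoU between two binary masks of shape (H, W)."""
    inter = (a * b).sum()
    union = a.sum() + b.sum() - inter
    return float(inter / (union + eps))

def select_exemplars(masks: List[torch.Tensor], k: int = 5) -> List[int]:
    """Keep the k flow-predicted masks that agree most with their temporal
    neighbours; these serve as exemplars for refining the rest."""
    scores = []
    for t, mask in enumerate(masks):
        neighbours = masks[max(t - 1, 0):t] + masks[t + 1:t + 2]
        iou = sum(mask_iou(mask, n) for n in neighbours)
        scores.append(iou / max(len(neighbours), 1))
    return sorted(range(len(masks)), key=lambda t: scores[t], reverse=True)[:k]
```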
- Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation [47.39455910191075]
Video amodal segmentation is a challenging task in computer vision.
Recent studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting.
This paper presents a rethinking of previous works: in particular, we leverage supervised signals with an object-centric representation.
arXiv Detail & Related papers (2023-09-23T04:12:02Z)
- Efficient Unsupervised Video Object Segmentation Network Based on Motion Guidance [1.5736899098702974]
This paper proposes a video object segmentation network based on motion guidance.
The model comprises a dual-stream network, motion guidance module, and multi-scale progressive fusion module.
Experimental results demonstrate the superior performance of the proposed method.
arXiv Detail & Related papers (2022-11-10T06:13:23Z)
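Of the three components listed for this model, multi-scale progressive fusion is the most generic. A minimal sketch of coarse-to-fine fusion follows; the channel configuration and the merge operator are assumptions, not the paper's design.

```python
from typing import List

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveFusion(nn.Module):
    """Sketch of a multi-scale progressive fusion module: features are fused
    coarse-to-fine, upsampling each level and merging it with the next finer
    one. Channel sizes are illustrative assumptions."""

    def __init__(self, channels: List[int]):  # channels[0] = coarsest level
        super().__init__()
        # One 3x3 merge convolution per coarse-to-fine fusion step.
        self.merge = nn.ModuleList(
            nn.Conv2d(c_coarse + c_fine, c_fine, kernel_size=3, padding=1)
            for c_coarse, c_fine in zip(channels[:-1], channels[1:])
        )

    def forward(self, feats: List[torch.Tensor]) -> torch.Tensor:
        x = feats[0]  # start from the coarsest (lowest-resolution) level
        for conv, fine in zip(self.merge, feats[1:]):
            x = F.interpolate(x, size=fine.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = F.relu(conv(torch.cat([x, fine], dim=1)))
        return x
```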
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
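A depth-ordered layer representation can be illustrated with standard back-to-front visibility compositing. The sketch below is a generic formulation of occlusion between soft object layers, not the authors' model.

```python
import torch

def composite_layers(alphas: torch.Tensor) -> torch.Tensor:
    """Depth-ordered compositing for soft object layers.
    alphas: (L, H, W) per-layer masks, index 0 = closest to the camera.
    Returns (L, H, W) visibility maps where nearer layers occlude farther ones."""
    visible = torch.empty_like(alphas)
    free = torch.ones_like(alphas[0])           # un-occluded fraction of each pixel
    for layer in range(alphas.shape[0]):
        visible[layer] = alphas[layer] * free   # claim what is still visible
        free = free * (1.0 - alphas[layer])     # nearer layers occlude the rest
    return visible
```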
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects via a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
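The dynamic-scale kernel idea from the EAN entry can be sketched as per-clip kernel prediction. The snippet below fixes the kernel size and only adapts the weights, so the scale-selection aspect is simplified; all shapes and layer names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicKernel(nn.Module):
    """Sketch of dynamic kernel generation: a depthwise 3x3 kernel is
    predicted per sample from pooled context, so the local operator adapts
    to the content of each clip. Illustrative only."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.k = k
        self.predict = nn.Linear(channels, channels * k * k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        context = x.mean(dim=(2, 3))                  # (B, C) pooled context
        kernels = self.predict(context).view(b * c, 1, self.k, self.k)
        # Grouped convolution applies each sample's own depthwise kernel.
        out = F.conv2d(x.reshape(1, b * c, h, w), kernels,
                       padding=self.k // 2, groups=b * c)
        return out.view(b, c, h, w)
```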
- MODETR: Moving Object Detection with Transformers [2.4366811507669124]
Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline.
In this paper, we tackle this problem through multi-head attention mechanisms across both the spatial and motion streams.
We propose MODETR, a Moving Object DEtection TRansformer network comprising multi-stream transformers for both spatial and motion modalities.
arXiv Detail & Related papers (2021-06-21T21:56:46Z)
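Cross-stream attention of the kind the MODETR entry describes can be sketched with a single multi-head attention layer in which spatial tokens query motion tokens; the dimensions and the fusion direction are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class CrossStreamAttention(nn.Module):
    """Sketch of cross-attention between spatial and motion token streams,
    in the spirit of a multi-stream detection transformer."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spatial: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # spatial, motion: (B, N, dim) token sequences from the two backbones.
        # Spatial tokens attend to motion tokens, injecting motion cues.
        fused, _ = self.attn(query=spatial, key=motion, value=motion)
        return self.norm(spatial + fused)
```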