See More, Know More: Unsupervised Video Object Segmentation with
Co-Attention Siamese Networks
- URL: http://arxiv.org/abs/2001.06810v1
- Date: Sun, 19 Jan 2020 11:10:39 GMT
- Title: See More, Know More: Unsupervised Video Object Segmentation with
Co-Attention Siamese Networks
- Authors: Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih
Porikli
- Abstract summary: We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task.
We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism.
We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos.
- Score: 184.4379622593225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel network, called CO-attention Siamese Network (COSNet),
to address the unsupervised video object segmentation task from a holistic
view. We emphasize the importance of the inherent correlation among video frames
and incorporate a global co-attention mechanism to further improve
state-of-the-art deep learning based solutions, which primarily focus on
learning discriminative foreground representations over appearance and motion
in short-term temporal segments. The co-attention layers in our network provide
efficient and competent stages for capturing global correlations and scene
context by jointly computing and appending co-attention responses into a joint
feature space. We train COSNet with pairs of video frames, which naturally
augments training data and allows increased learning capacity. During the
segmentation stage, the co-attention model encodes useful information by
processing multiple reference frames together, which is leveraged to infer the
frequently reappearing and salient foreground objects better. We propose a
unified and end-to-end trainable framework where different co-attention
variants can be derived for mining the rich context within videos. Our
extensive experiments over three large benchmarks demonstrate that COSNet
outperforms the current alternatives by a large margin.
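To make the co-attention mechanism concrete, here is a minimal PyTorch sketch of a vanilla co-attention layer between two frame feature maps from a Siamese backbone. This is an illustrative reading of the abstract, not the authors' released code, and all module and variable names are hypothetical: an affinity matrix is computed through a learnable weight, softmax-normalized in each direction, and each frame's co-attention response is appended to its own features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """Vanilla co-attention between two frame feature maps (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        # Learnable weight for the affinity S = F_a^T W F_b
        self.weight = nn.Linear(channels, channels, bias=False)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, C, H, W) features of two frames from a shared backbone
        b, c, h, w = feat_a.shape
        fa = feat_a.flatten(2)  # (B, C, HW)
        fb = feat_b.flatten(2)  # (B, C, HW)
        # Affinity between every pair of locations across the two frames
        affinity = torch.bmm(self.weight(fa.transpose(1, 2)), fb)  # (B, HW_a, HW_b)
        # Each frame attends to the other; softmax normalizes the relevant direction
        att_a = torch.bmm(fb, F.softmax(affinity, dim=2).transpose(1, 2))  # (B, C, HW_a)
        att_b = torch.bmm(fa, F.softmax(affinity, dim=1))                  # (B, C, HW_b)
        # Append co-attention responses to the original features (joint feature space)
        za = torch.cat([feat_a, att_a.view(b, c, h, w)], dim=1)  # (B, 2C, H, W)
        zb = torch.cat([feat_b, att_b.view(b, c, h, w)], dim=1)
        return za, zb
```

One plausible reading of the segmentation stage described above is to run such a layer between the query frame and several reference frames, averaging the responses before decoding.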
Related papers
- SOC: Semantic-Assisted Object Cluster for Referring Video Object
Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z)
- Co-attention Propagation Network for Zero-Shot Video Object Segmentation [91.71692262860323]
Zero-shot video object segmentation (ZS-VOS) aims to segment objects in a video sequence without prior knowledge of these objects.
Existing ZS-VOS methods often struggle to distinguish between foreground and background or to keep track of the foreground in complex scenarios.
We propose an encoder-decoder-based hierarchical co-attention propagation network (HCPN) capable of tracking and segmenting objects.
arXiv Detail & Related papers (2023-04-08T04:45:48Z)
- Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration [13.284951215948052]
We present a novel weakly supervised framework with spatio-temporal collaboration for instance segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
arXiv Detail & Related papers (2022-12-15T02:44:13Z)
- A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [59.21990697929617]
Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world.
Previous approaches design different networks for these similar tasks separately, and the resulting networks are difficult to apply to one another.
We introduce a unified framework, termed UFO (UniFied framework for Co-Object segmentation), to tackle these issues.
arXiv Detail & Related papers (2022-03-09T13:35:19Z)
- Event and Activity Recognition in Video Surveillance for Cyber-Physical Systems [0.0]
We show that long-term motion patterns alone play a pivotal role in the task of recognizing an event.
Only the temporal features are exploited using a hybrid Convolutional Neural Network (CNN) + Recurrent Neural Network (RNN) architecture.
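As a rough illustration of such a hybrid (a sketch under assumed input shapes, not the paper's actual architecture), a small per-frame CNN can feed a GRU whose final hidden state classifies the event:

```python
import torch
import torch.nn as nn

class CNNRNNEventClassifier(nn.Module):
    """Hypothetical CNN+RNN hybrid: per-frame CNN features, temporal GRU, event logits."""
    def __init__(self, feat_dim=256, hidden=128, num_events=10):
        super().__init__()
        self.cnn = nn.Sequential(  # tiny stand-in for a real backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_events)

    def forward(self, clip):
        # clip: (B, T, 3, H, W) video clip; the RNN models long-term motion patterns
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)  # (B, T, feat_dim)
        _, last = self.rnn(feats)            # final hidden state: (1, B, hidden)
        return self.head(last.squeeze(0))    # (B, num_events) event logits
```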
arXiv Detail & Related papers (2021-11-03T08:30:38Z)
- Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos [78.34818195786846]
We introduce the task of spatially localizing narrated interactions in videos.
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
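A minimal sketch of a cross-modal attention block paired with a contrastive objective might look as follows; the modules, shapes, and names here are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGrounding(nn.Module):
    """Illustrative sketch: narration tokens attend over video region features."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, regions, words):
        # regions: (B, R, D) region features; words: (B, W, D) narration embeddings
        # The attention weights over regions act as a spatial grounding map
        attended, weights = self.attn(query=words, key=regions, value=regions)
        v = F.normalize(attended.mean(dim=1), dim=-1)  # pooled grounded representation
        t = F.normalize(words.mean(dim=1), dim=-1)     # pooled narration representation
        return v, t, weights

def contrastive_loss(v, t, temperature=0.07):
    # Matched video/narration pairs are positives; other batch pairs are negatives
    logits = v @ t.T / temperature                     # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, labels)
```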
arXiv Detail & Related papers (2021-10-20T14:45:13Z)
- Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in the video.
We propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time.
We evaluate our system on challenging public benchmark datasets, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)
- Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA).
Our framework achieves state-of-the-art performance on three benchmark datasets.
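For illustration only (assumed shapes and names, not MG-RAFA itself), reference-guided attentive aggregation over a clip can be sketched as scoring each frame against a global reference and taking a weighted sum:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveAggregation(nn.Module):
    """Illustrative sketch: aggregate per-frame features with reference-guided attention."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) per-frame embeddings of one video clip
        reference = frame_feats.mean(dim=1, keepdim=True)  # (B, 1, D) global reference
        ref = reference.expand_as(frame_feats)             # broadcast to every frame
        w = F.softmax(self.score(torch.cat([frame_feats, ref], dim=-1)), dim=1)  # (B, T, 1)
        return (w * frame_feats).sum(dim=1)                # (B, D) clip-level embedding
```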
arXiv Detail & Related papers (2020-03-27T03:49:21Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
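A toy sketch of combining shot-level self-attention with query-aware global attention (hypothetical names; the paper's local windowing and convolutional details are omitted) could be:

```python
import torch
import torch.nn as nn

class QueryAwareShotEncoder(nn.Module):
    """Illustrative sketch: self-attention over shots plus query-conditioned attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, shots, query):
        # shots: (B, S, D) per-shot visual features; query: (B, D) user-query embedding
        local, _ = self.self_attn(shots, shots, shots)  # shots attend to each other
        # The query attends over all shots, producing query-relevant global context
        ctx, relevance = self.query_attn(query.unsqueeze(1), local, local)
        return local + ctx, relevance.squeeze(1)        # (B, S, D) features, (B, S) scores
```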
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.