Towards Robust Referring Video Object Segmentation with Cyclic
Relational Consensus
- URL: http://arxiv.org/abs/2207.01203v3
- Date: Fri, 18 Aug 2023 18:48:33 GMT
- Title: Towards Robust Referring Video Object Segmentation with Cyclic
Relational Consensus
- Authors: Xiang Li, Jinglu Wang, Xiaohao Xu, Xiao Li, Bhiksha Raj, Yan Lu
- Abstract summary: Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression.
Most existing R-VOS methods have a critical assumption: the object referred to must appear in the video.
In this work, we highlight the need for a robust R-VOS model that can handle semantic mismatches.
- Score: 42.14174599341824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Video Object Segmentation (R-VOS) is a challenging task that aims
to segment an object in a video based on a linguistic expression. Most existing
R-VOS methods have a critical assumption: the object referred to must appear in
the video. This assumption, which we refer to as semantic consensus, is often
violated in real-world scenarios, where the expression may be queried against
false videos. In this work, we highlight the need for a robust R-VOS model that
can handle semantic mismatches. Accordingly, we propose an extended task called
Robust R-VOS, which accepts unpaired video-text inputs. We tackle this problem
by jointly modeling the primary R-VOS problem and its dual (text
reconstruction). A structural text-to-text cycle constraint is introduced to
discriminate semantic consensus between video-text pairs and impose it in
positive pairs, thereby achieving multi-modal alignment from both positive and
negative pairs. Our structural constraint effectively addresses the challenge
posed by linguistic diversity, overcoming the limitations of previous methods
that relied on the point-wise constraint. A new evaluation dataset,
R²-Youtube-VOS, is constructed to measure model robustness.
Our model achieves state-of-the-art performance on R-VOS benchmarks,
Ref-DAVIS17 and Ref-Youtube-VOS, and also our
R²-Youtube-VOS dataset.
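The text-to-text cycle idea in the abstract can be illustrated with a short sketch. The snippet below is a hypothetical PyTorch illustration, not the authors' implementation: it reconstructs a sentence embedding from mask-pooled video features (the dual text-reconstruction problem) and scores video-text consensus against the input expression, encouraging agreement on positive pairs and suppressing it on negative pairs. The paper's structural constraint is richer than the plain sentence-level cosine comparison used here, and all module and variable names are assumptions.

```python
# Hypothetical sketch of a text-to-text cycle consensus term (not the authors' code).
# Names (CycleTextConsensus, reconstruct, etc.) are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CycleTextConsensus(nn.Module):
    """Reconstructs a sentence embedding from mask-pooled video features (the dual
    text-reconstruction problem) and scores consensus against the input expression."""

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        # simple stand-in for the text-reconstruction head
        self.reconstruct = nn.Sequential(
            nn.Linear(vis_dim, txt_dim), nn.ReLU(), nn.Linear(txt_dim, txt_dim)
        )

    def forward(self, vis_feats, masks, txt_emb, is_positive):
        # vis_feats:   (B, C, H, W) frame features
        # masks:       (B, 1, H, W) predicted object masks (primary R-VOS output)
        # txt_emb:     (B, D) sentence embedding of the referring expression
        # is_positive: (B,) 1.0 if the expression matches the video, else 0.0
        area = masks.flatten(2).sum(-1).clamp(min=1e-6)          # (B, 1)
        pooled = (vis_feats * masks).flatten(2).sum(-1) / area   # (B, C) mask-pooled features
        rec_txt = self.reconstruct(pooled)                       # reconstructed text embedding

        # sentence-level similarity between reconstructed and input expression
        sim = F.cosine_similarity(rec_txt, txt_emb, dim=-1)      # (B,)

        # impose consensus on positive pairs, suppress it on negative pairs
        pos_loss = (1.0 - sim) * is_positive
        neg_loss = F.relu(sim) * (1.0 - is_positive)
        return (pos_loss + neg_loss).mean()


# toy usage with random tensors
criterion = CycleTextConsensus(vis_dim=256, txt_dim=512)
loss = criterion(
    torch.randn(2, 256, 32, 32),   # frame features
    torch.rand(2, 1, 32, 32),      # predicted masks
    torch.randn(2, 512),           # expression embeddings
    torch.tensor([1.0, 0.0]),      # first pair matches, second does not
)
print(loss.item())
```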
Related papers
- ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations [33.74746234704817]
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description.
We present ReferDINO, an end-to-end RVOS model that inherits strong vision-language understanding from the pretrained visual grounding foundation models.
arXiv Detail & Related papers (2025-01-24T16:24:15Z)
- RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval [61.77760317554826]
We propose Redundancy-aware Video-language Pre-training.
We design a redundancy measurement for video patches and text tokens by calculating the cross-modal minimum dissimilarity.
We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC.
arXiv Detail & Related papers (2022-10-13T10:11:41Z)
- Towards Robust Referring Image Segmentation [80.53860642199412]
Referring Image Segmentation (RIS) is a fundamental vision-language task that outputs object masks based on text descriptions.
We propose a new formulation of RIS, named Robust Referring Image Segmentation (R-RIS).
We create three R-RIS datasets by augmenting existing RIS datasets with negative sentences (see the sketch of this negative-pair construction after this list).
We propose a new transformer-based model, called RefSegformer, with a token-based vision and language fusion module.
arXiv Detail & Related papers (2022-09-20T08:48:26Z)
- Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require compressed video bitstream to be decoded to RGB frames before being segmented.
This may hamper its application in real-world scenarios with limited computing resources, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet), which explicitly enhances dense associations in both inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- End-to-End Dense Video Grounding via Parallel Regression [30.984657885692553]
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG).
Thanks to its simplicity in design, our PRVG framework can be applied in different testing schemes.
arXiv Detail & Related papers (2021-09-23T10:03:32Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
- Hybrid-S2S: Video Object Segmentation with Recurrent Networks and Correspondence Matching [3.9053553775979086]
One-shot Video Object Segmentation (VOS) is the task of tracking an object of interest within a video sequence.
We study an RNN-based architecture and address some of these issues by proposing a hybrid sequence-to-sequence architecture named HS2S.
Our experiments show that augmenting the RNN with correspondence matching is a highly effective solution to reduce the drift problem.
arXiv Detail & Related papers (2020-10-10T19:00:43Z)
- RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation [8.80595950124721]
We analyze the results of a novel neural network that achieves state-of-the-art results for both language-guided image segmentation and language-guided VOS.
Our study indicates that the major challenges for the task are related to understanding motion and static actions.
arXiv Detail & Related papers (2020-10-01T09:10:53Z)
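As referenced in the Towards Robust Referring Image Segmentation entry above, robust benchmarks such as R-RIS and R²-Youtube-VOS evaluate models on inputs whose referred object is absent. The snippet below is a minimal, hypothetical sketch of that negative-pair construction; the dataset layout and field names are assumptions, and the published datasets apply additional manual checks and filtering that are omitted here.

```python
# Hypothetical sketch of negative-pair construction for robustness evaluation
# (not the actual R-RIS / R^2-Youtube-VOS build scripts; field names are assumptions).
import random


def build_robust_eval_set(samples, neg_ratio=1.0, seed=0):
    """samples: list of dicts {"video_id": str, "expression": str} (positive pairs).
    Returns positives plus mismatched pairs whose referent is absent from the video."""
    rng = random.Random(seed)
    out = [dict(s, is_positive=True) for s in samples]
    for s in rng.sample(samples, k=int(len(samples) * neg_ratio)):
        # draw an expression from a different video, so the referred object does not appear
        other = rng.choice([t for t in samples if t["video_id"] != s["video_id"]])
        out.append({"video_id": s["video_id"],
                    "expression": other["expression"],
                    "is_positive": False})
    return out


pairs = build_robust_eval_set([
    {"video_id": "v1", "expression": "a brown dog running on the grass"},
    {"video_id": "v2", "expression": "a person riding a red bike"},
])
print(pairs)
```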