RefVOS: A Closer Look at Referring Expressions for Video Object
Segmentation
- URL: http://arxiv.org/abs/2010.00263v1
- Date: Thu, 1 Oct 2020 09:10:53 GMT
- Title: RefVOS: A Closer Look at Referring Expressions for Video Object
Segmentation
- Authors: Miriam Bellver, Carles Ventura, Carina Silberer, Ioannis Kazakos,
Jordi Torres and Xavier Giro-i-Nieto
- Abstract summary: We analyze the results of RefVOS, a novel neural network that obtains competitive results for language-guided image segmentation and state-of-the-art results for language-guided VOS.
Our study indicates that the major challenges for the task are related to understanding motion and static actions.
- Score: 8.80595950124721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of video object segmentation with referring expressions
(language-guided VOS) is to, given a linguistic phrase and a video, generate
binary masks for the object to which the phrase refers. Our work argues that
existing benchmarks used for this task are mainly composed of trivial cases, in
which referents can be identified with simple phrases. Our analysis relies on a
new categorization of the phrases in the DAVIS-2017 and Actor-Action datasets
into trivial and non-trivial REs, with the non-trivial REs annotated with seven
RE semantic categories. We leverage this data to analyze the results of RefVOS,
a novel neural network that obtains competitive results for the task of
language-guided image segmentation and state of the art results for
language-guided VOS. Our study indicates that the major challenges for the task
are related to understanding motion and static actions.
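As context for how such per-frame binary masks are typically scored on DAVIS-2017 (an illustrative sketch, not code from the paper): the standard region metric J is the Jaccard index, i.e. the intersection-over-union between predicted and ground-truth masks, averaged over frames.
```python
# Illustrative sketch (not from the paper): per-frame region similarity J (IoU)
# between predicted and ground-truth binary masks, as reported on DAVIS-2017.
import numpy as np

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index (IoU) of two boolean masks; defined as 1.0 if both are empty."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, gt).sum()) / float(union)

# Toy example: a 2-frame, 4x4 "video" where the expression refers to the left block.
pred_masks = np.zeros((2, 4, 4), dtype=bool); pred_masks[:, :, :2] = True
gt_masks = np.zeros((2, 4, 4), dtype=bool);   gt_masks[:, :, :3] = True
mean_j = np.mean([region_similarity(p, g) for p, g in zip(pred_masks, gt_masks)])
print(mean_j)  # 0.666..., averaged over frames
```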
Related papers
- VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Segmentation Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z)
- Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes [11.575313825919205]
We introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS).
Ref-AVS seeks to segment objects based on expressions containing multimodal cues.
We propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance.
arXiv Detail & Related papers (2024-07-15T17:54:45Z)
- MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions [93.35942025232943]
We propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments.
The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms.
arXiv Detail & Related papers (2023-08-16T17:58:34Z)
- Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z)
- Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus [42.14174599341824]
Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression.
Most existing R-VOS methods have a critical assumption: the object referred to must appear in the video.
In this work, we highlight the need for a robust R-VOS model that can handle semantic mismatches.
arXiv Detail & Related papers (2022-07-04T05:08:09Z)
- The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment, across all frames of a given video, the object instance referred to by a language expression.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference (a generic sketch of the latter appears after this list).
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
- Boundary Knowledge Translation based Reference Semantic Segmentation [62.60078935335371]
We introduce a Reference segmentation Network (Ref-Net) to conduct visual boundary knowledge translation.
Inspired by the human recognition mechanism, the reference segmentation module (RSM) is devised to segment only objects of the same category, based on the features of the reference objects.
With tens of finely-grained annotated samples as guidance, Ref-Net achieves results on par with fully supervised methods on six datasets.
arXiv Detail & Related papers (2021-08-01T07:40:09Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
- ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation [47.7867284770227]
Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos.
We introduce a novel top-down approach that imitates how humans segment an object given language guidance.
Our method outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-03-19T09:31:08Z)
- Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)
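As a generic illustration of the test-time augmentation trick mentioned in the second-place ReferFormer entry above (this is not the challenge submission's code; `dummy_model` and its interface are hypothetical stand-ins), mask logits can be averaged over the original and a horizontally flipped frame:
```python
# Hypothetical sketch of test-time augmentation (TTA) by horizontal flipping.
# `dummy_model` stands in for any referring segmentation model that maps
# (frame, expression) -> per-pixel mask logits; it is not a real API.
import numpy as np

def dummy_model(frame: np.ndarray, expression: str) -> np.ndarray:
    """Stand-in model: returns random per-pixel logits of shape (H, W)."""
    rng = np.random.default_rng(abs(hash(expression)) % (2**32))
    return rng.normal(size=frame.shape[:2])

def predict_with_flip_tta(model, frame: np.ndarray, expression: str) -> np.ndarray:
    """Average logits over the original and the horizontally flipped frame."""
    logits = model(frame, expression)
    flipped = model(frame[:, ::-1], expression)[:, ::-1]  # flip prediction back
    return (logits + flipped) / 2.0

frame = np.zeros((480, 854, 3), dtype=np.uint8)  # one DAVIS-resolution frame
mask = predict_with_flip_tta(dummy_model, frame, "the dog jumping over the fence") > 0
print(mask.shape)  # (480, 854) boolean mask for this frame
```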