1st Place Solution for YouTubeVOS Challenge 2022: Referring Video Object
Segmentation
- URL: http://arxiv.org/abs/2212.14679v1
- Date: Tue, 27 Dec 2022 09:22:45 GMT
- Title: 1st Place Solution for YouTubeVOS Challenge 2022: Referring Video Object
Segmentation
- Authors: Zhiwei Hu, Bo Chen, Yuan Gao, Zhilong Ji, Jinfeng Bai
- Abstract summary: We improve the one-stage method ReferFormer to obtain mask sequences strongly correlated with the language descriptions.
We leverage the superior performance of a video object segmentation model to further enhance the quality and temporal consistency of the mask results.
Our single model reaches 70.3 J&F on the Referring Youtube-VOS validation set and 63.0 on the test set; an ensemble ranks 1st place in the CVPR2022 Referring Youtube-VOS challenge.
- Score: 12.100628128028385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of referring video object segmentation (RVOS) aims to segment the
object in the frames of a given video to which a referring expression refers. Previous
methods adopt multi-stage approaches and design complex pipelines to obtain
promising results. Recently, end-to-end methods based on Transformers have
proved their superiority. In this work, we draw on the advantages of both
kinds of methods to provide a simple and effective pipeline for RVOS. First, we
improve the state-of-the-art one-stage method ReferFormer to obtain mask
sequences that are strongly correlated with the language descriptions. Second,
based on a reliable and high-quality keyframe, we leverage the superior
performance of a video object segmentation model to further enhance the quality
and temporal consistency of the mask results. Our single model reaches 70.3
J&F on the Referring Youtube-VOS validation set and 63.0 on the test set. After
ensembling, we achieve 64.1 on the final leaderboard, ranking 1st place in the
CVPR2022 Referring Youtube-VOS challenge. Code will be available at
https://github.com/Zhiweihhh/cvpr2022-rvos-challenge.git.
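The two-stage pipeline in the abstract (per-frame referring masks first, then re-segmentation of the whole clip from one reliable keyframe) can be sketched as a toy program. This is a minimal illustration only: the function names and the scoring/propagation logic are stand-ins for ReferFormer and a real VOS model, not the authors' implementation.

```python
def select_keyframe(scores):
    """Pick the frame whose referring mask the model is most confident about."""
    return max(range(len(scores)), key=scores.__getitem__)

def propagate(masks, keyframe_idx):
    """Toy stand-in for semi-supervised VOS: reuse the keyframe mask everywhere.
    A real VOS model would track the object as its appearance changes."""
    return [masks[keyframe_idx] for _ in masks]

def rvos_pipeline(per_frame_masks, per_frame_scores):
    """Stage 1 outputs language-correlated masks with confidences;
    Stage 2 rebuilds the sequence from the single most reliable frame
    to improve temporal consistency."""
    k = select_keyframe(per_frame_scores)
    return k, propagate(per_frame_masks, k)

# Toy example: 4 frames, binary masks as flat pixel lists.
masks = [[0, 1], [1, 1], [0, 0], [1, 0]]
scores = [0.2, 0.9, 0.1, 0.5]
k, refined = rvos_pipeline(masks, scores)
# Frame 1 has the highest score, so its mask seeds the whole sequence.
```

The point of the sketch is the division of labor: the referring model only has to be right on one frame, and the VOS model supplies the temporal consistency.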
Related papers
- 1st Place Solution for 5th LSVOS Challenge: Referring Video Object
Segmentation [65.45702890457046]
We integrate the strengths of leading RVOS models to build an effective paradigm.
To improve the consistency and quality of masks, we propose a Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st place in track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
arXiv Detail & Related papers (2024-01-01T04:24:48Z) - Fully Transformer-Equipped Architecture for End-to-End Referring Video
Object Segmentation [24.814534011440877]
We propose an end-to-end RVOS framework which treats the RVOS task as a mask sequence learning problem.
To capture the object-level spatial context, we have developed the Stacked Transformer.
The model finds the best matching between mask sequence and text query.
arXiv Detail & Related papers (2023-09-21T09:47:47Z) - Learning Cross-Modal Affinity for Referring Video Object Segmentation
Targeting Limited Samples [61.66967790884943]
Referring video object segmentation (RVOS) relies on sufficient data for a given scene.
In more realistic scenarios, only minimal annotations are available for a new scene.
We propose a model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture.
The CMA module builds multimodal affinity from a few samples, quickly learning new semantic information and enabling the model to adapt to different scenarios.
arXiv Detail & Related papers (2023-09-05T08:34:23Z) - Multi-Attention Network for Compressed Video Referring Object
Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require the compressed video bitstream to be decoded to RGB frames before segmentation.
This may hamper their application in real-world, resource-limited scenarios such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z) - 5th Place Solution for YouTube-VOS Challenge 2022: Video Object
Segmentation [4.004851693068654]
Video object segmentation (VOS) has made significant progress with the rise of deep learning.
Similar objects are easily confused and tiny objects are difficult to find.
We propose a simple yet effective solution for this task.
arXiv Detail & Related papers (2022-06-20T06:14:27Z) - Language as Queries for Referring Video Object Segmentation [23.743637144137498]
Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred by a language expression in all video frames.
In this work, we propose a simple and unified framework built upon Transformer, termed ReferFormer.
It views the language as queries and directly attends to the most relevant regions in the video frames.
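The "language as queries" idea in ReferFormer can be illustrated with a single-head cross-attention step where the query is a text embedding and the keys/values are frame features. This is a bare-bones sketch of the mechanism, not ReferFormer's architecture; the dimensions and vectors below are made up for illustration.

```python
import math

def cross_attention(query, keys, values):
    """One query vector attends over key/value pairs (single attention head).
    With a language embedding as the query and frame features as keys/values,
    the text directly selects the most relevant regions."""
    d = len(query)
    # Scaled dot-product scores between the text query and each region.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Numerically stable softmax over the regions.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Output is the attention-weighted mix of region values.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# Toy example: the language query aligns with the second region's key,
# so the output is dominated by that region's value.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [5.0, 0.0], [0.0, -1.0]]
values = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
out = cross_attention(query, keys, values)
```

Because the softmax concentrates on the best-matching key, the language query effectively "points at" the referred region without any separate grounding stage.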
arXiv Detail & Related papers (2022-01-03T05:54:00Z) - Rethinking Cross-modal Interaction from a Top-down Perspective for
Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z) - A Transductive Approach for Video Object Segmentation [55.83842083823267]
Semi-supervised video object segmentation aims to separate a target object from a video sequence, given the mask in the first frame.
Most current prevailing methods utilize information from additional modules trained in other domains, such as optical flow and instance segmentation.
We propose a simple yet strong transductive method, in which additional modules, datasets, and dedicated architectural designs are not needed.
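The transductive idea above (labels spread by feature similarity alone, with no extra modules) can be sketched as nearest-neighbor label propagation from the annotated first frame. This toy uses 1-D "features" and plain nearest-neighbor matching; the actual method uses a richer spatio-temporal affinity, but the principle is the same.

```python
def nearest_label(feature, ref_feats, ref_labels):
    """Copy the label of the most similar reference feature (squared L2 distance)."""
    best = min(range(len(ref_feats)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(feature, ref_feats[i])))
    return ref_labels[best]

def transductive_vos(frames, first_frame_labels):
    """Toy transductive propagation: every pixel in every frame takes the
    label of its nearest first-frame pixel in feature space. No optical flow,
    no instance segmentation, no dedicated architecture."""
    ref_feats = frames[0]
    return [[nearest_label(f, ref_feats, first_frame_labels) for f in frame]
            for frame in frames]

# Toy example: object pixels have feature ~1.0, background ~0.0.
frames = [[[0.1], [0.9]],   # frame 0 (annotated)
          [[0.95], [0.05]]] # frame 1 (object and background swap positions)
labels0 = [0, 1]
out = transductive_vos(frames, labels0)
# The object label follows the high-valued feature even after it moves.
```

The appeal of the transductive view is exactly what this toy shows: segmentation quality comes from the feature space, not from auxiliary models.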
arXiv Detail & Related papers (2020-04-15T16:39:36Z) - Learning What to Learn for Video Object Segmentation [157.4154825304324]
We introduce an end-to-end trainable VOS architecture that integrates a differentiable few-shot learning module.
This internal learner is designed to predict a powerful parametric model of the target.
We set a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5.
arXiv Detail & Related papers (2020-03-25T17:58:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.