1st Place Solution for 5th LSVOS Challenge: Referring Video Object
Segmentation
- URL: http://arxiv.org/abs/2401.00663v1
- Date: Mon, 1 Jan 2024 04:24:48 GMT
- Title: 1st Place Solution for 5th LSVOS Challenge: Referring Video Object
Segmentation
- Authors: Zhuoyan Luo, Yicheng Xiao, Yong Liu, Yitong Wang, Yansong Tang, Xiu
Li, Yujiu Yang
- Abstract summary: We integrate the strengths of leading RVOS models to build an effective paradigm.
To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st place on track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent transformer-based models have dominated the Referring Video
Object Segmentation (RVOS) task due to their superior performance. Most prior
works adopt a unified DETR framework to generate segmentation masks in a
query-to-instance manner. In this work, we integrate the strengths of leading
RVOS models to build an effective paradigm. We first obtain binary mask
sequences from the RVOS models. To improve the consistency and quality of the
masks, we propose a Two-Stage Multi-Model Fusion strategy. Each stage rationally
ensembles RVOS models based on framework design as well as training strategy,
and leverages different video object segmentation (VOS) models to enhance mask
coherence through an object propagation mechanism. Our method achieves 75.7%
J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking
1st place on track 3 of the 5th Large-scale Video Object Segmentation Challenge
(ICCV 2023). Code is available at https://github.com/RobertLuo1/iccv2023_RVOS_Challenge.
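The fusion described in the abstract can be sketched in miniature. This is a minimal illustration only, not the authors' released implementation: it assumes stage one can be approximated by a per-pixel majority vote over the binary mask sequences from several RVOS models, and stands in for the VOS-based object propagation of stage two with a simple temporal-consistency pass. All function names and shapes here are hypothetical.

```python
# Hedged sketch of a two-stage multi-model fusion pipeline.
# Assumptions (not from the paper's code): K model outputs stacked as a
# (K, T, H, W) array of {0, 1} masks; stage two is approximated by a
# neighbour-frame consistency rule rather than a real VOS propagator.
import numpy as np

def fuse_masks(mask_seqs: np.ndarray) -> np.ndarray:
    """Stage 1: pixel-wise majority vote over K binary mask sequences.

    mask_seqs: (K, T, H, W) array with values in {0, 1}.
    Returns a fused (T, H, W) binary sequence.
    """
    votes = np.sum(mask_seqs, axis=0)                 # (T, H, W) vote counts
    k = mask_seqs.shape[0]
    return (2 * votes >= k).astype(np.uint8)          # majority (ties -> 1)

def smooth_temporal(masks: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: a pixel becomes foreground if it is foreground in
    the current frame, or in both neighbouring frames, mimicking the
    coherence gained from object propagation across frames."""
    prev_ = np.roll(masks, 1, axis=0)
    next_ = np.roll(masks, -1, axis=0)
    prev_[0] = masks[0]                               # clamp sequence edges
    next_[-1] = masks[-1]
    return np.maximum(masks, prev_ & next_).astype(np.uint8)
```

In the actual method each stage ensembles specific RVOS models chosen by framework design and training strategy, and a dedicated VOS model propagates objects; the vote-then-smooth pair above only conveys the overall two-stage shape of the pipeline.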
Related papers
- OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework (2024-03-13)
  OneVOS is a novel framework that unifies the core components of VOS with an All-in-One Transformer. OneVOS achieves state-of-the-art performance across 7 datasets, particularly excelling on the complex LVOS and MOSE datasets with 70.1% and 66.4% J&F, surpassing previous state-of-the-art methods by 4.2% and 7.0%, respectively.
- Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation (2023-09-21)
  We propose an end-to-end RVOS framework that treats the RVOS task as a mask sequence learning problem. To capture the object-level spatial context, we develop the Stacked Transformer. The model finds the best matching between the mask sequence and the text query.
- Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples (2023-09-05)
  Referring video object segmentation (RVOS) relies on sufficient data for a given scene, but in more realistic scenarios only minimal annotations are available for a new scene. We propose a model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture. The CMA module builds multimodal affinity from a few samples, quickly learning new semantic information and enabling the model to adapt to different scenarios.
- 1st Place Solution for YouTubeVOS Challenge 2022: Referring Video Object Segmentation (2022-12-27)
  We improve the one-stage method ReferFormer to obtain mask sequences strongly correlated with language descriptions, and leverage the superior performance of a video object segmentation model to further enhance the quality and temporal consistency of the mask results. Our single model reaches 70.3 J&F on the Referring Youtube-VOS validation set and 63.0 on the test set, ranking 1st place in the CVPR 2022 Referring Youtube-VOS challenge.
- Segmenting Moving Objects via an Object-Centric Layered Representation (2022-07-05)
  We introduce an object-centric segmentation model with a depth-ordered layer representation, together with a scalable pipeline for generating synthetic training data with multiple objects. We evaluate the model on standard video segmentation benchmarks.
- Scalable Video Object Segmentation with Identification Mechanism (2022-03-22)
  This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS). We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST). Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation (2021-06-02)
  Referring video object segmentation (RVOS) aims to segment video objects with the guidance of a natural language reference. Previous methods typically tackle RVOS by directly grounding the linguistic reference over the image lattice. In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
- ALBA: Reinforcement Learning for Video Object Segmentation (2020-05-26)
  We consider the challenging problem of zero-shot video object segmentation (VOS). We treat this as a grouping problem by exploiting object proposals and making a joint inference about grouping over both space and time. We show that the proposed method, which we call ALBA, outperforms the previous state-of-the-art on three benchmarks.
- Learning Fast and Robust Target Models for Video Object Segmentation (2020-02-27)
  Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test time. Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame rates and a risk of overfitting. We propose a novel VOS architecture consisting of two network components.
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.