UNINEXT-Cutie: The 1st Solution for LSVOS Challenge RVOS Track
- URL: http://arxiv.org/abs/2408.10129v2
- Date: Sat, 24 Aug 2024 13:09:26 GMT
- Title: UNINEXT-Cutie: The 1st Solution for LSVOS Challenge RVOS Track
- Authors: Hao Fang, Feiyu Pan, Xiankai Lu, Wei Zhang, Runmin Cong,
- Abstract summary: We finetune the RVOS model to obtain mask sequences correlated with language descriptions.
We leverage a VOS model to enhance the quality and temporal consistency of the mask results.
Our solution achieved 62.57 J&F on the MeViS test set and ranked 1st in the 6th LSVOS Challenge RVOS Track.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video. This year, the LSVOS Challenge RVOS Track replaced the original YouTube-RVOS benchmark with MeViS. MeViS focuses on referring to the target object in a video through its motion descriptions instead of static attributes, posing a greater challenge to the RVOS task. In this work, we integrate the strengths of leading RVOS and VOS models to build a simple and effective pipeline for RVOS. First, we finetune the state-of-the-art RVOS model to obtain mask sequences that are correlated with language descriptions. Second, based on reliable, high-quality key frames, we leverage a VOS model to enhance the quality and temporal consistency of the mask results. Finally, we further improve the performance of the RVOS model using semi-supervised learning. Our solution achieved 62.57 J&F on the MeViS test set and ranked 1st in the 6th LSVOS Challenge RVOS Track.
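The abstract's second stage (selecting a reliable key frame, then letting a VOS model redo the masks from it) can be sketched as plain code. This is a hypothetical illustration: `vos_propagate` is a stand-in for the actual VOS model (e.g. Cutie), and scoring key frames by mask confidence is one plausible criterion; the paper's exact selection rule is not reproduced here.

```python
# Hedged sketch of the key-frame + VOS-propagation stage described above.
# `vos_propagate` stands in for the real VOS model (e.g. Cutie); only the
# key-frame selection and orchestration logic are shown.

def select_key_frame(confidences):
    """Pick the frame whose RVOS mask has the highest confidence score.

    The paper starts VOS propagation from a reliable, high-quality key
    frame; mean mask confidence is one simple criterion for choosing it.
    """
    return max(range(len(confidences)), key=lambda i: confidences[i])

def refine_with_vos(masks, confidences, vos_propagate):
    """Replace per-frame RVOS masks with VOS propagation from the key frame."""
    k = select_key_frame(confidences)
    # Propagate the key-frame mask through the whole video.
    return vos_propagate(masks[k], k, len(masks))

# Toy usage with a dummy propagator that just copies the key-frame mask.
masks = [[0, 1], [1, 1], [0, 0]]     # per-frame binary masks (flattened)
confidences = [0.4, 0.9, 0.2]        # per-frame mask confidence
dummy = lambda m, k, n: [m] * n      # stand-in for the VOS model
print(select_key_frame(confidences))            # 1
print(refine_with_vos(masks, confidences, dummy))
```

A real propagator would run both forward and backward from frame `k`; the dummy here only demonstrates the control flow.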
Related papers
- LSVOS Challenge Report: Large-scale Complex and Long Video Object Segmentation [124.50550604020684]
This paper introduces the 6th Large-scale Video Object Segmentation (LSVOS) challenge, held in conjunction with the ECCV 2024 workshop.
This year's challenge includes two tasks: Video Object Segmentation (VOS) and Referring Video Object Segmentation (RVOS).
This year's challenge attracted 129 registered teams from more than 20 institutes in over 8 countries.
arXiv Detail & Related papers (2024-09-09T17:45:45Z)
- Discriminative Spatial-Semantic VOS Solution: 1st Place Solution for 6th LSVOS [68.47681139026666]
Video object segmentation (VOS) is a crucial task in computer vision.
Current VOS methods struggle with complex scenes and prolonged object motions.
This report introduces a discriminative spatial-temporal VOS model.
arXiv Detail & Related papers (2024-08-29T10:47:17Z)
- 1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [81.50620771207329]
We investigate the effectiveness of static-dominant data and frame sampling on referring video object segmentation (RVOS).
Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge.
arXiv Detail & Related papers (2024-06-11T08:05:26Z)
- LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation [29.07092353094942]
Video object segmentation (VOS) aims to distinguish and track target objects in a video.
Existing benchmarks mainly focus on short-term videos, where objects remain visible most of the time.
We propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations.
Videos in LVOS last 1.14 minutes on average, approximately 5 times longer than videos in existing datasets.
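The LVOS figures quoted above are internally consistent, which a quick arithmetic check shows: 296,401 frames over 720 videos is about 412 frames per video, and at 1.14 minutes per video that implies roughly 6 fps (the frame rate is inferred here, not stated in the summary).

```python
# Consistency check on the LVOS statistics quoted above.
frames, videos = 296_401, 720
avg_minutes = 1.14

frames_per_video = frames / videos                 # ~411.7 frames per video
implied_fps = frames_per_video / (avg_minutes * 60)  # ~6 frames per second

print(round(frames_per_video, 1))  # 411.7
print(round(implied_fps, 1))       # 6.0
```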
arXiv Detail & Related papers (2024-04-30T07:50:29Z)
- 1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation [65.45702890457046]
We integrate strengths of leading RVOS models to build up an effective paradigm.
To improve the consistency and quality of masks, we propose Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
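The Two-Stage Multi-Model Fusion strategy mentioned above is not detailed in this summary; as one simple variant of fusing several models' predictions, per-pixel majority voting over binary masks can be sketched like this (the actual strategy in the report may differ).

```python
# Hedged sketch: fuse binary masks from several models by per-pixel
# majority vote. This is an assumed, simple fusion rule, not the
# report's actual Two-Stage Multi-Model Fusion strategy.

def fuse_masks(masks):
    """Fuse equally-sized binary masks (one per model) by majority vote."""
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[1 if sum(m[r][c] for m in masks) * 2 > n else 0
             for c in range(cols)]
            for r in range(rows)]

# Three models' 2x2 masks; a pixel is kept if at least 2 of 3 agree.
a = [[1, 1], [0, 0]]
b = [[1, 0], [0, 1]]
c = [[1, 0], [1, 1]]
print(fuse_masks([a, b, c]))  # [[1, 0], [0, 1]]
```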
arXiv Detail & Related papers (2024-01-01T04:24:48Z)
- 1st Place Solution for the 5th LSVOS Challenge: Video Instance Segmentation [25.587080499097425]
We present further improvements to the SOTA VIS method, DVIS.
We introduce a denoising training strategy for the trainable tracker, allowing it to achieve more stable and accurate object tracking in complex and long videos.
Our method achieves 57.9 AP and 56.0 AP in the development and test phases, respectively, and ranked 1st in the VIS track of the 5th LSVOS Challenge.
arXiv Detail & Related papers (2023-08-28T08:15:43Z)
- Scalable Video Object Segmentation with Simplified Framework [21.408446548059956]
This paper presents a scalable VOS (SimVOS) framework to perform joint feature extraction and matching.
SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features.
Experimentally, our SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks.
arXiv Detail & Related papers (2023-08-19T04:30:48Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
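The tracklet-language grounding step above (score each candidate tracklet against the language reference, keep the best match) can be sketched in a few lines. Cosine similarity over feature vectors stands in for the Transformer-based grounding module, which is not reproduced here.

```python
# Hedged sketch of tracklet-language grounding: pick the object tracklet
# whose feature vector best matches the language feature. Cosine
# similarity is an assumed stand-in for the actual Transformer module.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ground_tracklet(tracklet_feats, lang_feat):
    """Return the index of the tracklet best matching the expression."""
    return max(range(len(tracklet_feats)),
               key=lambda i: cosine(tracklet_feats[i], lang_feat))

# Toy example: the second tracklet is closer to the language feature.
print(ground_tracklet([[1.0, 0.0], [0.0, 1.0]], [0.1, 0.9]))  # 1
```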
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
- Learning Video Object Segmentation from Unlabeled Videos [158.18207922363783]
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos.
We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures properties of VOS at multiple granularities.
arXiv Detail & Related papers (2020-03-10T22:12:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.