1st Place Solution for the 5th LSVOS Challenge: Video Instance
Segmentation
- URL: http://arxiv.org/abs/2308.14392v1
- Date: Mon, 28 Aug 2023 08:15:43 GMT
- Title: 1st Place Solution for the 5th LSVOS Challenge: Video Instance
Segmentation
- Authors: Tao Zhang, Xingye Tian, Yikang Zhou, Yu Wu, Shunping Ji, Cilin Yan,
Xuebo Wang, Xin Tao, Yuan Zhang, Pengfei Wan
- Abstract summary: We present further improvements to the SOTA VIS method, DVIS.
We introduce a denoising training strategy for the trainable tracker, allowing it to achieve more stable and accurate object tracking in complex and long videos.
Our method achieves 57.9 AP and 56.0 AP in the development and test phases, respectively, and ranked 1st in the VIS track of the 5th LSVOS Challenge.
- Score: 25.587080499097425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video instance segmentation is a challenging task that serves as the
cornerstone of numerous downstream applications, including video editing and
autonomous driving. In this report, we present further improvements to the SOTA
VIS method, DVIS. First, we introduce a denoising training strategy for the
trainable tracker, allowing it to achieve more stable and accurate object
tracking in complex and long videos. Additionally, we explore the role of
visual foundation models in video instance segmentation. By utilizing a frozen
ViT-L model pre-trained with DINOv2, DVIS demonstrates remarkable performance
improvements. With these enhancements, our method achieves 57.9 AP and 56.0 AP
in the development and test phases, respectively, and ultimately ranked 1st in
the VIS track of the 5th LSVOS Challenge. The code will be available at
https://github.com/zhang-tao-whu/DVIS.
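The denoising training idea can be pictured with a small sketch: during training, the tracker receives deliberately corrupted object queries from the previous frame but is supervised with the clean identities, so it learns to recover from association errors. The noise types and function name below are illustrative assumptions, not DVIS's actual implementation.

```python
import random

def add_denoising_noise(queries, swap_prob=0.3, noise_std=0.1, rng=None):
    """Perturb per-object tracking queries for denoising training.

    `queries` is a list of feature vectors (one per tracked object).
    Two noise types, loosely following the denoising idea in the report
    (the exact recipe is an assumption): random identity swaps, which
    simulate tracker confusion between similar objects, and additive
    Gaussian jitter, which simulates feature drift over long videos.
    """
    rng = rng or random.Random()
    noisy = [list(q) for q in queries]
    # Randomly swap pairs of object queries (identity confusion).
    for i in range(len(noisy)):
        if len(noisy) > 1 and rng.random() < swap_prob:
            j = rng.randrange(len(noisy))
            noisy[i], noisy[j] = noisy[j], noisy[i]
    # Add Gaussian jitter to every feature dimension (feature drift).
    return [[x + rng.gauss(0.0, noise_std) for x in q] for q in noisy]
```

During training the tracker would consume these noisy queries while the loss is still computed against the clean identity assignments, which is what makes the tracking more robust on long, complex videos.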
Related papers
- CSS-Segment: 2nd Place Report of LSVOS Challenge VOS Track [35.70400178294299]
We introduce the solution of our team "yuanjie" for video object segmentation in the 6th LSVOS Challenge VOS Track at ECCV 2024.
We believe that our proposed CSS-Segment will perform better on videos with complex object motion and long-term presentation.
Our method achieved a J&F score of 80.84 in the test phase and ranked 2nd in the 6th LSVOS Challenge VOS Track at ECCV 2024.
arXiv Detail & Related papers (2024-08-24T13:47:56Z)
- UNINEXT-Cutie: The 1st Solution for LSVOS Challenge RVOS Track [28.52754012142431]
We finetune RVOS model to obtain mask sequences correlated with language descriptions.
We leverage VOS model to enhance the quality and temporal consistency of the mask results.
Our solution achieved 62.57 J&F on the MeViS test set and ranked 1st in the 6th LSVOS Challenge RVOS Track.
arXiv Detail & Related papers (2024-08-19T16:15:56Z)
- 1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [81.50620771207329]
We investigate the effectiveness of static-dominant data and frame sampling on referring video object segmentation (RVOS).
Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge.
arXiv Detail & Related papers (2024-06-11T08:05:26Z)
- 1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation [65.45702890457046]
We integrate strengths of leading RVOS models to build up an effective paradigm.
To improve the consistency and quality of masks, we propose Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-YouTube-VOS validation set and 70% J&F on the test set, ranking 1st on Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
arXiv Detail & Related papers (2024-01-01T04:24:48Z)
- VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation [87.13210748484217]
VideoCutLER is a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos.
We show the first competitive unsupervised learning results on the challenging YouTubeVIS 2019 benchmark, achieving 50.7% APvideo50.
VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS 2019 in terms of APvideo.
arXiv Detail & Related papers (2023-08-28T17:10:12Z)
- 1st Place Solution for PVUW Challenge 2023: Video Panoptic Segmentation [25.235404527487784]
Video panoptic segmentation is a challenging task that serves as the cornerstone of numerous downstream applications.
We believe that the decoupling strategy proposed by DVIS enables more effective utilization of temporal information for both "thing" and "stuff" objects.
Our method achieved a VPQ score of 51.4 and 53.7 in the development and test phases, respectively, and ranked 1st in the VPS track of the 2nd PVUW Challenge.
arXiv Detail & Related papers (2023-06-07T01:24:48Z)
- The Runner-up Solution for YouTube-VIS Long Video Challenge 2022 [72.13080661144761]
We adopt the previously proposed online video instance segmentation method IDOL for this challenge.
We use pseudo labels to aid contrastive learning, yielding more temporally consistent instance embeddings.
The proposed method obtains 40.2 AP on the YouTube-VIS 2022 long video dataset and was ranked second in this challenge.
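One way to picture the pseudo-label step: match each instance embedding in the current frame to its most similar embedding in the previous frame, and use the match as a pseudo identity label for the contrastive loss. This is a sketch under assumed details; the function names, cosine matching, and threshold below are illustrative, not IDOL's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def pseudo_labels(prev_embeds, curr_embeds, threshold=0.5):
    """Assign each current-frame instance the index of its most similar
    previous-frame instance; below `threshold`, mark it as new (-1)."""
    labels = []
    for c in curr_embeds:
        sims = [cosine(c, p) for p in prev_embeds]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else -1
        labels.append(best if sims and sims[best] >= threshold else -1)
    return labels
```

Embeddings sharing a pseudo label across frames would then be pulled together by the contrastive loss, while mismatched pairs are pushed apart, encouraging temporally consistent instance embeddings.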
arXiv Detail & Related papers (2022-11-18T01:40:59Z)
- 5th Place Solution for YouTube-VOS Challenge 2022: Video Object Segmentation [4.004851693068654]
Video object segmentation (VOS) has made significant progress with the rise of deep learning.
Similar objects are easily confused and tiny objects are difficult to find.
We propose a simple yet effective solution for this task.
arXiv Detail & Related papers (2022-06-20T06:14:27Z)
- Unsupervised Domain Adaptation for Video Semantic Segmentation [91.30558794056054]
Unsupervised Domain Adaptation for semantic segmentation has gained immense popularity since it can transfer knowledge from simulation to real.
In this work, we present a new video extension of this task, namely Unsupervised Domain Adaptation for Video Semantic Segmentation.
We show that our proposals significantly outperform previous image-based UDA methods both on image-level (mIoU) and video-level (VPQ) evaluation metrics.
arXiv Detail & Related papers (2021-07-23T07:18:20Z)
- TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model [51.14840210957289]
Multi-object tracking is a fundamental vision problem that has been studied for a long time.
Despite the success of Tracking by Detection (TBD), this two-step method is too complicated to train in an end-to-end manner.
We propose a concise end-to-end model, TubeTK, which needs only one-step training, by introducing the "bounding-tube" to indicate the temporal-spatial locations of objects in a short video clip.
arXiv Detail & Related papers (2020-06-10T06:45:05Z)
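The "bounding-tube" above can be pictured as a per-clip container holding one box per frame of a short video clip. A minimal data-structure sketch follows; the class and field names are assumptions for illustration, not TubeTK's actual code.

```python
from dataclasses import dataclass

@dataclass
class BoundingTube:
    """A spatio-temporal 'tube': one axis-aligned box per frame of a clip."""
    start_frame: int
    boxes: list  # list of (x1, y1, x2, y2), one per consecutive frame

    @property
    def end_frame(self):
        """Absolute index of the last frame covered by this tube."""
        return self.start_frame + len(self.boxes) - 1

    def box_at(self, frame):
        """Return the box at an absolute frame index, or None if outside."""
        if self.start_frame <= frame <= self.end_frame:
            return self.boxes[frame - self.start_frame]
        return None
```

Predicting one such tube per object in a single pass is what lets a model like TubeTK couple detection and short-term association in one training step, instead of the two-stage tracking-by-detection pipeline.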
This list is automatically generated from the titles and abstracts of the papers in this site.