The Second Place Solution for The 4th Large-scale Video Object
Segmentation Challenge--Track 3: Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2206.12035v1
- Date: Fri, 24 Jun 2022 02:15:06 GMT
- Title: The Second Place Solution for The 4th Large-scale Video Object
Segmentation Challenge--Track 3: Referring Video Object Segmentation
- Authors: Leilei Cao, Zhuang Li, Bo Yan, Feng Zhang, Fengliang Qi, Yuchen Hu and
Hongbin Wang
- Abstract summary: Referring video object segmentation (RVOS) aims to segment, in all frames of a video, the object instances referred to by a language expression.
This work proposes several tricks to further boost the strong ReferFormer baseline, including cyclical learning rates, a semi-supervised approach, and test-time augmentation inference.
The improved ReferFormer ranks 2nd on the CVPR 2022 Referring YouTube-VOS Challenge.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The referring video object segmentation (RVOS) task aims to segment the
object instances referred to by a language expression across all frames of a
given video. Because it requires understanding cross-modal semantics for
individual instances, this task is more challenging than traditional
semi-supervised video object segmentation, where the ground-truth object masks
in the first frame are given. With the great success of Transformers in object
detection and object segmentation, remarkable progress has been made in RVOS,
with ReferFormer achieving state-of-the-art performance. In this work,
building on the strong ReferFormer baseline, we propose several tricks to
further boost performance, including cyclical learning rates, a
semi-supervised approach, and test-time augmentation inference. The improved
ReferFormer ranks 2nd on the CVPR 2022 Referring YouTube-VOS Challenge.
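The tricks named in the abstract are standard techniques. As a minimal illustrative sketch (not the authors' code), the snippet below shows a cyclical learning-rate schedule using PyTorch's built-in CyclicLR scheduler, plus horizontal-flip test-time augmentation on per-frame mask logits; the stand-in model, learning-rate bounds, and step size are hypothetical placeholders, not the challenge entry's actual values.

```python
# Minimal sketch of two of the listed tricks, assuming a PyTorch
# segmentation model that maps a clip of frames (T, 3, H, W) to
# per-frame mask logits (T, 1, H, W). All names and hyperparameters
# below are hypothetical placeholders.
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=1)  # stand-in for ReferFormer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)

# Cyclical learning rate: oscillates between base_lr and max_lr,
# completing one triangular cycle every 2 * step_size_up steps.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-5, max_lr=1e-4,
    step_size_up=2000, mode="triangular")

@torch.no_grad()
def tta_segment(frames: torch.Tensor) -> torch.Tensor:
    """Horizontal-flip test-time augmentation: average mask logits
    from the original and mirrored clip, then threshold. A common
    TTA scheme; the paper may combine additional augmentations."""
    logits = model(frames)                          # (T, 1, H, W)
    flipped = model(torch.flip(frames, dims=[-1]))  # mirror along width
    merged = (logits + torch.flip(flipped, dims=[-1])) / 2
    return (merged.sigmoid() > 0.5).float()         # binary masks
```

In training, `scheduler.step()` would be called after each optimizer step so the learning rate cycles per iteration rather than per epoch.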
Related papers
- VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS), which aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z)
- 2nd Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [8.20168024462357]
Motion Expression guided Video Segmentation is a challenging task that aims to segment objects in a video based on natural language expressions with motion descriptions.
We introduce mask information obtained from a video instance segmentation model as a prior for temporal enhancement and employ SAM for spatial refinement.
Our method achieved 49.92 J&F in the validation phase and 54.20 J&F in the test phase, securing 2nd place in the MeViS Track at the CVPR 2024 PVUW Challenge.
arXiv Detail & Related papers (2024-06-20T02:16:23Z)
- 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z)
- MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions [93.35942025232943]
We propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments.
The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms.
arXiv Detail & Related papers (2023-08-16T17:58:34Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Local-Global Context Aware Transformer for Language-Guided Video Segmentation [103.35509224722097]
We explore the task of language-guided video segmentation (LVS).
We present Locater, which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner.
To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, built upon the A2D-S dataset.
arXiv Detail & Related papers (2022-03-18T07:35:26Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
- ALBA: Reinforcement Learning for Video Object Segmentation [11.29255792513528]
We consider the challenging problem of zero-shot video object segmentation (VOS).
We treat this as a grouping problem by exploiting object proposals and making a joint inference about grouping over both space and time.
We show that the proposed method, which we call ALBA, outperforms the previous state-of-the-art on three benchmarks.
arXiv Detail & Related papers (2020-05-26T20:57:28Z)