Language as Queries for Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2201.00487v1
- Date: Mon, 3 Jan 2022 05:54:00 GMT
- Title: Language as Queries for Referring Video Object Segmentation
- Authors: Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, Ping Luo
- Abstract summary: Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred to by a language expression in all video frames.
In this work, we propose a simple and unified framework built upon Transformer, termed ReferFormer.
It views the language as queries and directly attends to the most relevant regions in the video frames.
- Score: 23.743637144137498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation (R-VOS) is an emerging cross-modal task
that aims to segment the target object referred to by a language expression in all
video frames. In this work, we propose a simple and unified framework built
upon Transformer, termed ReferFormer. It views the language as queries and
directly attends to the most relevant regions in the video frames. Concretely,
we introduce a small set of object queries conditioned on the language as the
input to the Transformer. In this manner, all the queries are obligated to find
the referred objects only. They are eventually transformed into dynamic kernels
which capture the crucial object-level information, and play the role of
convolution filters to generate the segmentation masks from feature maps. The
object tracking is achieved naturally by linking the corresponding queries
across frames. This mechanism greatly simplifies the pipeline and the
end-to-end framework is significantly different from the previous methods.
Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and
JHMDB-Sentences show the effectiveness of ReferFormer. On Ref-Youtube-VOS,
ReferFormer achieves 55.6 J&F with a ResNet-50 backbone without bells and
whistles, which exceeds the previous state-of-the-art performance by 8.4
points. In addition, with the strong Swin-Large backbone, ReferFormer achieves
the best J&F of 62.4 among all existing methods. The J&F metric can be further
boosted to 63.3 by adopting a simple post-process technique. Moreover, we show
the impressive results of 55.0 mAP and 43.7 mAP on A2D-Sentences and
JHMDB-Sentences respectively, which outperform the previous methods by a
large margin. Code is publicly available at
https://github.com/wjn922/ReferFormer.
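The sketch below is a minimal, hedged illustration of the "language as queries" mechanism the abstract describes: a small set of object queries is conditioned on a pooled sentence embedding, decoded against per-frame features, and turned into dynamic 1x1 convolution kernels that produce the segmentation masks. All module names, shapes, and the additive text-query fusion are assumptions made for illustration only, not the authors' exact design; the official implementation is in the repository linked above.

```python
# Illustrative sketch only -- shapes, fusion, and heads are assumptions,
# not the authors' implementation (see https://github.com/wjn922/ReferFormer).
import torch
import torch.nn as nn


class LanguageAsQueries(nn.Module):
    def __init__(self, num_queries: int = 5, d_model: int = 256, mask_dim: int = 8):
        super().__init__()
        # "A small set of object queries" shared by every frame.
        self.query_embed = nn.Embedding(num_queries, d_model)
        # Hypothetical conditioning: add a projected sentence embedding to each query.
        self.text_proj = nn.Linear(d_model, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        # Each decoded query becomes a dynamic 1x1 conv kernel plus a bias.
        self.kernel_head = nn.Linear(d_model, mask_dim + 1)
        self.mask_dim = mask_dim

    def forward(self, frame_feats, mask_feats, text_feat):
        # frame_feats: (T, HW, d_model)    flattened per-frame visual features
        # mask_feats:  (T, mask_dim, H, W) high-resolution mask feature maps
        # text_feat:   (d_model,)          pooled sentence embedding
        T = frame_feats.shape[0]
        queries = self.query_embed.weight + self.text_proj(text_feat)  # condition on language
        queries = queries.unsqueeze(0).expand(T, -1, -1)               # same queries on every frame
        decoded = self.decoder(queries, frame_feats)                   # (T, N, d_model)

        params = self.kernel_head(decoded)                             # (T, N, mask_dim + 1)
        weight, bias = params[..., : self.mask_dim], params[..., -1]
        # Dynamic kernels act as 1x1 convolution filters over the mask features.
        masks = torch.einsum("tnc,tchw->tnhw", weight, mask_feats)
        masks = masks + bias[..., None, None]
        return masks.sigmoid()  # (T, N, H, W)


# Example with illustrative shapes: a 4-frame clip, 5 language-conditioned queries.
model = LanguageAsQueries()
masks = model(
    frame_feats=torch.randn(4, 32 * 32, 256),
    mask_feats=torch.randn(4, 8, 64, 64),
    text_feat=torch.randn(256),
)
print(masks.shape)  # torch.Size([4, 5, 64, 64])
```

Because the same query index corresponds to the same candidate object in every frame, cross-frame tracking reduces to reading the mask tensor along the query dimension, which is the query-linking behavior described in the abstract.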
Related papers
- 1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation [65.45702890457046]
We integrate the strengths of leading RVOS models to build an effective paradigm.
To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in track 3 of the 5th Large-scale Video Object Segmentation (LSVOS) Challenge (ICCV 2023).
arXiv Detail & Related papers (2024-01-01T04:24:48Z)
- Video Referring Expression Comprehension via Transformer with Content-conditioned Query [68.06199031102526]
Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on a queried natural-language expression.
Recent improvements in video REC have been made using Transformer-based methods with learnable queries.
arXiv Detail & Related papers (2023-10-25T06:38:42Z)
- Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation [24.814534011440877]
We propose an end-to-end RVOS framework which treats the RVOS task as a mask sequence learning problem.
To capture the object-level spatial context, we have developed the Stacked Transformer.
The model finds the best matching between the mask sequence and the text query.
arXiv Detail & Related papers (2023-09-21T09:47:47Z)
- OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation [75.07460026246582]
Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction.
Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with the text embedding.
We propose a simple yet effective online model using explicit query propagation, named OnlineRefer.
arXiv Detail & Related papers (2023-07-18T15:43:35Z)
- 1st Place Solution for YouTubeVOS Challenge 2022: Referring Video Object Segmentation [12.100628128028385]
We improve the one-stage method ReferFormer to obtain mask sequences that are strongly correlated with the language descriptions.
We leverage the superior performance of a video object segmentation model to further enhance the quality and temporal consistency of the mask results.
Our single model reaches 70.3 J&F on the Referring Youtube-VOS validation set and 63.0 on the test set, ranking 1st in the CVPR 2022 Referring Youtube-VOS challenge.
arXiv Detail & Related papers (2022-12-27T09:22:45Z)
- Look Before You Match: Instance Understanding Matters in Video Object Segmentation [114.57723592870097]
In this paper, we argue that instance understanding matters in video object segmentation (VOS).
We present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
We employ well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed.
arXiv Detail & Related papers (2022-12-13T18:59:59Z)
- The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment object instances referred to by a language expression in all frames of a given video.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd in the CVPR 2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
- Object-aware Video-language Pre-training for Retrieval [24.543719616308945]
We present Object-aware Transformers, an object-centric approach that extends the video-language transformer to incorporate object representations.
We show clear improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a video-language architecture.
arXiv Detail & Related papers (2021-12-01T17:06:39Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
- A Transductive Approach for Video Object Segmentation [55.83842083823267]
Semi-supervised video object segmentation aims to separate a target object from a video sequence, given the mask in the first frame.
Most current prevailing methods utilize information from additional modules trained in other domains, such as optical flow and instance segmentation.
We propose a simple yet strong transductive method, in which additional modules, datasets, and dedicated architectural designs are not needed.
arXiv Detail & Related papers (2020-04-15T16:39:36Z)