Fully Transformer-Equipped Architecture for End-to-End Referring Video
Object Segmentation
- URL: http://arxiv.org/abs/2309.11933v1
- Date: Thu, 21 Sep 2023 09:47:47 GMT
- Title: Fully Transformer-Equipped Architecture for End-to-End Referring Video
Object Segmentation
- Authors: Ping Li and Yu Zhang and Li Yuan and Xianghua Xu
- Abstract summary: We propose an end-to-end RVOS framework which treats the RVOS task as a mask sequence learning problem.
To capture the object-level spatial context, we have developed the Stacked Transformer.
The model finds the best matching between the candidate mask sequences and the text query.
- Score: 24.814534011440877
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Referring Video Object Segmentation (RVOS) requires segmenting the object in a
video referred to by a natural language query. Existing methods mainly rely on
sophisticated pipelines to tackle this cross-modal task, and do not explicitly
model the object-level spatial context which plays an important role in
locating the referred object. Therefore, we propose an end-to-end RVOS
framework completely built upon transformers, termed \textit{Fully
Transformer-Equipped Architecture} (FTEA), which treats the RVOS task as a mask
sequence learning problem and regards all the objects in the video as candidate
objects. Given a video clip with a text query, the visual and textual features are
produced by the encoder, while the corresponding pixel-level and word-level features
are aligned in terms of semantic similarity. To capture the object-level
spatial context, we develop the Stacked Transformer, which individually
characterizes the visual appearance of each candidate object and directly decodes
its feature map into a binary mask sequence in order. Finally, the model finds the
best matching between the candidate mask sequences and the text query. In addition,
to diversify the generated masks for candidate objects, we impose a diversity loss
on the model to capture a more accurate mask of the referred object. Empirical
studies have shown the superiority of the proposed method on three benchmarks,
e.g., FTEA achieves 45.1% and 38.7% in terms of mAP on A2D Sentences (3782
videos) and J-HMDB Sentences (928 videos), respectively; it achieves 56.6% in
terms of $\mathcal{J\&F}$ on Ref-YouTube-VOS (3975 videos and 7451 objects).
Particularly, compared to the best competing method, it has a gain of 2.1% and
3.2% in terms of P$@$0.5 on the former two, respectively, while it has a gain
of 2.9% in terms of $\mathcal{J}$ on the latter one.
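As a rough illustration of the mask-sequence formulation above, the hedged sketch below shows, in plain PyTorch, two of the ingredients the abstract mentions: scoring each candidate mask sequence against the text query, and a pairwise-overlap diversity penalty over the candidates. The tensor names, shapes, and the cosine-similarity / soft-IoU choices are illustrative assumptions, not the paper's actual definitions.
```python
# Hedged sketch (not the authors' code): (1) matching candidate mask sequences
# to the text query, and (2) a diversity loss that discourages candidate masks
# from collapsing onto the same object. Shapes and similarity measures assumed.
import torch
import torch.nn.functional as F

def match_masks_to_query(cand_embed, text_embed):
    """cand_embed: (N, D) one embedding per candidate mask sequence.
    text_embed: (D,) embedding of the language query.
    Returns the index of the best-matching candidate and all scores."""
    scores = F.cosine_similarity(cand_embed, text_embed.unsqueeze(0), dim=-1)  # (N,)
    return scores.argmax().item(), scores

def diversity_loss(mask_logits):
    """mask_logits: (N, T, H, W) per-candidate mask logits over T frames.
    Penalizes the mean pairwise soft-IoU between different candidates."""
    probs = mask_logits.sigmoid().flatten(1)      # (N, T*H*W)
    inter = probs @ probs.t()                     # (N, N) soft intersections
    area = probs.sum(dim=1, keepdim=True)         # (N, 1)
    union = area + area.t() - inter
    soft_iou = inter / union.clamp(min=1e-6)
    off_diag = soft_iou - torch.diag(torch.diag(soft_iou))
    n = probs.size(0)
    return off_diag.sum() / (n * (n - 1))         # mean pairwise overlap

# Toy usage with random tensors.
N, T, H, W, D = 5, 8, 32, 32, 256
best, scores = match_masks_to_query(torch.randn(N, D), torch.randn(D))
loss_div = diversity_loss(torch.randn(N, T, H, W))
```
In FTEA these terms would be trained jointly with the segmentation objective; the snippet only fixes the shapes and the general idea.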
Related papers
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D is a vision-language task that segments all points of the specified object in a 3D point cloud described by a query sentence.
We propose a novel Referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU while using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
- 1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation [65.45702890457046]
We integrate the strengths of leading RVOS models to build an effective paradigm.
To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st place on Track 3 of the 5th Large-scale Video Object Segmentation (LSVOS) Challenge (ICCV 2023).
arXiv Detail & Related papers (2024-01-01T04:24:48Z)
- 1st Place Solution for YouTubeVOS Challenge 2022: Referring Video Object Segmentation [12.100628128028385]
We improve the one-stage method ReferFormer to obtain mask sequences that are strongly correlated with the language descriptions.
We leverage the superior performance of a video object segmentation model to further enhance the quality and temporal consistency of the mask results.
Our single model reaches 70.3 J&F on the Referring Youtube-VOS validation set and 63.0 on the test set, ranking 1st place on the CVPR 2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-12-27T09:22:45Z)
- Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require the compressed video bitstream to be decoded into RGB frames before segmentation.
This may hamper their application in real-world, resource-constrained scenarios such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Towards Robust Video Object Segmentation with Adaptive Object Calibration [18.094698623128146]
Video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames.
We propose a new deep network, which can adaptively construct object representations and calibrate object masks to achieve stronger robustness.
Our model achieves state-of-the-art performance among existing published works and also exhibits superior robustness against perturbations.
arXiv Detail & Related papers (2022-07-02T17:51:29Z)
- The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment object instances in a given video referred by a language expression in all video frames.
This work proposes several tricks to boost performance further, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd place on the CVPR 2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
- Local-Global Context Aware Transformer for Language-Guided Video Segmentation [103.35509224722097]
We explore the task of language-guided video segmentation (LVS).
We present Locater, which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner.
To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset.
arXiv Detail & Related papers (2022-03-18T07:35:26Z)
- Language as Queries for Referring Video Object Segmentation [23.743637144137498]
Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred by a language expression in all video frames.
In this work, we propose a simple and unified framework built upon Transformer, termed ReferFormer.
It views the language as queries and directly attends to the most relevant regions in the video frames.
arXiv Detail & Related papers (2022-01-03T05:54:00Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects under the guidance of a natural language reference.
Previous methods typically tackle RVOS by directly grounding the linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently (a rough sketch follows this entry).
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
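The two-stage entry above ends with a tracklet-language grounding module; the minimal sketch below shows one way such a module could look, with tracklet embeddings cross-attending to the word embeddings of the expression through a small Transformer decoder and a linear head producing per-tracklet matching scores. Module names, sizes, and the scoring head are assumptions for illustration, not the authors' implementation.
```python
# Hypothetical tracklet-language grounding sketch (assumed design, not the
# paper's code): tracklets attend to words, a linear head scores each tracklet.
import torch
import torch.nn as nn

class TrackletGrounder(nn.Module):
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.score = nn.Linear(dim, 1)  # per-tracklet matching score

    def forward(self, tracklet_embed, word_embed):
        # tracklet_embed: (B, N, dim)  one vector per candidate tracklet
        # word_embed:     (B, L, dim)  one vector per word of the expression
        fused = self.decoder(tgt=tracklet_embed, memory=word_embed)  # cross-attend to words
        return self.score(fused).squeeze(-1)  # (B, N) grounding scores

# Toy usage: the tracklet with the highest score is taken as the referred object.
grounder = TrackletGrounder()
scores = grounder(torch.randn(2, 6, 256), torch.randn(2, 12, 256))
referred = scores.argmax(dim=1)  # (B,)
```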