Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples
- URL: http://arxiv.org/abs/2309.02041v1
- Date: Tue, 5 Sep 2023 08:34:23 GMT
- Title: Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples
- Authors: Guanghui Li, Mingqi Gao, Heng Liu, Xiantong Zhen, Feng Zheng
- Abstract summary: Referring video object segmentation (RVOS) relies on sufficient annotated data for a given scene.
In more realistic scenarios, only minimal annotations are available for a new scene.
We propose a model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture.
The CMA module builds multimodal affinity with a few samples, quickly learning new semantic information and enabling the model to adapt to different scenarios.
- Score: 61.66967790884943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation (RVOS), as a supervised learning task,
relies on sufficient annotated data for a given scene. However, in more
realistic scenarios, only minimal annotations are available for a new scene,
which poses significant challenges to existing RVOS methods. With this in mind,
we propose a simple yet effective model with a newly designed cross-modal
affinity (CMA) module based on a Transformer architecture. The CMA module
builds multimodal affinity with a few samples, quickly learning new semantic
information and enabling the model to adapt to different scenarios. Since the
proposed method targets limited samples for new scenes, we generalize the
problem as few-shot referring video object segmentation (FS-RVOS). To
foster research in this direction, we build up a new FS-RVOS benchmark based on
currently available datasets. The benchmark covers a wide range of scenes and
situations, closely simulating real-world scenarios.
Extensive experiments show that our model adapts well to different scenarios
with only a few samples, reaching state-of-the-art performance on the
benchmark. On Mini-Ref-YouTube-VOS, our model achieves an average performance
of 53.1 J and 54.8 F, which are 10% better than the baselines. Furthermore, we
show impressive results of 77.7 J and 74.8 F on Mini-Ref-SAIL-VOS, which are
significantly better than the baselines. Code is publicly available at
https://github.com/hengliusky/Few_shot_RVOS.
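The abstract does not spell out how the CMA module is implemented. As a rough illustration only, a cross-modal affinity block of this kind can be sketched as standard Transformer cross-attention between visual and language features. The PyTorch module below is a minimal, hypothetical sketch: the class name, dimensions, and layer order are assumptions for illustration, not the authors' code from the linked repository.

```python
# Minimal, hypothetical sketch of a cross-modal affinity block built from
# standard Transformer cross-attention (PyTorch). Names, dimensions, and
# layer order are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn

class CrossModalAffinity(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Language tokens attend to pixels, then pixels attend to the
        # language tokens, building affinity in both directions.
        self.txt_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, vis, txt):
        # vis: (B, N_pixels, dim) flattened frame features
        # txt: (B, N_words, dim)  referring-expression features
        t, _ = self.txt_from_vis(txt, vis, vis)   # words gather visual context
        txt = self.norm_t(txt + t)
        v, _ = self.vis_from_txt(vis, txt, txt)   # pixels gather language context
        vis = self.norm_v(vis + v)
        return vis + self.ffn(vis)                # language-aware visual features

# Toy usage: an 8x8 frame feature map and a 5-word expression.
fused = CrossModalAffinity()(torch.randn(2, 64, 256), torch.randn(2, 5, 256))
print(fused.shape)  # torch.Size([2, 64, 256])
```

In the FS-RVOS setting described above, fused features of this kind from the few annotated support samples would condition segmentation of the query video; see the linked repository for the authors' actual implementation.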
Related papers
- 1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation [65.45702890457046]
We integrate the strengths of leading RVOS models to build up an effective paradigm.
To improve the consistency and quality of masks, we propose a Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-YouTube-VOS validation set and 70% J&F on the test set, ranking 1st place in Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
arXiv Detail & Related papers (2024-01-01T04:24:48Z)
- Revisiting Few-Shot Object Detection with Vision-Language Models [49.79495118650838]
We revisit the task of few-shot object detection (FSOD) in the context of recent foundational vision-language models (VLMs).
We propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data.
We discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community.
arXiv Detail & Related papers (2023-12-22T07:42:00Z)
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects [55.77542145604758]
FoundationPose is a unified foundation model for 6D object pose estimation and tracking.
Our approach can be instantly applied at test-time to a novel object without fine-tuning.
arXiv Detail & Related papers (2023-12-13T18:28:09Z)
- Scalable Video Object Segmentation with Simplified Framework [21.408446548059956]
This paper presents a scalable VOS (SimVOS) framework to perform joint feature extraction and matching.
SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features.
Experimentally, our SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks.
arXiv Detail & Related papers (2023-08-19T04:30:48Z)
- MOSE: A New Dataset for Video Object Segmentation in Complex Scenes [106.64327718262764]
Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence.
The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets.
We collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study tracking and segmenting objects in complex environments.
arXiv Detail & Related papers (2023-02-03T17:20:03Z)
- Learning What to Learn for Video Object Segmentation [157.4154825304324]
We introduce an end-to-end trainable VOS architecture that integrates a differentiable few-shot learning module.
This internal learner is designed to predict a powerful parametric model of the target.
We set a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5.
arXiv Detail & Related papers (2020-03-25T17:58:43Z)