Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2308.04162v2
- Date: Fri, 17 Jan 2025 11:30:28 GMT
- Title: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation
- Authors: Jiajun Chen, Jiacheng Lin, Guojin Zhong, Haolong Fu, Ke Nai, Kailun Yang, Zhiyong Li,
- Abstract summary: Two tasks aim to segment specific objects from video sequences according to expression prompts.
EPCFormer exploits the fact that audio and text prompts referring to the same objects are semantically equivalent.
The knowledge of video object segmentation in terms of the expression prompts can seamlessly transfer between the two tasks.
- Score: 36.7177155799825
- License:
- Abstract: Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS) are two highly related tasks that both aim to segment specific objects from video sequences according to expression prompts. However, due to the challenges of modeling representations for different modalities, existing methods struggle to strike a balance between interaction flexibility and localization precision. In this paper, we address this problem from two perspectives: the alignment of audio and text and the deep interaction among audio, text, and visual modalities. First, we propose a universal architecture, the Expression Prompt Collaboration Transformer, herein EPCFormer. Next, we propose an Expression Alignment (EA) mechanism for audio and text. The proposed EPCFormer exploits the fact that audio and text prompts referring to the same objects are semantically equivalent by using contrastive learning for both types of expressions. Then, to facilitate deep interactions among audio, text, and visual modalities, we introduce an Expression-Visual Attention (EVA) module. The knowledge of video object segmentation in terms of the expression prompts can seamlessly transfer between the two tasks by deeply exploring complementary cues between text and audio. Experiments on well-recognized benchmarks demonstrate that our EPCFormer attains state-of-the-art results on both tasks. The source code will be made publicly available at https://github.com/lab206/EPCFormer.
Related papers
- Can Textual Semantics Mitigate Sounding Object Segmentation Preference? [10.368382203643739]
We argue that audio lacks robust semantics compared to vision, resulting in weak audio guidance over the visual space.
Motivated by the the fact that text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance.
Our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets.
arXiv Detail & Related papers (2024-07-15T17:45:20Z) - VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both vision perception and answer generation process.
VideoDistill generates answers only from question-related visual embeddings.
We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-01T07:44:24Z) - AVSegFormer: Audio-Visual Segmentation with Transformer [42.24135756439358]
A new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video.
This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges.
We propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.
arXiv Detail & Related papers (2023-07-03T16:37:10Z) - Referred by Multi-Modality: A Unified Temporal Transformer for Video
Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z) - Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - Multi-Attention Network for Compressed Video Referring Object
Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require compressed video bitstream to be decoded to RGB frames before being segmented.
This may hamper its application in real-world computing resource limited scenarios, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z) - The Second Place Solution for The 4th Large-scale Video Object
Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment object instances in a given video referred by a language expression in all video frames.
This work proposes several tricks to boost further, including cyclical learning rates, semi-supervised approach, and test-time augmentation inference.
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z) - Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z) - V2C: Visual Voice Cloning [55.55301826567474]
We propose a new task named Visual Voice Cloning (V2C)
V2C seeks to convert a paragraph of text to a speech with both desired voice specified by a reference audio and desired emotion specified by a reference video.
Our dataset contains 10,217 animated movie clips covering a large variety of genres.
arXiv Detail & Related papers (2021-11-25T03:35:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.