Referred by Multi-Modality: A Unified Temporal Transformer for Video
Object Segmentation
- URL: http://arxiv.org/abs/2305.16318v2
- Date: Tue, 12 Dec 2023 10:42:46 GMT
- Title: Referred by Multi-Modality: A Unified Temporal Transformer for Video
Object Segmentation
- Authors: Shilin Yan, Renrui Zhang, Ziyu Guo, Wenchao Chen, Wei Zhang, Hongyang
Li, Yu Qiao, Hao Dong, Zhongjiang He, Peng Gao
- Abstract summary: We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
- Score: 54.58405154065508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, video object segmentation (VOS) referred by multi-modal signals,
e.g., language and audio, has attracted increasing attention in both industry and
academia. The task is challenging because it requires exploring both the semantic
alignment between modalities and the visual correspondence across frames. However, existing
methods adopt separate network architectures for different modalities, and
neglect the inter-frame temporal interaction with references. In this paper, we
propose MUTR, a Multi-modal Unified Temporal transformer for Referring video
object segmentation. With a unified framework for the first time, MUTR adopts a
DETR-style transformer and is capable of segmenting video objects designated by
either text or audio reference. Specifically, we introduce two strategies to
fully explore the temporal relations between videos and multi-modal signals.
Firstly, for low-level temporal aggregation before the transformer, we enable
the multi-modal references to capture multi-scale visual cues from consecutive
video frames. This effectively endows the text or audio signals with temporal
knowledge and boosts the semantic alignment between modalities. Secondly, for
high-level temporal interaction after the transformer, we conduct inter-frame
feature communication for different object embeddings, contributing to better
object-wise correspondence for tracking along the video. On Ref-YouTube-VOS and
AVSBench datasets with respective text and audio references, MUTR achieves
+4.2% and +8.7% J&F improvements over state-of-the-art methods, demonstrating the
significance of our unified framework for multi-modal VOS. Code is released at
https://github.com/OpenGVLab/MUTR.
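
The two temporal strategies can be pictured with a minimal PyTorch sketch. This is not the released MUTR code; the module names, tensor shapes, and the use of standard multi-head attention as the aggregation and interaction mechanism are assumptions made only from the abstract.

```python
# Illustrative sketch (not the official MUTR implementation) of the two
# temporal strategies described in the abstract. Shapes and module names
# are assumptions for the purpose of the example.
import torch
import torch.nn as nn


class MultiScaleTemporalAggregation(nn.Module):
    """Low-level aggregation before the transformer: text/audio reference
    tokens attend to multi-scale visual features from consecutive frames."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_tokens, frame_feats):
        # ref_tokens:  (B, N_ref, C)  -- text or audio reference embeddings
        # frame_feats: list over scales of (B, T * H_s * W_s, C) visual tokens
        for feats in frame_feats:
            fused, _ = self.attn(query=ref_tokens, key=feats, value=feats)
            ref_tokens = self.norm(ref_tokens + fused)  # residual update per scale
        return ref_tokens  # references now carry temporal visual cues


class InterFrameObjectInteraction(nn.Module):
    """High-level interaction after the transformer: per-frame object
    embeddings of the same query communicate across frames for tracking."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_embeds):
        # obj_embeds: (B, T, N_query, C) -- object embeddings per frame
        B, T, N, C = obj_embeds.shape
        # treat the frames of each query as a sequence: (B * N, T, C)
        x = obj_embeds.permute(0, 2, 1, 3).reshape(B * N, T, C)
        fused, _ = self.attn(x, x, x)  # self-attention along the time axis
        x = self.norm(x + fused)
        return x.reshape(B, N, T, C).permute(0, 2, 1, 3)


if __name__ == "__main__":
    B, T, C = 2, 4, 256
    refs = torch.randn(B, 10, C)                            # e.g. text tokens
    scales = [torch.randn(B, T * s * s, C) for s in (8, 16)]  # two feature scales
    objs = torch.randn(B, T, 5, C)                          # 5 object queries
    refs = MultiScaleTemporalAggregation(C)(refs, scales)
    objs = InterFrameObjectInteraction(C)(objs)
    print(refs.shape, objs.shape)
```

In this sketch the low-level module injects temporal visual context into the reference tokens before decoding, while the high-level module lets each query's per-frame embedding attend to its counterparts in other frames, mirroring the aggregation-then-interaction order described above.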
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- Multi-entity Video Transformers for Fine-Grained Video Representation Learning [36.31020249963468]
We re-examine the design of transformer architectures for video representation learning.
A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline.
Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks.
arXiv Detail & Related papers (2023-11-17T21:23:12Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering [13.805714443766236]
Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language understanding.
This paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA.
arXiv Detail & Related papers (2021-09-10T08:31:58Z)
- Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z)
- TransVOS: Video Object Segmentation with Transformers [13.311777431243296]
We propose a vision transformer to fully exploit and model both the temporal and spatial relationships.
To slim the popular two-encoder pipeline, we design a single two-path feature extractor.
Experiments demonstrate the superiority of our TransVOS over state-of-the-art methods on both DAVIS and YouTube-VOS datasets.
arXiv Detail & Related papers (2021-06-01T15:56:10Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)