A Unified Multiscale Encoder-Decoder Transformer for Video Segmentation
- URL: http://arxiv.org/abs/2304.05930v2
- Date: Mon, 26 Feb 2024 20:25:03 GMT
- Title: A Unified Multiscale Encoder-Decoder Transformer for Video Segmentation
- Authors: Rezaul Karim, He Zhao, Richard P. Wildes, Mennatullah Siam
- Abstract summary: We present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video.
The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and employs an optional input beyond video (e.g., audio), when available.
We present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions.
- Score: 13.703655451279921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present an end-to-end trainable unified multiscale
encoder-decoder transformer that is focused on dense prediction tasks in video.
The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses
multiscale representation throughout and employs an optional input beyond video
(e.g., audio), when available, for multimodal processing (MED-VT++). Multiscale
representation at both encoder and decoder yields three key benefits: (i)
implicit extraction of spatiotemporal features at different levels of
abstraction for capturing dynamics without reliance on input optical flow, (ii)
temporal consistency at encoding and (iii) coarse-to-fine detection for
high-level (e.g., object) semantics to guide precise localization at decoding.
Moreover, we present a transductive learning scheme through many-to-many label
propagation to provide temporally consistent video predictions. We showcase
MED-VT/MED-VT++ on three unimodal video segmentation tasks (Automatic Video
Object Segmentation (AVOS), actor-action segmentation and Video Semantic
Segmentation (VSS)) as well as a multimodal segmentation task (Audio-Visual
Segmentation (AVS)). Results show that the proposed architecture outperforms
alternative state-of-the-art approaches on multiple benchmarks using only video
(and optional audio) as input, without reliance on optical flow. Finally, to
document details of the model's internal learned representations, we present a
detailed interpretability study, encompassing both quantitative and qualitative
analyses.
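To make the architectural ideas in the abstract concrete, below is a minimal PyTorch sketch of a multiscale encoder-decoder video transformer combined with an affinity-based many-to-many label propagation step. It is an illustration of the general technique only, not the authors' MED-VT/MED-VT++ implementation: the module layout, dimensions, and the propagation rule (the assumed `MultiscaleVideoEncoderDecoder` and `propagate_labels` below) are choices made for this sketch.
```python
# Minimal sketch (assumed, not the authors' code): multiscale spatiotemporal
# encoding, coarse-to-fine query decoding, and many-to-many label propagation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiscaleVideoEncoderDecoder(nn.Module):
    def __init__(self, dims=(256, 128, 64), d_model=128, num_queries=8, num_classes=2):
        super().__init__()
        # Project each backbone scale to a common transformer width.
        self.proj = nn.ModuleList([nn.Conv3d(c, d_model, kernel_size=1) for c in dims])
        # One self-attention block per scale over spatiotemporal tokens
        # (positional encodings omitted for brevity).
        self.enc = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True) for _ in dims]
        )
        # Learnable queries carry object-level semantics through the decoder.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.dec = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True) for _ in dims]
        )
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, feats):
        # feats: per-scale clip features, each (B, C_l, T, H_l, W_l), ordered coarse -> fine.
        B = feats[0].shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)        # (B, Q, D)
        fine_tokens, fine_shape = None, None
        for proj, enc, dec, f in zip(self.proj, self.enc, self.dec, feats):
            x = proj(f)                                        # (B, D, T, H, W)
            _, _, T, H, W = x.shape
            tokens = enc(x.flatten(2).transpose(1, 2))         # (B, T*H*W, D), spatiotemporal attention
            q = dec(q, tokens)                                 # queries refined coarse -> fine
            fine_tokens, fine_shape = tokens, (T, H, W)
        # Dot products between refined queries and the finest tokens give per-pixel masks.
        masks = torch.einsum("bqd,bnd->bqn", q, fine_tokens)   # (B, Q, T*H*W)
        masks = masks.reshape(B, -1, *fine_shape)              # (B, Q, T, H, W)
        return masks, self.classifier(q)                       # masks + per-query class logits


def propagate_labels(frame_feats, frame_logits, temperature=0.1):
    """Illustrative many-to-many label propagation: each frame's prediction is
    re-estimated as an affinity-weighted mixture over all frames, encouraging
    temporally consistent video predictions."""
    feats = F.normalize(frame_feats, dim=-1)                            # (T, D)
    affinity = torch.softmax(feats @ feats.t() / temperature, dim=-1)   # (T, T)
    return affinity @ frame_logits                                      # (T, K)


if __name__ == "__main__":
    # Toy clip: 4 frames, three backbone scales ordered coarse (8x8) to fine (32x32).
    feats = [torch.randn(1, 256, 4, 8, 8),
             torch.randn(1, 128, 4, 16, 16),
             torch.randn(1, 64, 4, 32, 32)]
    masks, logits = MultiscaleVideoEncoderDecoder()(feats)
    print(masks.shape, logits.shape)  # torch.Size([1, 8, 4, 32, 32]) torch.Size([1, 8, 2])
```
The loop runs coarse-to-fine so that the queries first pick up high-level object semantics and are then refined against finer-resolution tokens, mirroring the coarse-to-fine decoding described above; the propagation step mixes every frame's prediction with all others via feature affinities, in the spirit of the many-to-many label propagation scheme.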
Related papers
- EVC-MF: End-to-end Video Captioning Network with Multi-scale Features [13.85795110061781]
We propose an end-to-end encoder-decoder-based network (EVC-MF) for video captioning.
It efficiently utilizes multi-scale visual and textual features to generate video descriptions.
The results demonstrate that EVC-MF yields competitive performance compared with state-of-the-art methods.
arXiv Detail & Related papers (2024-10-22T02:16:02Z)
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [112.44822009714461]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding.
During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.
Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z)
- Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z)
- MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to choose the best prior-knowledge ones.
arXiv Detail & Related papers (2022-03-14T13:15:09Z)
- MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition [11.573689558780764]
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition.
In order to handle the large number of tokens extracted from multiple modalities, we develop several model variants that factorize self-attention across the space, time and modality dimensions.
Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy.
arXiv Detail & Related papers (2021-08-20T18:05:39Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)