End-to-End Referring Video Object Segmentation with Multimodal
Transformers
- URL: http://arxiv.org/abs/2111.14821v1
- Date: Mon, 29 Nov 2021 18:59:32 GMT
- Title: End-to-End Referring Video Object Segmentation with Multimodal
Transformers
- Authors: Adam Botach, Evgenii Zheltonozhskii, Chaim Baskin
- Abstract summary: We propose a simple Transformer-based approach to the referring video object segmentation task.
Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem.
MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The referring video object segmentation task (RVOS) involves segmentation of
a text-referred object instance in the frames of a given video. Due to the
complex nature of this multimodal task, which combines text reasoning, video
understanding, instance segmentation and tracking, existing approaches
typically rely on sophisticated pipelines in order to tackle it. In this paper,
we propose a simple Transformer-based approach to RVOS. Our framework, termed
Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence
prediction problem. Following recent advancements in computer vision and
natural language processing, MTTR is based on the realization that video and
text can both be processed together effectively and elegantly by a single
multimodal Transformer model. MTTR is end-to-end trainable, free of
text-related inductive bias components and requires no additional
mask-refinement post-processing steps. As such, it simplifies the RVOS pipeline
considerably compared to existing methods. Evaluation on standard benchmarks
reveals that MTTR significantly outperforms previous art across multiple
metrics. In particular, MTTR shows impressive +5.7 and +5.0 mAP gains on the
A2D-Sentences and JHMDB-Sentences datasets respectively, while processing 76
frames per second. In addition, we report strong results on the public
validation set of Refer-YouTube-VOS, a more challenging RVOS dataset that has
yet to receive the attention of researchers. The code to reproduce our
experiments is available at https://github.com/mttr2021/MTTR
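To make the "single multimodal Transformer, sequence prediction" idea from the abstract concrete, below is a minimal, hypothetical PyTorch sketch, not the released MTTR code: it projects per-frame video features and text-token embeddings into a shared width, concatenates them into one joint sequence, decodes a fixed set of object queries, and turns each query into per-frame mask logits via a dot product with the frame features. All module names, dimensions, and hyperparameters are illustrative assumptions.

```python
# Hypothetical illustration of RVOS as sequence prediction with one multimodal
# Transformer (not the official MTTR implementation; names and shapes are assumptions).
import torch
import torch.nn as nn


class ToyMultimodalRVOS(nn.Module):
    def __init__(self, d_model=256, num_queries=16, video_dim=2048, text_dim=768):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)  # per-patch video features -> d_model
        self.text_proj = nn.Linear(text_dim, d_model)    # text-token embeddings -> d_model
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.queries = nn.Embedding(num_queries, d_model)  # learned object queries
        self.mask_head = nn.Linear(d_model, d_model)       # query -> mask kernel

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, HW, video_dim) backbone features for T frames of HW patches
        # text_feats:  (B, L, text_dim) token embeddings of the referring expression
        B, T, HW, _ = video_feats.shape
        v = self.video_proj(video_feats).flatten(1, 2)             # (B, T*HW, d)
        t = self.text_proj(text_feats)                             # (B, L, d)
        src = torch.cat([v, t], dim=1)                             # one joint multimodal sequence
        tgt = self.queries.weight.unsqueeze(0).expand(B, -1, -1)   # (B, Q, d)
        hs = self.transformer(src, tgt)                            # decoded object queries (B, Q, d)
        kernels = self.mask_head(hs)                               # per-query mask kernels (B, Q, d)
        pixels = v.view(B, T, HW, -1)                              # per-frame pixel embeddings
        # Dot product of each query kernel with every frame's pixels gives mask logits.
        return torch.einsum('bqd,bthd->bqth', kernels, pixels)     # (B, Q, T, HW)


# Example usage with random features standing in for a video backbone and text encoder.
model = ToyMultimodalRVOS()
video = torch.randn(2, 8, 14 * 14, 2048)   # 2 clips, 8 frames, 14x14 feature maps
text = torch.randn(2, 12, 768)             # 12 text tokens per expression
mask_logits = model(video, text)           # torch.Size([2, 16, 8, 196])
```

The sketch only conveys the data flow; in the actual method, matching-based training losses and the association of queries to the referred instance across frames are essential and are not shown here.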
Related papers
- Task-Specific Alignment and Multiple Level Transformer for Few-Shot
Action Recognition [11.700737340560796]
In recent years, some works have used the Transformer to process frames, obtaining attention features and enhanced prototypes, with competitive results.
We address the limitations of these methods through an end-to-end approach named "Task-Specific Alignment and Multiple-level Transformer Network" (TSA-MLT).
Our method achieves state-of-the-art results on the HMDB51 and UCF101 datasets and competitive results on the Kinetics and Something-Something V2 benchmarks.
arXiv Detail & Related papers (2023-07-05T02:13:25Z) - Referred by Multi-Modality: A Unified Temporal Transformer for Video
Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
For the first time within a unified framework, MUTR adopts a DETR-style transformer and can segment video objects designated by either a text or an audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z) - Multi-Attention Network for Compressed Video Referring Object
Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require the compressed video bitstream to be decoded to RGB frames before segmentation.
This may hamper their application in real-world scenarios with limited computing resources, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z) - End-to-End Video Text Spotting with Transformer [86.46724646835627]
We propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR).
TransDETR is the first end-to-end trainable video text spotting framework, simultaneously addressing the three sub-tasks of text detection, tracking, and recognition.
arXiv Detail & Related papers (2022-03-20T12:14:58Z) - TransVOD: End-to-end Video Object Detection with Spatial-Temporal
Transformers [96.981282736404]
We present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures.
Our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP.
Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS.
arXiv Detail & Related papers (2022-01-13T16:17:34Z) - SOTR: Segmenting Objects with Transformers [0.0]
We present a novel, flexible, and effective transformer-based model for high-quality instance segmentation.
The proposed method, Segmenting Objects with TRansformers (SOTR), simplifies the segmentation pipeline.
Our SOTR performs well on the MS COCO dataset and surpasses state-of-the-art instance segmentation approaches.
arXiv Detail & Related papers (2021-08-15T14:10:11Z) - End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon the Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z) - HiT: Hierarchical Transformer with Momentum Contrast for Video-Text
Retrieval [40.646628490887075]
We propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval.
HiT performs hierarchical cross-modal contrastive matching in feature-level and semantic-level to achieve multi-view and comprehensive retrieval results.
Inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative interactions on the fly (a minimal sketch of this recipe follows the list below).
arXiv Detail & Related papers (2021-03-28T04:52:25Z) - End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)
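The HiT entry above mentions a MoCo-inspired Momentum Cross-modal Contrast. The snippet below is a hypothetical sketch of that general recipe, a slowly updated key encoder plus a queue of negatives feeding an InfoNCE loss, under the assumption that HiT follows the standard MoCo construction; it is not the authors' code, and every name and hyperparameter is illustrative.

```python
# Hypothetical MoCo-style momentum contrast across modalities (illustrative only).
import torch
import torch.nn.functional as F


@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # The key encoder is kept as an exponential moving average of the query encoder.
    for q_p, k_p in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_p.data.mul_(m).add_(q_p.data, alpha=1 - m)


def cross_modal_info_nce(video_q, text_k, text_queue, temperature=0.07):
    # video_q:    (B, D) video embeddings from the query encoder
    # text_k:     (B, D) matching text embeddings from the momentum (key) encoder
    # text_queue: (K, D) text negatives accumulated from previous batches
    video_q = F.normalize(video_q, dim=-1)
    text_k = F.normalize(text_k, dim=-1)
    text_queue = F.normalize(text_queue, dim=-1)
    pos = (video_q * text_k).sum(dim=-1, keepdim=True)      # (B, 1) positive similarities
    neg = video_q @ text_queue.t()                           # (B, K) negative similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)   # the positive is always index 0
    return F.cross_entropy(logits, labels)
```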
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.