On Pursuit of Designing Multi-modal Transformer for Video Grounding
- URL: http://arxiv.org/abs/2109.06085v1
- Date: Mon, 13 Sep 2021 16:01:19 GMT
- Title: On Pursuit of Designing Multi-modal Transformer for Video Grounding
- Authors: Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou
- Abstract summary: Video grounding aims to localize the temporal segment corresponding to a sentence query from an untrimmed video.
We propose a novel end-to-end multi-modal Transformer model, dubbed GTR. Specifically, GTR has two encoders for video and language encoding, and a cross-modal decoder for grounding prediction.
All three typical GTR variants achieve record-breaking performance on all datasets and metrics, with several times faster inference speed.
- Score: 35.25323276744999
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video grounding aims to localize the temporal segment corresponding to a
sentence query from an untrimmed video. Almost all existing video grounding
methods fall into two frameworks: 1) Top-down model: It predefines a set of
segment candidates and then conducts segment classification and regression. 2)
Bottom-up model: It directly predicts frame-wise probabilities of the
referential segment boundaries. However, none of these methods is end-to-end,
i.e., they all rely on time-consuming post-processing steps to refine
predictions. To this end, we reformulate video grounding as a set prediction
task and propose a novel end-to-end multi-modal Transformer model, dubbed
GTR. Specifically, GTR has two encoders for video and language
encoding, and a cross-modal decoder for grounding prediction. To facilitate the
end-to-end training, we use a Cubic Embedding layer to transform the raw videos
into a set of visual tokens. To better fuse these two modalities in the
decoder, we design a new Multi-head Cross-Modal Attention. The whole GTR is
optimized via a Many-to-One matching loss. Furthermore, we conduct
comprehensive studies to investigate different model design choices. Extensive
results on three benchmarks have validated the superiority of GTR. All three
typical GTR variants achieve record-breaking performance on all datasets and
metrics, with several times faster inference speed.
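The abstract names GTR's main pieces: a Cubic Embedding layer that turns the raw clip into visual tokens, separate video and language encoders, a cross-modal decoder for grounding prediction, and a Many-to-One matching loss. The sketch below is a minimal, hypothetical PyTorch rendering of that kind of set-prediction pipeline, not the authors' implementation: the learnable grounding queries, the fusion by simple concatenation (standing in for the paper's Multi-head Cross-Modal Attention), the prediction heads, and every hyper-parameter are illustrative assumptions, and the matching loss is omitted.

```python
# Hypothetical GTR-style set-prediction sketch; shapes and hyper-parameters are placeholders.
import torch
import torch.nn as nn


class CubicEmbedding(nn.Module):
    """Turns a raw clip (B, 3, T, H, W) into a sequence of visual tokens via a 3D conv."""
    def __init__(self, dim=256, cube=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(3, dim, kernel_size=cube, stride=cube)

    def forward(self, video):                       # (B, 3, T, H, W)
        x = self.proj(video)                        # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)         # (B, N_video_tokens, dim)


class GTRSketch(nn.Module):
    def __init__(self, dim=256, num_queries=10, heads=8, layers=2):
        super().__init__()
        self.cubic = CubicEmbedding(dim)
        self.video_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), layers)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), layers)
        # Cross-modal decoder: learnable grounding queries attend to the fused memory.
        # Plain concatenation stands in for the paper's Multi-head Cross-Modal Attention.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), layers)
        self.queries = nn.Embedding(num_queries, dim)
        self.span_head = nn.Linear(dim, 2)           # (center, width), normalized to [0, 1]
        self.score_head = nn.Linear(dim, 1)          # confidence of each predicted segment

    def forward(self, video, text_tokens):
        v = self.video_enc(self.cubic(video))        # (B, Nv, dim)
        t = self.text_enc(text_tokens)               # (B, Nt, dim)
        memory = torch.cat([v, t], dim=1)            # fused video/language memory
        q = self.queries.weight.unsqueeze(0).expand(video.size(0), -1, -1)
        h = self.decoder(q, memory)                  # (B, num_queries, dim)
        return self.span_head(h).sigmoid(), self.score_head(h)


# Usage: a 16-frame 64x64 clip and a 12-token pre-embedded query sentence.
model = GTRSketch()
spans, scores = model(torch.randn(1, 3, 16, 64, 64), torch.randn(1, 12, 256))
print(spans.shape, scores.shape)                     # (1, 10, 2) and (1, 10, 1)
```

Predicting a fixed set of (center, width) spans with per-span scores is the set-prediction framing that, per the abstract, lets the model train end-to-end without the post-processing steps used by top-down and bottom-up pipelines.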
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z) - VMFormer: End-to-End Video Matting with Transformer [48.97730965527976]
Video matting aims to predict alpha mattes for each frame from a given input video sequence.
Recent solutions to video matting have been dominated by deep convolutional neural networks (CNNs).
We propose VMFormer: a transformer-based end-to-end method for video matting.
arXiv Detail & Related papers (2022-08-26T17:51:02Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released at https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z) - VRT: A Video Restoration Transformer [126.79589717404863]
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames.
We propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities.
arXiv Detail & Related papers (2022-01-28T17:54:43Z) - Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate road layout understanding as a top-view road attributes prediction problem, and our goal is to predict these attributes for each frame both accurately and consistently.
We exploit three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z)