Multi-entity Video Transformers for Fine-Grained Video Representation Learning
- URL: http://arxiv.org/abs/2311.10873v1
- Date: Fri, 17 Nov 2023 21:23:12 GMT
- Title: Multi-entity Video Transformers for Fine-Grained Video Representation Learning
- Authors: Matthew Walmer, Rose Kanjirathinkal, Kai Sheng Tai, Keyur Muzumdar,
Taipeng Tian, Abhinav Shrivastava
- Abstract summary: We re-examine the design of transformer architectures for video representation learning.
A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline.
Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks.
- Score: 36.31020249963468
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The area of temporally fine-grained video representation learning aims to
generate frame-by-frame representations for temporally dense tasks. In this
work, we advance the state-of-the-art for this area by re-examining the design
of transformer architectures for video representation learning. A salient
aspect of our self-supervised method is the improved integration of spatial
information in the temporal pipeline by representing multiple entities per
frame. Prior works use late fusion architectures that reduce frames to a single
dimensional vector before any cross-frame information is shared, while our
method represents each frame as a group of entities or tokens. Our Multi-entity
Video Transformer (MV-Former) architecture achieves state-of-the-art results on
multiple fine-grained video benchmarks. MV-Former leverages image features from
self-supervised ViTs, and employs several strategies to maximize the utility of
the extracted features while also avoiding the need to fine-tune the complex
ViT backbone. This includes a Learnable Spatial Token Pooling strategy, which
is used to identify and extract features for multiple salient regions per
frame. Our experiments show that MV-Former not only outperforms previous
self-supervised methods, but also surpasses some prior works that use
additional supervision or training data. When combined with additional
pre-training data from Kinetics-400, MV-Former achieves a further performance
boost. The code for MV-Former is available at
https://github.com/facebookresearch/video_rep_learning.
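To make the multi-entity design above concrete, here is a minimal PyTorch sketch: a small set of learned queries cross-attends over frozen ViT patch tokens to pool a few entity tokens per frame (a stand-in for the Learnable Spatial Token Pooling described in the abstract), and a temporal transformer then shares information across all frame-entity tokens before producing frame-level representations. The module names, dimensions, number of entities, and the mean-pooled readout are illustrative assumptions, not the released implementation; see the linked repository for the authors' code.

```python
import torch
import torch.nn as nn

class LearnableSpatialTokenPooling(nn.Module):
    """Pools frozen ViT patch tokens into K entity tokens per frame.

    A set of K learned queries cross-attends over the patch tokens of a
    single frame, so each query gathers features for one salient region.
    (Illustrative sketch, not the paper's exact implementation.)
    """
    def __init__(self, dim: int, num_entities: int = 4, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_entities, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B*T, P, D) patch features from a frozen ViT backbone
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        entities, _ = self.attn(q, patch_tokens, patch_tokens)
        return entities  # (B*T, K, D)

class MultiEntityTemporalModel(nn.Module):
    """Temporal transformer over per-frame entity tokens: spatial structure
    enters the temporal pipeline early, instead of late fusion to one
    vector per frame."""
    def __init__(self, dim: int = 768, num_entities: int = 4, depth: int = 4):
        super().__init__()
        self.pool = LearnableSpatialTokenPooling(dim, num_entities)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, vit_patches: torch.Tensor) -> torch.Tensor:
        # vit_patches: (B, T, P, D) frozen ViT patch tokens for T frames
        # (temporal position embeddings omitted for brevity)
        B, T, P, D = vit_patches.shape
        entities = self.pool(vit_patches.reshape(B * T, P, D))   # (B*T, K, D)
        entities = entities.reshape(B, T * entities.size(1), D)  # flatten time x entities
        tokens = self.temporal(entities)                          # cross-frame attention
        # average the entity tokens of each frame for frame-level outputs
        return tokens.reshape(B, T, -1, D).mean(dim=2)            # (B, T, D)

# usage: 2 clips, 8 frames, 196 ViT patch tokens of width 768 per frame
# frame_reps = MultiEntityTemporalModel()(torch.randn(2, 8, 196, 768))
```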
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three localization tasks with a single, unified approach (a rough sketch of this two-tower fusion pattern follows below).
arXiv Detail & Related papers (2023-08-21T22:15:20Z)
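The sketch below illustrates the two-tower-plus-fusion pattern the UnLoc summary describes: frozen image and text towers produce per-frame and per-token features, the two token sequences are concatenated and passed through a fusion transformer, and a head over the frame tokens yields per-frame relevance scores for localization. All module names and shapes here are assumptions for illustration, not UnLoc's actual architecture.

```python
import torch
import torch.nn as nn

class VideoTextFusionLocalizer(nn.Module):
    """Illustrative two-tower fusion head: frame tokens and text tokens are
    jointly encoded, then each frame token is scored for relevance."""
    def __init__(self, dim: int = 512, depth: int = 2):
        super().__init__()
        self.type_embed = nn.Embedding(2, dim)  # 0 = frame token, 1 = text token
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, frame_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) from a frozen image tower (one vector per frame)
        # text_feats:  (B, L, D) from a frozen text tower (one vector per text token)
        B, T, _ = frame_feats.shape
        frame_feats = frame_feats + self.type_embed.weight[0]
        text_feats = text_feats + self.type_embed.weight[1]
        tokens = torch.cat([frame_feats, text_feats], dim=1)  # (B, T+L, D)
        fused = self.fusion(tokens)
        # score only the frame tokens: higher = more relevant to the text query
        return self.score_head(fused[:, :T]).squeeze(-1)       # (B, T)
```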
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation.
For the first time in a unified framework, MUTR adopts a DETR-style transformer and can segment video objects designated by either text or audio references.
For high-level temporal interaction after the transformer, inter-frame feature communication is conducted between object embeddings, improving object-wise correspondence for tracking along the video (a sketch of this pattern follows the entry).
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
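The inter-frame feature communication mentioned above can be pictured as attention applied along the time axis, separately for each object query, so that the same object's embeddings in different frames exchange information. The sketch below is only a guess at that pattern in PyTorch, not MUTR's actual code.

```python
import torch
import torch.nn as nn

class InterFrameObjectCommunication(nn.Module):
    """Illustrative temporal mixing of per-frame object embeddings: each
    object query attends across its own embeddings over time, which helps
    keep object-wise correspondence consistent for tracking."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_embeds: torch.Tensor) -> torch.Tensor:
        # obj_embeds: (B, T, Q, D) -- Q DETR-style object queries per frame
        B, T, Q, D = obj_embeds.shape
        # fold the object axis into the batch so attention runs along time
        x = obj_embeds.permute(0, 2, 1, 3).reshape(B * Q, T, D)
        mixed, _ = self.attn(x, x, x)
        x = self.norm(x + mixed)  # residual connection
        return x.reshape(B, Q, T, D).permute(0, 2, 1, 3)  # back to (B, T, Q, D)
```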
- Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
We propose a Multimodal Frame-Scoring Transformer (MFST) framework that exploits visual, text, and audio features to score a video at the frame level.
The MFST framework first extracts features for each modality (visual, text, audio) using pretrained encoders.
MFST then trains a multimodal frame-scoring transformer that takes the video, text, and audio representations as input and predicts frame-level scores (a minimal sketch follows this entry).
arXiv Detail & Related papers (2022-07-05T05:14:15Z)
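A minimal sketch of the frame-scoring pattern described above: pre-extracted visual, text, and audio features for each frame are projected to a shared width, fused, contextualized by a transformer over frames, and mapped to one importance score per frame. The feature dimensions and fusion-by-summation choice are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MultimodalFrameScorer(nn.Module):
    """Illustrative frame-scoring transformer: per-frame visual/text/audio
    features in, one importance score per frame out."""
    def __init__(self, d_vis=768, d_txt=512, d_aud=128, dim=256, depth=2):
        super().__init__()
        self.proj_vis = nn.Linear(d_vis, dim)
        self.proj_txt = nn.Linear(d_txt, dim)
        self.proj_aud = nn.Linear(d_aud, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.score = nn.Linear(dim, 1)

    def forward(self, vis, txt, aud):
        # vis: (B, T, d_vis), txt: (B, T, d_txt), aud: (B, T, d_aud)
        # features come from frozen pretrained encoders, aligned per frame
        x = self.proj_vis(vis) + self.proj_txt(txt) + self.proj_aud(aud)
        x = self.encoder(x)                                 # contextualize across frames
        return torch.sigmoid(self.score(x)).squeeze(-1)     # (B, T) scores in [0, 1]
```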
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from only a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
- VRT: A Video Restoration Transformer [126.79589717404863]
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames.
We propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities.
arXiv Detail & Related papers (2022-01-28T17:54:43Z)
- Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens in a sequence, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between video frames to obtain better contextualized representations for individual frames (a rough sketch of this pattern follows below).
arXiv Detail & Related papers (2021-10-26T19:58:32Z)
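One simple way to realize the local-plus-global idea above is to run two attention passes over the frame sequence, one restricted to a sliding window of neighbouring frames and one unrestricted, and combine their outputs. The sketch below is a generic illustration of that pattern, not the specific block proposed in the paper.

```python
import torch
import torch.nn as nn

class LocalGlobalTemporalAttention(nn.Module):
    """Illustrative block mixing windowed (local) and full (global)
    self-attention over a sequence of per-frame features."""
    def __init__(self, dim: int = 512, num_heads: int = 8, window: int = 5):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) per-frame features
        B, T, _ = frames.shape
        # block attention to frames farther than `window` steps away in the local pass
        idx = torch.arange(T, device=frames.device)
        local_mask = (idx[None, :] - idx[:, None]).abs() > self.window  # (T, T), True = blocked
        local_out, _ = self.local_attn(frames, frames, frames, attn_mask=local_mask)
        global_out, _ = self.global_attn(frames, frames, frames)
        return self.norm(frames + local_out + global_out)  # residual fusion of both views
```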
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the success of transformers and the natural frame-shot-video structure of videos, a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed on top of the hierarchical transformer.
Extensive experiments show that HMT surpasses most traditional, RNN-based, and attention-based video summarization methods (a rough sketch of the hierarchy follows this entry).
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
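The hierarchical idea above can be sketched as two stacked encoders: a frame-level transformer summarizes each shot from its frames, and a shot-level transformer then models relations between shots, with a second stream fused at the shot level. The code below is a rough illustration under those assumptions, not the HMT implementation; the second-stream input and the fusion-by-concatenation are hypothetical choices.

```python
import torch
import torch.nn as nn

def make_encoder(dim: int, depth: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class HierarchicalVideoEncoder(nn.Module):
    """Illustrative frame -> shot -> video hierarchy with a simple
    two-stream fusion at the shot level."""
    def __init__(self, dim: int = 512, depth: int = 2):
        super().__init__()
        self.frame_encoder = make_encoder(dim, depth)
        self.shot_encoder = make_encoder(dim, depth)
        self.fuse = nn.Linear(2 * dim, dim)  # concat visual + second stream, project back

    def forward(self, frames: torch.Tensor, other_stream: torch.Tensor) -> torch.Tensor:
        # frames:       (B, S, F, D) -- S shots, F frames per shot, visual features
        # other_stream: (B, S, D)    -- one feature per shot from a second modality
        B, S, F, D = frames.shape
        frame_tokens = self.frame_encoder(frames.reshape(B * S, F, D))
        shot_tokens = frame_tokens.mean(dim=1).reshape(B, S, D)   # one token per shot
        shot_tokens = self.fuse(torch.cat([shot_tokens, other_stream], dim=-1))
        return self.shot_encoder(shot_tokens)                      # (B, S, D) shot representations
```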