Learning Transferable Spatiotemporal Representations from Natural Script
Knowledge
- URL: http://arxiv.org/abs/2209.15280v1
- Date: Fri, 30 Sep 2022 07:39:48 GMT
- Title: Learning Transferable Spatiotemporal Representations from Natural Script
Knowledge
- Authors: Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia,
Yixiao Ge
- Abstract summary: We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations.
The advantages enable our model to contextualize what is happening like human beings and seamlessly apply to large-scale uncurated video data in the real world.
- Score: 65.40899722211726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-training on large-scale video data has become a common recipe for
learning transferable spatiotemporal representations in recent years. Despite
some progress, existing methods are mostly limited to highly curated datasets
(e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We
argue that it is due to the fact that they only capture pixel-level knowledge
rather than spatiotemporal commonsense, which is far away from cognition-level
video understanding. Inspired by the great success of image-text pre-training
(e.g., CLIP), we take the first step to exploit language semantics to boost
transferable spatiotemporal representation learning. We introduce a new pretext
task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR
scripts by attending to learned video representations. We do not rely on
descriptive captions and learn purely from video, i.e., leveraging the natural
transcribed speech knowledge to provide noisy but useful semantics over time.
Furthermore, rather than the simple concept learning in vision-caption
contrast, we encourage cognition-level temporal commonsense reasoning via
narrative reorganization. The advantages enable our model to contextualize what
is happening like human beings and seamlessly apply to large-scale uncurated
video data in the real world. Note that our method differs from ones designed
for video-text alignment (e.g., Frozen) and multimodal representation learning
(e.g., Merlot). Our method demonstrates strong out-of-the-box spatiotemporal
representations on diverse video benchmarks, e.g., +13.6% gains over VideoMAE
on SSV2 via linear probing.
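To make the pretext objective concrete, below is a minimal, hypothetical sketch of a transcript-sorting head in PyTorch: each shuffled ASR segment cross-attends to the video representation and a classifier predicts its original position. The module names, dimensions, and single cross-attention layer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TranscriptSortingHead(nn.Module):
    """Predict the original position of each shuffled ASR segment by
    cross-attending to video features (a TVTS-style pretext objective)."""
    def __init__(self, dim: int, num_segments: int):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(dim, num_segments)  # one logit per original slot

    def forward(self, text_tokens, video_tokens):
        # text_tokens:  (B, S, D) -- one embedding per shuffled ASR segment
        # video_tokens: (B, T, D) -- spatiotemporal video representation
        attended, _ = self.cross_attn(text_tokens, video_tokens, video_tokens)
        return self.classifier(attended)  # (B, S, S) position logits

def sorting_loss(position_logits, original_positions):
    # original_positions[b, i] = original index of the i-th shuffled segment
    return nn.functional.cross_entropy(
        position_logits.flatten(0, 1), original_positions.flatten()
    )

# Toy usage with random tensors standing in for encoder outputs.
B, S, T, D = 2, 4, 16, 256
head = TranscriptSortingHead(D, S)
logits = head(torch.randn(B, S, D), torch.randn(B, T, D))
targets = torch.stack([torch.randperm(S) for _ in range(B)])
loss = sorting_loss(logits, targets)
```

The learned video encoder is then assessed out-of-the-box, e.g., via linear probing on benchmarks such as SSV2, as reported in the abstract above.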
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the query sentence typically matches only a fraction of the prominent foreground video content and has limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Contrastive Language Video Time Pre-training [12.876308881183371]
We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning.
Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features.
We validated our method on CharadesEgo action recognition, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-06-04T02:48:59Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its temporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video (a minimal sketch of this pipeline appears after this list).
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science: persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data [18.479220305684837]
Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
arXiv Detail & Related papers (2023-04-04T19:11:05Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which takes the structure of objects into account during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
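The zero-shot pipeline described in "A Video Is Worth 4096 Tokens" above can be pictured with the short sketch below: caption sampled clips, stitch the captions into a story, then answer questions over the text alone. This is a hypothetical illustration; `caption_clip` and `answer_with_llm` are assumed placeholder callables, not an API from that paper.

```python
from typing import Callable, List

def verbalize_video(clips: List[str], caption_clip: Callable[[str], str]) -> str:
    """Turn a long video into a natural-language story: one caption per sampled clip."""
    return " ".join(caption_clip(clip) for clip in clips)

def zero_shot_video_task(clips, question, caption_clip, answer_with_llm):
    """Run the video-understanding task on the generated story instead of the pixels."""
    story = verbalize_video(clips, caption_clip)
    prompt = f"Story: {story}\nQuestion: {question}\nAnswer:"
    return answer_with_llm(prompt)  # any text-only LLM call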
This list is automatically generated from the titles and abstracts of the papers on this site.