STOA-VLP: Spatial-Temporal Modeling of Object and Action for
Video-Language Pre-training
- URL: http://arxiv.org/abs/2302.09736v2
- Date: Wed, 24 May 2023 01:03:09 GMT
- Title: STOA-VLP: Spatial-Temporal Modeling of Object and Action for
Video-Language Pre-training
- Authors: Weihong Zhong, Mao Zheng, Duyu Tang, Xuan Luo, Heng Gong, Xiaocheng
Feng, Bing Qin
- Abstract summary: We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
- Score: 30.16501510589718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although large-scale video-language pre-training models, which usually build
a global alignment between the video and the text, have achieved remarkable
progress on various downstream tasks, the idea of adopting fine-grained
information during the pre-training stage is not well explored. In this work,
we propose STOA-VLP, a pre-training framework that jointly models object and
action information across spatial and temporal dimensions. More specifically,
the model regards object trajectories across frames and multiple action
features from the video as fine-grained features. In addition, we design two
auxiliary tasks to better incorporate both kinds of information into the
pre-training process of the video-language model. The first is the dynamic
object-text alignment task, which builds a better connection between object
trajectories and the relevant noun tokens. The second is the spatial-temporal
action set prediction, which guides the model to generate consistent action
features by predicting actions found in the text. Extensive experiments on
three downstream tasks (video captioning, text-video retrieval, and video
question answering) demonstrate the effectiveness of our proposed STOA-VLP
(e.g., a 3.7 ROUGE-L improvement on the MSR-VTT video captioning benchmark and a
2.9% accuracy improvement on the MSVD video question answering benchmark over
previous approaches).
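To make the first auxiliary objective more concrete, below is a minimal sketch of a dynamic object-text alignment loss in PyTorch. The tensor names, the greedy matching, and the InfoNCE-style objective are illustrative assumptions, not the authors' released implementation; the spatial-temporal action set prediction task would analogously supervise the learned action features with actions parsed from the caption and is omitted here.
```python
# Hedged sketch: align pooled object-trajectory features with noun-token
# features from the caption. Matching strategy and loss form are assumptions.
import torch
import torch.nn.functional as F

def object_text_alignment_loss(traj_feats: torch.Tensor,
                               noun_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """traj_feats: [N_obj, D] pooled object-trajectory embeddings.
    noun_feats: [N_noun, D] embeddings of noun tokens from the caption.
    Returns a symmetric cross-entropy alignment loss (an assumption)."""
    traj = F.normalize(traj_feats, dim=-1)
    nouns = F.normalize(noun_feats, dim=-1)
    sim = traj @ nouns.t() / temperature        # [N_obj, N_noun] similarities

    # Greedy stand-in for the paper's dynamic matching: each trajectory is
    # pulled toward its most similar noun, and each noun toward its most
    # similar trajectory.
    tgt_obj = sim.argmax(dim=1)                 # best noun index per trajectory
    tgt_noun = sim.argmax(dim=0)                # best trajectory index per noun
    loss_obj = F.cross_entropy(sim, tgt_obj)
    loss_noun = F.cross_entropy(sim.t(), tgt_noun)
    return 0.5 * (loss_obj + loss_noun)

# Usage with random features (shapes are illustrative):
traj_feats = torch.randn(8, 256)   # 8 object trajectories
noun_feats = torch.randn(5, 256)   # 5 noun tokens from the caption
loss = object_text_alignment_loss(traj_feats, noun_feats)
```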
Related papers
- Training-free Video Temporal Grounding using Large-scale Pre-trained Models [41.71055776623368]
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query.
Existing video temporal localization models rely on specific datasets for training and have high data collection costs.
We propose a Training-Free Video Temporal Grounding approach that leverages the ability of pre-trained large models.
arXiv Detail & Related papers (2024-08-29T02:25:12Z)
- Video Language Planning [137.06052217713054]
Video language planning is an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models.
Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task.
It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
arXiv Detail & Related papers (2023-10-16T17:48:45Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate road layout understanding as a top-view road attribute prediction problem, with the goal of predicting these attributes for each frame both accurately and consistently.
We exploit three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)