HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
- URL: http://arxiv.org/abs/2212.14546v1
- Date: Fri, 30 Dec 2022 04:27:01 GMT
- Title: HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
- Authors: Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang
- Abstract summary: We propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
- Score: 49.52679453475878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video-language pre-training has advanced the performance of various
downstream video-language tasks. However, most previous methods directly
inherit or adapt typical image-language pre-training paradigms to
video-language pre-training, thus not fully exploiting the unique
characteristic of video, i.e., its temporal nature. In this paper, we propose a
Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with
two novel pre-training tasks for modeling cross-modal alignment between moments
and texts as well as the temporal relations of video-text pairs. Specifically,
we propose a cross-modal moment exploration task to explore moments in videos,
which results in detailed video moment representations. In addition, the inherent
temporal relations are captured by aligning video-text pairs as a whole at
different time resolutions with a multi-modal temporal relation exploration task.
Furthermore, we introduce the shuffling test to evaluate the temporal reliance
of datasets and video-language pre-training models. We achieve state-of-the-art
results on 15 well-established video-language understanding and generation
tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and
SSv2-Label), with 8.6% and 11.1% improvements, respectively. HiTeA also
demonstrates strong generalization ability when directly transferred to
downstream tasks in a zero-shot manner. Models and demo will be available on
ModelScope.
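
To make the shuffling test concrete, here is a minimal sketch under assumed interfaces: a video is treated as an ordered list of frames, score_fn stands in for any pre-trained cross-modal matching model, and top-1 text retrieval is used as an illustrative metric rather than the paper's exact protocol.

```python
# Minimal sketch of the shuffling test: compare a video-text model's retrieval
# accuracy when frames are in their original order versus temporally shuffled.
# The score_fn interface and the top-1 retrieval metric are illustrative
# assumptions, not the paper's exact evaluation protocol.
import random
from typing import Any, Callable, List, Sequence, Tuple

Frames = List[Any]  # a video as an ordered list of frames, representation-agnostic

def shuffling_test(
    pairs: Sequence[Tuple[Frames, str]],       # (video frames, paired caption)
    score_fn: Callable[[Frames, str], float],  # hypothetical cross-modal matching score
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (ordered accuracy, shuffled accuracy, temporal gap)."""
    rng = random.Random(seed)
    captions = [caption for _, caption in pairs]

    def retrieval_accuracy(shuffle: bool) -> float:
        hits = 0
        for frames, caption in pairs:
            frames = list(frames)
            if shuffle:
                rng.shuffle(frames)  # destroy temporal order, keep visual content
            # Rank every caption by its matching score and count top-1 hits.
            best = max(captions, key=lambda c: score_fn(frames, c))
            hits += int(best == caption)
        return hits / len(pairs)

    acc_ordered = retrieval_accuracy(shuffle=False)
    acc_shuffled = retrieval_accuracy(shuffle=True)
    # A small gap suggests the benchmark (or the model) barely relies on temporal order.
    return acc_ordered, acc_shuffled, acc_ordered - acc_shuffled
```

Under this sketch, a call such as shuffling_test(val_pairs, model_score) would report how much retrieval accuracy is lost once temporal order is destroyed, which is the quantity the test uses to gauge temporal reliance.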
Related papers
- Training-free Video Temporal Grounding using Large-scale Pre-trained Models [41.71055776623368]
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query.
Existing video temporal localization models rely on specific datasets for training and have high data collection costs.
We propose a Training-Free Video Temporal Grounding approach that leverages the ability of pre-trained large models.
arXiv Detail & Related papers (2024-08-29T02:25:12Z)
- Contrastive Language Video Time Pre-training [12.876308881183371]
We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning.
Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features.
We validated our method on CharadesEgo action recognition, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-06-04T02:48:59Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z)
- Revisiting the "Video" in Video-Language Understanding [56.15777956496518]
We propose the atemporal probe (ATP), a new model for video-language analysis.
We characterize the limitations and potential of current video-language benchmarks.
We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
arXiv Detail & Related papers (2022-06-03T17:57:33Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content (see the sketch after this list).
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
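
The VidIL entry above describes a concrete pipeline: frame-level captions and phrases produced by image-language models are composed into a few-shot prompt and completed by a language model. Below is a minimal sketch of that composition step; the prompt template, the example format, and the llm_generate callable are illustrative assumptions, not VidIL's actual implementation.

```python
# Sketch of a VidIL-style few-shot pipeline: compose frame captions and
# object/attribute/event phrases into a prompt with in-context examples,
# then hand it to a text-only language model. All names here are placeholders.
from typing import Callable, List, Sequence, Tuple

def compose_video_prompt(
    frame_captions: List[str],           # one caption per sampled frame, in temporal order
    phrases: List[str],                  # object/attribute/event phrases for the video
    examples: Sequence[Tuple[str, str]], # few-shot (video description, target output) pairs
    instruction: str,
) -> str:
    """Build a text-only prompt that stands in for the video content."""
    lines = [instruction, ""]
    for context, target in examples:     # in-context demonstrations
        lines += [f"Video: {context}", f"Output: {target}", ""]
    lines.append("Video: " + "; ".join(
        f"Frame {i + 1}: {cap}" for i, cap in enumerate(frame_captions)
    ) + " | Phrases: " + ", ".join(phrases))
    lines.append("Output:")
    return "\n".join(lines)

def few_shot_video_task(
    frame_captions: List[str],
    phrases: List[str],
    examples: Sequence[Tuple[str, str]],
    llm_generate: Callable[[str], str],  # hypothetical text-completion interface
    instruction: str = "Describe what happens in the video.",
) -> str:
    prompt = compose_video_prompt(frame_captions, phrases, examples, instruction)
    return llm_generate(prompt)          # the language model completes the target output
```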