Video-Language Alignment via Spatio-Temporal Graph Transformer
- URL: http://arxiv.org/abs/2407.11677v2
- Date: Tue, 23 Jul 2024 18:48:06 GMT
- Title: Video-Language Alignment via Spatio-Temporal Graph Transformer
- Authors: Shi-Xue Zhang, Hongfa Wang, Xiaobin Zhu, Weibo Gu, Tianjin Zhang, Chun Yang, Wei Liu, Xu-Cheng Yin,
- Abstract summary: Video-language alignment is a crucial task that benefits various downstream applications, e.g., video-text retrieval and question answering.
We propose a novel Spatio-Temporal Graph Transformer module to uniformly learn spatial and temporal contexts for video-language alignment pre-training.
- Score: 26.109883502401885
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video-language alignment is a crucial multi-modal task that benefits various downstream applications, e.g., video-text retrieval and video question answering. Existing methods either utilize multi-modal information in video-text pairs or apply global and local alignment techniques to promote alignment precision. However, these methods often fail to fully explore the spatio-temporal relationships among vision tokens within video and across different video-text pairs. In this paper, we propose a novel Spatio-Temporal Graph Transformer module to uniformly learn spatial and temporal contexts for video-language alignment pre-training (dubbed STGT). Specifically, our STGT combines spatio-temporal graph structure information with attention in transformer block, effectively utilizing the spatio-temporal contexts. In this way, we can model the relationships between vision tokens, promoting video-text alignment precision for benefiting downstream tasks. In addition, we propose a self-similarity alignment loss to explore the inherent self-similarity in the video and text. With the initial optimization achieved by contrastive learning, it can further promote the alignment accuracy between video and text. Experimental results on challenging downstream tasks, including video-text retrieval and video question answering, verify the superior performance of our method.
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment [42.557643515992005]
Video understanding remains a challenge despite the availability of substantial web video-text data.
We introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding.
arXiv Detail & Related papers (2024-05-22T18:35:10Z) - In-Style: Bridging Text and Uncurated Videos with Style Transfer for
Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection
to Image-Text Pre-Training [70.83385449872495]
The correlation between the vision and text is essential for video moment retrieval (VMR)
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z) - Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal
Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z) - Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z) - Visual Spatio-temporal Relation-enhanced Network for Cross-modal
Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual S-temporal Relation-enhanced semantic network (CNN-SRNet), a cross-temporal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z) - CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present CLIP2Video network to transfer the image-language training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
arXiv Detail & Related papers (2021-06-21T13:30:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.