Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks
- URL: http://arxiv.org/abs/2401.03177v1
- Date: Sat, 6 Jan 2024 09:38:55 GMT
- Title: Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks
- Authors: Qian Li, Lixin Su, Jiashu Zhao, Long Xia, Hengyi Cai, Suqi Cheng,
Hengzhu Tang, Junfeng Wang, Dawei Yin
- Abstract summary: The main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where query chunks are extracted to describe specific retrieval units.
We formulate chunk-level matching as n-ary correlation modeling between the words of the query and the frames of the video.
- Score: 25.96897989272303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-video retrieval is a challenging task that aims to identify relevant
videos given textual queries. Compared to conventional textual retrieval, the
main obstacle for text-video retrieval is the semantic gap between the textual
nature of queries and the visual richness of video content. Previous works
primarily focus on aligning the query and the video by finely aggregating
word-frame matching signals. Inspired by the human cognitive process of
modularly judging the relevance between text and video, we observe that this
judgment requires high-order matching signals due to the consecutive and
complex nature of video content. In this paper, we propose chunk-level
text-video matching, where query chunks are extracted to describe specific
retrieval units and video chunks are segmented into distinct clips from the
videos. We formulate chunk-level matching as n-ary correlation modeling
between the words of the query and the frames of the video and introduce a
multi-modal hypergraph for n-ary
correlation modeling. By representing textual units and video frames as nodes
and using hyperedges to depict their relationships, a multi-modal hypergraph is
constructed. In this way, the query and the video can be aligned in a
high-order semantic space. In addition, to enhance the model's generalization
ability, the extracted features are fed into a variational inference
component, which yields variational representations under a Gaussian
distribution. The incorporation of hypergraphs and variational inference allows
our model to capture complex, n-ary interactions among textual and visual
contents. Experimental results demonstrate that our proposed method achieves
state-of-the-art performance on the text-video retrieval task.
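
The abstract describes two components: a multi-modal hypergraph that models n-ary correlations between query words and video frames, and a variational inference step that maps the aggregated features to a Gaussian representation. Below is a minimal sketch of how such a pairing could look. It is not the authors' implementation: the module names, the incidence-matrix construction, and the pooling and scoring choices are illustrative assumptions.

```python
# Minimal sketch: hypergraph message passing over a joint word/frame node set,
# followed by a variational (Gaussian) head with reparameterization.
import torch
import torch.nn as nn


class HypergraphConv(nn.Module):
    """One round of hypergraph message passing: nodes -> hyperedges -> nodes."""

    def __init__(self, dim: int):
        super().__init__()
        self.node_to_edge = nn.Linear(dim, dim)
        self.edge_to_node = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, incidence: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, dim) node features (query words + video frames)
        # incidence: (num_nodes, num_edges), 1 if a node belongs to a hyperedge
        deg_e = incidence.sum(dim=0, keepdim=True).clamp(min=1)  # hyperedge degrees
        deg_v = incidence.sum(dim=1, keepdim=True).clamp(min=1)  # node degrees
        edge_feat = self.node_to_edge(incidence.t() @ x / deg_e.t())  # aggregate nodes per hyperedge
        node_feat = self.edge_to_node(incidence @ edge_feat / deg_v)  # scatter back to nodes
        return torch.relu(node_feat + x)


class VariationalHead(nn.Module):
    """Maps features to a Gaussian posterior and samples via reparameterization."""

    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.logvar = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # z ~ N(mu, sigma^2)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl


def match_score(words, frames, incidence, conv, head):
    # words: (num_words, dim), frames: (num_frames, dim)
    nodes = torch.cat([words, frames], dim=0)  # joint multi-modal node set
    nodes = conv(nodes, incidence)             # high-order (n-ary) interaction
    z, kl = head(nodes)                        # variational representation
    q = z[: words.size(0)].mean(dim=0)         # pooled query chunk
    v = z[words.size(0):].mean(dim=0)          # pooled video chunk
    return torch.cosine_similarity(q, v, dim=0), kl


if __name__ == "__main__":
    dim, n_words, n_frames, n_edges = 64, 6, 12, 4
    words, frames = torch.randn(n_words, dim), torch.randn(n_frames, dim)
    # Random 0/1 incidence matrix standing in for learned or heuristic hyperedges
    incidence = (torch.rand(n_words + n_frames, n_edges) > 0.5).float()
    score, kl = match_score(words, frames, incidence, HypergraphConv(dim), VariationalHead(dim))
    print(f"match score={score.item():.3f}, kl={kl.item():.3f}")
```

In training, the KL term would be added to the retrieval loss as a regularizer, which is the usual role of the Gaussian constraint in improving generalization.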
Related papers
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval [125.55386778388818]
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web.
We propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video.
Our method outperforms state-of-the-art methods on four public video retrieval datasets.
arXiv Detail & Related papers (2022-09-27T11:13:48Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is not a clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented in frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)