X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
- URL: http://arxiv.org/abs/2203.15086v1
- Date: Mon, 28 Mar 2022 20:47:37 GMT
- Title: X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
- Authors: Satya Krishna Gorti, Noel Vouitsis, Junwei Ma, Keyvan Golestan,
Maksims Volkovs, Animesh Garg, Guangwei Yu
- Abstract summary: In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video.
We propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video.
- Score: 26.581384985173116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In text-video retrieval, the objective is to learn a cross-modal similarity
function between a text and a video that ranks relevant text-video pairs higher
than irrelevant pairs. However, videos inherently express a much wider gamut of
information than texts. Instead, texts often capture sub-regions of entire
videos and are most semantically similar to certain frames within videos.
Therefore, for a given text, a retrieval model should focus on the text's most
semantically similar video sub-regions to make a more relevant comparison. Yet,
most existing works aggregate entire videos without directly considering text.
Common text-agnostic aggregation schemes include mean-pooling or
self-attention over the frames, but these are likely to encode misleading
visual information not described in the given text. To address this, we propose
a cross-modal attention model called X-Pool that reasons between a text and the
frames of a video. Our core mechanism is a scaled dot product attention for a
text to attend to its most semantically similar frames. We then generate an
aggregated video representation conditioned on the text's attention weights
over the frames. We evaluate our method on three benchmark datasets, MSR-VTT,
MSVD and LSMDC, achieving new state-of-the-art results with up to a 12% relative
improvement in Recall@1. Our findings thereby highlight the importance of joint
text-video reasoning to extract important visual cues according to text. Full
code and demo can be found at: https://layer6ai-labs.github.io/xpool/
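As a rough illustration of the pooling contrast described above, the sketch below compares a text-agnostic mean pool with a text-conditioned scaled dot-product attention pool over frame embeddings. This is a minimal sketch of the idea only, not the authors' implementation: the full X-Pool model builds on CLIP features and includes additional learned components (e.g., projection layers) omitted here, and the function names and dimensions are illustrative.

    import torch
    import torch.nn.functional as F

    def mean_pool(frame_embs):
        # Text-agnostic baseline: average all frame embeddings,
        # regardless of what the query text describes.
        return frame_embs.mean(dim=0)                      # (d,)

    def text_conditioned_pool(text_emb, frame_embs):
        # Scaled dot-product attention with the text as the query and the
        # frames as keys/values, so frames most similar to the text dominate
        # the aggregated video embedding.
        d = frame_embs.shape[-1]
        scores = frame_embs @ text_emb / d ** 0.5          # (n_frames,)
        weights = F.softmax(scores, dim=-1)                # attention over frames
        return weights @ frame_embs                        # (d,) text-conditioned video embedding

    # Toy usage: the retrieval similarity is then computed between the text
    # embedding and the aggregated video embedding, e.g. cosine similarity.
    text = torch.randn(512)
    frames = torch.randn(16, 512)                          # 16 frame embeddings
    sim = F.cosine_similarity(text, text_conditioned_pool(text, frames), dim=0)

Conditioning the pooling on the text lets the model down-weight frames the text does not describe, which is the failure mode the abstract attributes to mean-pooling and self-attention over the frames.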
Related papers
- Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval [6.656989511639513]
The key to the text-to-video retrieval (TVR) task lies in learning the similarity between each pair of text (consisting of words) and video (consisting of audio and image frames) representations.
We propose a novel multi-granularity feature interaction module called MGFI, consisting of text-frame and word-frame interactions.
We also introduce a cross-modal feature interaction module of audio and text, called CMFI, to address the insufficient expressiveness of frames in the video.
arXiv Detail & Related papers (2024-06-21T02:28:06Z)
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval [31.79030663958162]
We propose a new text modeling method, T-MASS, to enrich text embeddings with a flexible and resilient semantic range.
Specifically, we introduce a similarity-aware radius module to adapt the scale of the text mass to the given text-video pairs.
T-MASS achieves state-of-the-art performance on five benchmark datasets.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos [72.59262815400928]
Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation.
We propose a novel text-to-video generation framework, termed TF-T2V, which can learn directly from text-free videos.
arXiv Detail & Related papers (2023-12-25T16:37:39Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding, where accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for the video representation and a pre-trained text encoder to encode the texts corresponding to each action and to the whole video (see the sketch after this list).
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval [125.55386778388818]
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web.
We propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video.
Our method outperforms state-of-the-art methods on four public video retrieval datasets.
arXiv Detail & Related papers (2022-09-27T11:13:48Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Video and Text Matching with Conditioned Embeddings [81.81028089100727]
We present a method for matching a text sentence from a given corpus to a given video clip and vice versa.
In this work, we encode the data in a way that takes the query's relevant information into account.
We show that our conditioned representation can be transferred to video-guided machine translation, where we improve the current results on VATEX.
arXiv Detail & Related papers (2021-10-21T17:31:50Z)
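The weakly supervised sequential-video entry above mentions aggregating frame-level features with a transformer and encoding texts with a pre-trained text encoder. The sketch below is a hypothetical illustration of that kind of setup, not the authors' implementation; the layer sizes, the mean pooling over frame tokens, and the random stand-in features are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VideoFrameAggregator(nn.Module):
        # Aggregate frame-level features into one video embedding with a
        # transformer encoder (illustrative configuration).
        def __init__(self, dim=512, heads=8, layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

        def forward(self, frame_feats):             # (batch, n_frames, dim)
            tokens = self.encoder(frame_feats)      # contextualized frame tokens
            return tokens.mean(dim=1)               # (batch, dim) video embedding

    def match_score(video_emb, text_emb):
        # Cosine similarity for text-to-video matching.
        return F.cosine_similarity(video_emb, text_emb, dim=-1)

    # Random features stand in for real frame features and for text
    # embeddings produced by a pre-trained text encoder.
    frames = torch.randn(4, 12, 512)    # 4 clips, 12 frames each
    texts = torch.randn(4, 512)
    scores = match_score(VideoFrameAggregator()(frames), texts)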
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.