Text-Adaptive Multiple Visual Prototype Matching for Video-Text
Retrieval
- URL: http://arxiv.org/abs/2209.13307v1
- Date: Tue, 27 Sep 2022 11:13:48 GMT
- Title: Text-Adaptive Multiple Visual Prototype Matching for Video-Text
Retrieval
- Authors: Chengzhi Lin, Ancong Wu, Junwei Liang, Jun Zhang, Wenhang Ge, Wei-Shi
Zheng, Chunhua Shen
- Abstract summary: Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web.
We propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video.
Our method outperforms state-of-the-art methods on four public video retrieval datasets.
- Score: 125.55386778388818
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Cross-modal retrieval between videos and texts has gained increasing research
interest due to the rapid emergence of videos on the web. Generally, a video
contains rich instance and event information, while a query text typically
describes only part of that information. Thus, a video can correspond to
multiple different text descriptions and queries. We call this phenomenon the
"Video-Text Correspondence Ambiguity" problem. Current techniques mostly
concentrate on mining local or multi-level alignment between the contents of a
video and a text (e.g., object to entity and action to verb). These methods
describe a video with a single feature that must be matched simultaneously with
multiple different text features, which makes it difficult to alleviate the
correspondence ambiguity. To address this problem, we
propose a Text-Adaptive Multiple Visual Prototype Matching model, which
automatically captures multiple prototypes to describe a video by adaptive
aggregation of video token features. Given a query text, the video-text
similarity is determined by the prototype most similar to that text, which we
term text-adaptive matching. To learn diverse prototypes for
representing the rich information in videos, we propose a variance loss to
encourage different prototypes to attend to different contents of the video.
Our method outperforms state-of-the-art methods on four public video retrieval
datasets.
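
The abstract describes two mechanisms: videos are summarized into several prototypes by adaptively aggregating token features, a query text is scored against its most similar prototype, and a variance loss keeps the prototypes diverse. The sketch below is a minimal PyTorch illustration of these ideas, not the authors' implementation; the query-based aggregation, tensor shapes, and the exact variance-loss formulation are assumptions.

```python
import torch
import torch.nn.functional as F


def aggregate_prototypes(video_tokens: torch.Tensor, prototype_queries: torch.Tensor) -> torch.Tensor:
    """Aggregate per-token video features into K prototypes per video.

    video_tokens:      (B, N, D) frame/patch token features
    prototype_queries: (K, D) learnable queries (assumed aggregation mechanism)
    returns:           (B, K, D) prototypes
    """
    scale = video_tokens.size(-1) ** 0.5
    # (K, D) x (B, D, N) -> (B, K, N): attention of each prototype over the tokens
    attn = torch.softmax(prototype_queries @ video_tokens.transpose(1, 2) / scale, dim=-1)
    return attn @ video_tokens  # (B, K, D)


def text_adaptive_similarity(prototypes: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Score each (video, text) pair by its most similar prototype (text-adaptive matching)."""
    p = F.normalize(prototypes, dim=-1)      # (B, K, D)
    t = F.normalize(text_feats, dim=-1)      # (C, D)
    sim = torch.einsum("bkd,cd->bck", p, t)  # cosine similarity per prototype
    return sim.max(dim=-1).values            # (B, C) video-text similarity matrix


def variance_loss(prototypes: torch.Tensor) -> torch.Tensor:
    """Push a video's prototypes apart by rewarding variance across them
    (assumed formulation; the paper's exact loss may differ)."""
    return -prototypes.var(dim=1).mean()
```

Taking the maximum over prototypes lets one video match several distinct text queries, which is how the model targets the correspondence ambiguity described above.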
Related papers
- Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
The main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit.
We formulate chunk-level matching as n-ary correlation modeling between words of the query and frames of the video.
arXiv Detail & Related papers (2024-01-06T09:38:55Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated and unpaired data, which during training uses only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide which textual semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, our proposed network architecture is trained following a multiple-space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- Bi-Calibration Networks for Weakly-Supervised Video Representation Learning [153.54638582696128]
We introduce a new design of mutual calibration between query and text to boost weakly-supervised video representation learning.
We present Bi-Calibration Networks (BCN), which couple two calibrations to learn the amendment from text to query and vice versa.
BCN learnt on 3M web videos obtains superior results under the linear model protocol on downstream tasks.
arXiv Detail & Related papers (2022-06-21T16:02:12Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)