Are All Combinations Equal? Combining Textual and Visual Features with
Multiple Space Learning for Text-Based Video Retrieval
- URL: http://arxiv.org/abs/2211.11351v1
- Date: Mon, 21 Nov 2022 11:08:13 GMT
- Title: Are All Combinations Equal? Combining Textual and Visual Features with
Multiple Space Learning for Text-Based Video Retrieval
- Authors: Damianos Galanopoulos, Vasileios Mezaris
- Abstract summary: We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, our proposed network architecture is trained by following a multiple space learning procedure.
- Score: 9.537322316673617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we tackle the cross-modal video retrieval problem and, more
specifically, we focus on text-to-video retrieval. We investigate how to
optimally combine multiple diverse textual and visual features into feature
pairs that lead to generating multiple joint feature spaces, which encode
text-video pairs into comparable representations. To learn these
representations, our proposed network architecture is trained by following a
multiple space learning procedure. Moreover, at the retrieval stage, we
introduce additional softmax operations for revising the inferred query-video
similarities. Extensive experiments in several setups based on three
large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to
best combine text-visual features and document the performance of the proposed
network. Source code is made publicly available at:
https://github.com/bmezaris/TextToVideoRetrieval-TtimesV
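As a rough, purely illustrative sketch of the retrieval-stage combination described above (a hypothetical NumPy example, not the authors' implementation; see the linked repository for the actual code), the snippet below assumes that each joint space yields a cosine query-video similarity matrix, that the softmax revision rescales each matrix along the video axis with an assumed temperature, and that the revised per-space similarities are summed to rank videos:

import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def combined_similarity(query_embs, video_embs, temperature=0.01):
    # query_embs / video_embs: lists with one (num_queries, d_k) and one
    # (num_videos, d_k) array per joint space; the pairing, the softmax axis,
    # the temperature and the final summation are assumptions for illustration.
    total = 0.0
    for q, v in zip(query_embs, video_embs):
        q = q / np.linalg.norm(q, axis=1, keepdims=True)   # L2-normalize queries
        v = v / np.linalg.norm(v, axis=1, keepdims=True)   # L2-normalize videos
        sim = q @ v.T                                      # cosine similarities
        sim = softmax(sim / temperature, axis=1)           # assumed softmax revision over videos
        total = total + sim
    return total  # rank videos per query by descending combined score

The function names, the temperature value, and the simple summation over spaces are illustrative choices; the exact revision and fusion scheme is the one documented in the paper and repository.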
Related papers
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z) - Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in terms of the recall@K=1 score.
arXiv Detail & Related papers (2024-03-25T17:59:03Z) - A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce TextVR, a large-scale cross-modal video retrieval dataset with text reading comprehension.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z) - Video-Text Retrieval by Supervised Sparse Multi-Grained Learning [22.17732989393653]
We present a novel multi-grained sparse learning framework, S3MA, to learn a sparse space shared between the video and the text for video-text retrieval.
With the text data at hand, we learn and update the shared sparse space in a supervised manner using the proposed similarity and alignment losses.
Benefiting from the learned shared sparse space and multi-grained similarities, S3MA outperforms existing methods on several video-text retrieval benchmarks.
arXiv Detail & Related papers (2023-02-19T04:03:22Z) - Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval [125.55386778388818]
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web.
We propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video.
Our method outperforms state-of-the-art methods on four public video retrieval datasets.
arXiv Detail & Related papers (2022-09-27T11:13:48Z) - Bi-Calibration Networks for Weakly-Supervised Video Representation Learning [153.54638582696128]
We introduce a new design of mutual calibration between query and text to boost weakly-supervised video representation learning.
We present Bi-Calibration Networks (BCN), which couple two calibrations to learn the amendment from text to query and vice versa.
BCN learnt on 3M web videos obtains superior results under the linear model protocol on downstream tasks.
arXiv Detail & Related papers (2022-06-21T16:02:12Z) - Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z) - Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into common spaces for measuring semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z) - Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is the task of retrieving a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with that of baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)