Reading-strategy Inspired Visual Representation Learning for
Text-to-Video Retrieval
- URL: http://arxiv.org/abs/2201.09168v1
- Date: Sun, 23 Jan 2022 03:38:37 GMT
- Title: Reading-strategy Inspired Visual Representation Learning for
Text-to-Video Retrieval
- Authors: Jianfeng Dong, Yabing Wang, Xianke Chen, Xiaoye Qu, Xirong Li, Yuan
He, Xun Wang
- Abstract summary: Cross-modal representation learning projects both videos and sentences into common spaces for semantic similarity computation.
Inspired by the reading strategy of humans, we propose Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state of the art on TGIF and VATEX.
- Score: 41.420760047617506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses the task of text-to-video retrieval, where,
given a query in the form of a natural-language sentence, the goal is to
retrieve videos that are semantically relevant to the query from a large
collection of unlabeled videos. The success of this task depends on cross-modal
representation learning that projects both videos and sentences into common
spaces for semantic similarity computation. In this work, we concentrate on
video representation learning, an essential component for text-to-video
retrieval. Inspired by the reading strategy of humans, we propose
Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent
videos, which consists of two branches: a previewing branch and an
intensive-reading branch. The previewing branch is designed to briefly capture
the overview information of videos, while the intensive-reading branch is
designed to obtain more in-depth information. Moreover, the intensive-reading
branch is aware of the video overview captured by the previewing branch. Such
holistic information is found to be useful for the intensive-reading branch to
extract more fine-grained features. We conduct extensive experiments on three
datasets, where our model RIVRL achieves a new state of the art on TGIF and
VATEX. Moreover, on MSR-VTT, our model using two video features shows
comparable performance to the state-of-the-art using seven video features and
even outperforms models pre-trained on the large-scale HowTo100M dataset.
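To make the two-branch design described in the abstract concrete, below is a minimal PyTorch-style sketch of how a previewing branch and a preview-aware intensive-reading branch could be wired together. All specifics (mean-pooled projection for previewing, a GRU plus a fusion layer for intensive reading, the feature and embedding sizes) are illustrative assumptions, not the authors' actual RIVRL implementation.
```python
# Sketch of a two-branch video encoder in the spirit of RIVRL.
# Module choices and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class TwoBranchVideoEncoder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        # Previewing branch: a cheap global summary of the whole clip.
        self.preview_proj = nn.Linear(feat_dim, embed_dim)
        # Intensive-reading branch: sequence model over frame features.
        self.reader = nn.GRU(feat_dim, embed_dim, batch_first=True)
        # Fuse each frame state with the preview (overview-aware reading).
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) pre-extracted frame features
        preview = self.preview_proj(frame_feats.mean(dim=1))       # (B, D)
        states, _ = self.reader(frame_feats)                       # (B, T, D)
        preview_tiled = preview.unsqueeze(1).expand_as(states)     # (B, T, D)
        fused = torch.tanh(self.fuse(torch.cat([states, preview_tiled], dim=-1)))
        intensive = fused.mean(dim=1)                              # (B, D)
        # Concatenate both views as the final video representation.
        return torch.cat([preview, intensive], dim=-1)             # (B, 2D)

# Usage: encode a batch of 4 clips, each with 32 frame-level features.
if __name__ == "__main__":
    encoder = TwoBranchVideoEncoder()
    video = torch.randn(4, 32, 2048)
    print(encoder(video).shape)  # torch.Size([4, 1024])
```
The key design point the sketch tries to capture is that the intensive-reading branch does not operate in isolation: each frame-level state is fused with the preview vector, so fine-grained feature extraction is conditioned on the clip's overview.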
Related papers
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in recall@K=1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z) - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z) - Deep Learning for Video-Text Retrieval: a Review [13.341694455581363]
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence.
In this survey, we review and summarize over 100 research papers related to VTR.
arXiv Detail & Related papers (2023-02-24T10:14:35Z) - Are All Combinations Equal? Combining Textual and Visual Features with
Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, our proposed network architecture is trained following a multiple-space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z) - Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z) - Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which takes into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - Straight to the Point: Fast-forwarding Videos via Reinforcement Learning
Using Textual Data [1.004766879203303]
We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively select frames that are not relevant to conveying the information, without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
arXiv Detail & Related papers (2020-03-31T14:07:45Z)