Deep Learning for Video-Text Retrieval: a Review
- URL: http://arxiv.org/abs/2302.12552v1
- Date: Fri, 24 Feb 2023 10:14:35 GMT
- Title: Deep Learning for Video-Text Retrieval: a Review
- Authors: Cunjuan Zhu, Qi Jia, Wei Chen, Yanming Guo and Yu Liu
- Abstract summary: Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence.
In this survey, we review and summarize over 100 research papers related to VTR.
- Score: 13.341694455581363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-Text Retrieval (VTR) aims to search for the most relevant video related
to the semantics in a given sentence, and vice versa. In general, this
retrieval task is composed of four successive steps: video and textual feature
representation extraction, feature embedding and matching, and objective
functions. In the last, a list of samples retrieved from the dataset is ranked
based on their matching similarities to the query. In recent years, significant
and flourishing progress has been achieved by deep learning techniques,
however, VTR is still a challenging task due to the problems like how to learn
an efficient spatial-temporal video feature and how to narrow the cross-modal
gap. In this survey, we review and summarize over 100 research papers related
to VTR, demonstrate state-of-the-art performance on several commonly
benchmarked datasets, and discuss potential challenges and directions, with the
expectation to provide some insights for researchers in the field of video-text
retrieval.
Related papers
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z) - ViLCo-Bench: VIdeo Language COntinual learning Benchmark [8.660555226687098]
We present ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks.
The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets.
We introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects.
arXiv Detail & Related papers (2024-06-19T00:38:19Z) - Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z) - Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains as high as around 7% in terms of recall@K=1 score.
arXiv Detail & Related papers (2024-03-25T17:59:03Z) - Zero-shot Audio Topic Reranking using Large Language Models [42.774019015099704]
Multimodal Video Search by Examples (MVSE) investigates using video clips as the query term for information retrieval.
This work aims to compensate for any performance loss from this rapid archive search by examining reranking approaches.
Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus.
arXiv Detail & Related papers (2023-09-14T11:13:36Z) - A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z) - Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z) - Reading-strategy Inspired Visual Representation Learning for
Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into common spaces for semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - Bridging Vision and Language from the Video-to-Text Perspective: A
Comprehensive Review [1.0520692160489133]
This review categorizes and describes the state-of-the-art techniques for the video-to-text problem.
It covers the main video-to-text methods and the ways to evaluate their performance.
State-of-the-art techniques are still a long way from achieving human-like performance in generating or retrieving video descriptions.
arXiv Detail & Related papers (2021-03-27T02:12:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.