Multi-event Video-Text Retrieval
- URL: http://arxiv.org/abs/2308.11551v2
- Date: Mon, 25 Sep 2023 13:04:22 GMT
- Title: Multi-event Video-Text Retrieval
- Authors: Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp
- Abstract summary: Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet.
We introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events.
We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task.
- Score: 33.470499262092105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive
video-text data on the Internet. A plethora of work characterized by using a
two-stream Vision-Language model architecture that learns a joint
representation of video-text pairs has become a prominent approach for the VTR
task. However, these models operate under the assumption of bijective
video-text correspondences and neglect a more practical scenario where video
content usually encompasses multiple events, while texts like user queries or
webpage metadata tend to be specific and correspond to single events. This
establishes a gap between the previous training objective and real-world
applications, leading to the potential performance degradation of earlier
models during inference. In this study, we introduce the Multi-event Video-Text
Retrieval (MeVTR) task, addressing scenarios in which each video contains
multiple different events, as a niche scenario of the conventional Video-Text
Retrieval Task. We present a simple model, Me-Retriever, which incorporates key
event video representation and a new MeVTR loss for the MeVTR task.
Comprehensive experiments show that this straightforward framework outperforms
other models in the Video-to-Text and Text-to-Video tasks, effectively
establishing a robust baseline for the MeVTR task. We believe this work serves
as a strong foundation for future studies. Code is available at
https://github.com/gengyuanmax/MeVTR.
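As a rough, hypothetical illustration of the multi-event retrieval setting described above (not the official Me-Retriever implementation, which is in the linked repository), the Python sketch below represents a video as a small set of event-level embeddings and scores each text query by its best-matching event. The chunk-and-mean-pool "key event" pooling and the max-over-events scoring rule are illustrative assumptions, and the encoders are stubbed with random features.

```python
# Hedged sketch of multi-event video-text retrieval scoring.
# Not the paper's method: pooling and scoring choices are assumptions for illustration.
import torch
import torch.nn.functional as F

def encode_video_events(frame_features: torch.Tensor, num_events: int) -> torch.Tensor:
    """Toy 'key event' representation: split the frame sequence into contiguous
    chunks and mean-pool each chunk into one event embedding.
    frame_features: (num_frames, dim) -> (num_events, dim)."""
    chunks = torch.chunk(frame_features, num_events, dim=0)
    return torch.stack([c.mean(dim=0) for c in chunks])

def retrieval_scores(event_embs: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Score each text query against a multi-event video by the maximum cosine
    similarity over the video's event embeddings (an assumed scoring rule).
    event_embs: (num_events, dim), text_embs: (num_texts, dim) -> (num_texts,)."""
    e = F.normalize(event_embs, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    sim = t @ e.T                  # (num_texts, num_events)
    return sim.max(dim=-1).values  # best-matching event per query

if __name__ == "__main__":
    dim, num_frames, num_events, num_texts = 512, 32, 4, 3
    # Stand-ins for the outputs of a frame encoder and a text encoder.
    frame_features = torch.randn(num_frames, dim)
    text_embs = torch.randn(num_texts, dim)
    event_embs = encode_video_events(frame_features, num_events)
    print(retrieval_scores(event_embs, text_embs))  # one score per text query
```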
Related papers
- TRACE: Temporal Grounding Video LLM via Causal Event Modeling [6.596327795743185]
Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing.
Current video LLMs rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos.
This paper introduces a causal event modeling framework that represents videos as sequences of events and predicts the current event from previous events, video inputs, and textual instructions.
We propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice.
arXiv Detail & Related papers (2024-10-08T02:46:30Z) - EA-VTR: Event-Aware Video-Text Retrieval [97.30850809266725]
The Event-Aware Video-Text Retrieval (EA-VTR) model achieves strong video-text retrieval through superior video event awareness.
EA-VTR efficiently encodes frame-level and video-level visual representations simultaneously, enabling cross-modal alignment of detailed event content and complex event temporal structure.
arXiv Detail & Related papers (2024-07-10T09:09:58Z) - Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to roughly 7% in Recall@1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z) - Analyzing Zero-Shot Abilities of Vision-Language Models on Video
Understanding Tasks [6.925770576386087]
We propose a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting.
Our experiments show that image-text models exhibit impressive zero-shot performance on video action recognition (AR), video retrieval (RT), and video multiple-choice (MC) tasks.
These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
arXiv Detail & Related papers (2023-10-07T20:57:54Z) - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z) - A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z) - Deep Learning for Video-Text Retrieval: a Review [13.341694455581363]
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence.
In this survey, we review and summarize over 100 research papers related to VTR.
arXiv Detail & Related papers (2023-02-24T10:14:35Z) - Reading-strategy Inspired Visual Representation Learning for
Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into a common space for measuring semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding
Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future work on advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.