A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension
- URL: http://arxiv.org/abs/2305.03347v1
- Date: Fri, 5 May 2023 08:00:14 GMT
- Title: A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension
- Authors: Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, Mike Zheng Shou, and Xiang Bai
- Abstract summary: We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
- Score: 49.74647080936875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing cross-modal language-to-video retrieval (VR) research
focuses on a single input modality from video, i.e., the visual representation,
even though text is omnipresent in human environments and frequently critical
to understanding video. To study how to retrieve video with both input
modalities, i.e., visual and text semantic representations, we first introduce
a large-scale and cross-modal
Video Retrieval dataset with text reading comprehension, TextVR, which contains
42.2k sentence queries for 10.5k videos of 8 scenario domains, i.e., Street
View (indoor), Street View (outdoor), Games, Sports, Driving, Activity, TV
Show, and Cooking. The proposed TextVR requires one unified cross-modal model
to recognize and comprehend texts, relate them to the visual context, and
decide what text semantic information is vital for the video retrieval task.
In addition, we present a detailed analysis of TextVR in comparison with
existing datasets and design a novel multimodal video retrieval baseline for the
text-based video retrieval task. The dataset analysis and extensive experiments
show that our TextVR benchmark provides many new technical challenges and
insights beyond those of previous datasets for the video-and-language community. The
project website and GitHub repo can be found at
https://sites.google.com/view/loveucvpr23/guest-track and
https://github.com/callsys/TextVR, respectively.
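
To make the task concrete, the sketch below shows one illustrative way to rank videos for a language query when both a visual embedding and an embedding of OCR'd scene text are available for each video. This is a minimal sketch under assumed pre-computed embeddings, not the paper's actual baseline; the function names, the fusion weight alpha, and the toy data are all hypothetical placeholders.

# Minimal sketch (not the paper's baseline): ranking videos for a text query
# by fusing a visual embedding with an embedding of OCR'd scene text.
# Encoders are assumed to exist upstream; only the fusion/ranking is shown.
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def fuse_video_embedding(visual_emb: np.ndarray,
                         ocr_text_emb: np.ndarray,
                         alpha: float = 0.5) -> np.ndarray:
    """Weighted fusion of visual and scene-text embeddings (hypothetical)."""
    return l2_normalize(alpha * l2_normalize(visual_emb)
                        + (1.0 - alpha) * l2_normalize(ocr_text_emb))

def rank_videos(query_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return video indices sorted by cosine similarity to the query."""
    sims = l2_normalize(video_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)

# Toy usage: 3 videos in a shared 4-d embedding space.
rng = np.random.default_rng(0)
visual = rng.normal(size=(3, 4))
ocr = rng.normal(size=(3, 4))
videos = fuse_video_embedding(visual, ocr)
query = rng.normal(size=4)
print(rank_videos(query, videos))  # indices of videos, best match first

A unified model in the spirit of the abstract would learn the encoders and the fusion jointly; the fixed weight alpha here merely illustrates that the visual and scene-text signals contribute to one shared retrieval space.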
Related papers
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in Recall@1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- DSText V2: A Comprehensive Video Text Spotting Dataset for Dense and Small Text [46.177941541282756]
We establish a video text reading benchmark, named DSText V2, which focuses on Dense and Small text reading challenges in the video.
Compared with previous datasets, the proposed dataset mainly includes three new challenges.
A high proportion of small texts, coupled with blurriness and distortion in the video, brings further challenges.
arXiv Detail & Related papers (2023-11-29T09:13:27Z)
- Multi-event Video-Text Retrieval [33.470499262092105]
Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet.
We introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events.
We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task.
arXiv Detail & Related papers (2023-08-22T16:32:46Z)
- TVPR: Text-to-Video Person Retrieval and a New Benchmark [19.554989977778312]
We propose a new task called Text-to-Video Person Retrieval (TVPR).
TVPRN acquires video representations by fusing visual and motion representations of person videos.
TVPRN achieves state-of-the-art performance on the TVPReid dataset.
arXiv Detail & Related papers (2023-07-14T06:34:00Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? [131.300931102986]
In real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles.
We propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning.
We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-12-31T11:50:32Z)
- Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, our proposed network architecture is trained with a multiple-space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text (see the sketch after this list).
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
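
The contrastive-model entry above mentions aligning video and text embeddings. Below is a minimal, illustrative sketch of the symmetric InfoNCE objective commonly used for such alignment; the function names, the temperature value, and the toy data are hypothetical, and this is not the code of any paper listed here.

# Minimal sketch of the symmetric InfoNCE loss used by contrastive
# video-text models; illustrative only, not any listed paper's code.
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(video_embs: np.ndarray, text_embs: np.ndarray,
                       temperature: float = 0.07) -> float:
    """Row i of each matrix is assumed to be a matched video-text pair."""
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature          # (N, N) cosine-similarity logits
    diag = np.arange(len(logits))             # matched pairs on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage with a batch of 4 matched pairs in an 8-d space.
rng = np.random.default_rng(0)
print(symmetric_info_nce(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))

Minimizing this loss pulls each matched video-text pair together and pushes apart all mismatched pairs within the batch, which is the alignment property that fast-adaptation work builds on.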