A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension
- URL: http://arxiv.org/abs/2305.03347v1
- Date: Fri, 5 May 2023 08:00:14 GMT
- Title: A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension
- Authors: Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, Mike Zheng Shou, and Xiang Bai
- Abstract summary: We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
- Score: 49.74647080936875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing cross-modal language-to-video retrieval (VR) research
focuses on a single input modality from video, i.e., the visual representation,
even though text is omnipresent in human environments and frequently critical
to understanding video. To study how to retrieve video with both input
modalities, i.e., visual and text semantic representations, we first introduce
a large-scale and cross-modal
Video Retrieval dataset with text reading comprehension, TextVR, which contains
42.2k sentence queries for 10.5k videos of 8 scenario domains, i.e., Street
View (indoor), Street View (outdoor), Games, Sports, Driving, Activity, TV
Show, and Cooking. The proposed TextVR requires one unified cross-modal model
to recognize and comprehend texts, relate them to the visual context, and
decide what text semantic information is vital for the video retrieval task.
In addition, we present a detailed analysis of TextVR in comparison with
existing datasets and design a novel multimodal video retrieval baseline for the
text-based video retrieval task. The dataset analysis and extensive experiments
show that our TextVR benchmark provides many new technical challenges and
insights beyond those of previous datasets for the video-and-language community. The
project website and GitHub repo can be found at
https://sites.google.com/view/loveucvpr23/guest-track and
https://github.com/callsys/TextVR, respectively.
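
To make the task concrete, the sketch below shows one illustrative way to rank videos for a language query when both a visual embedding and an embedding of OCR'd scene text are available for each video. This is a minimal sketch under assumed pre-computed embeddings, not the paper's actual baseline; the function names, the fusion weight alpha, and the toy data are all hypothetical placeholders.

# Minimal sketch (not the paper's baseline): ranking videos for a text query
# by fusing a visual embedding with an embedding of OCR'd scene text.
# Encoders are assumed to exist upstream; only the fusion/ranking is shown.
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def fuse_video_embedding(visual_emb: np.ndarray,
                         ocr_text_emb: np.ndarray,
                         alpha: float = 0.5) -> np.ndarray:
    """Weighted fusion of visual and scene-text embeddings (hypothetical)."""
    return l2_normalize(alpha * l2_normalize(visual_emb)
                        + (1.0 - alpha) * l2_normalize(ocr_text_emb))

def rank_videos(query_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return video indices sorted by cosine similarity to the query."""
    sims = l2_normalize(video_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)

# Toy usage: 3 videos in a shared 4-d embedding space.
rng = np.random.default_rng(0)
visual = rng.normal(size=(3, 4))
ocr = rng.normal(size=(3, 4))
videos = fuse_video_embedding(visual, ocr)
query = rng.normal(size=4)
print(rank_videos(query, videos))  # indices of videos, best match first

A unified model in the spirit of the abstract would learn the encoders and the fusion jointly; the fixed weight alpha here merely illustrates that the visual and scene-text signals contribute to one shared retrieval space.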
Related papers
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in Recall@1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- DSText V2: A Comprehensive Video Text Spotting Dataset for Dense and Small Text [46.177941541282756]
We establish a video text reading benchmark, named DSText V2, which focuses on Dense and Small text reading challenges in the video.
Compared with previous datasets, the proposed dataset mainly includes three new challenges.
A high proportion of small texts, coupled with blurriness and distortion in the video, brings further challenges.
arXiv Detail & Related papers (2023-11-29T09:13:27Z)
- Multi-event Video-Text Retrieval [33.470499262092105]
Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet.
We introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events.
We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task.
arXiv Detail & Related papers (2023-08-22T16:32:46Z)
- TVPR: Text-to-Video Person Retrieval and a New Benchmark [19.554989977778312]
We propose a new task called Text-to-Video Person Retrieval (TVPR).
TVPRN acquires video representations by fusing visual and motion representations of person videos.
TVPRN achieves state-of-the-art performance on the TVPReid dataset.
arXiv Detail & Related papers (2023-07-14T06:34:00Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? [131.300931102986]
In real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles.
We propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning.
We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-12-31T11:50:32Z)
- Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, our proposed network architecture is trained with a multiple-space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text (see the sketch after this list).
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
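
The contrastive-model entry above mentions aligning video and text embeddings. Below is a minimal, illustrative sketch of the symmetric InfoNCE objective commonly used for such alignment; the function names, the temperature value, and the toy data are hypothetical, and this is not the code of any paper listed here.

# Minimal sketch of the symmetric InfoNCE loss used by contrastive
# video-text models; illustrative only, not any listed paper's code.
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(video_embs: np.ndarray, text_embs: np.ndarray,
                       temperature: float = 0.07) -> float:
    """Row i of each matrix is assumed to be a matched video-text pair."""
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature          # (N, N) cosine-similarity logits
    diag = np.arange(len(logits))             # matched pairs on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage with a batch of 4 matched pairs in an 8-d space.
rng = np.random.default_rng(0)
print(symmetric_info_nce(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))

Minimizing this loss pulls each matched video-text pair together and pushes apart all mismatched pairs within the batch, which is the alignment property that fast-adaptation work builds on.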