TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
- URL: http://arxiv.org/abs/2001.09099v2
- Date: Tue, 18 Aug 2020 15:12:14 GMT
- Title: TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
- Authors: Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
- Abstract summary: TV show Retrieval (TVR) is a new multimodal retrieval dataset.
TVR requires systems to understand both videos and their associated subtitle (dialogue) texts.
The dataset contains 109K queries collected on 21.8K videos from 6 TV shows of diverse genres.
- Score: 111.93601253692165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce TV show Retrieval (TVR), a new multimodal retrieval dataset. TVR
requires systems to understand both videos and their associated subtitle
(dialogue) texts, making the task more realistic. The dataset contains 109K queries
collected on 21.8K videos from 6 TV shows of diverse genres, where each query
is associated with a tight temporal window. The queries are also labeled with
query types that indicate whether each is more related to the video, the
subtitle, or both, allowing for in-depth analysis of the dataset and of the
methods built on top of it. Strict qualification and post-annotation verification
tests are applied to ensure the quality of the collected data. Further, we
present several baselines and a novel Cross-modal Moment Localization (XML)
network for multimodal moment retrieval tasks. The proposed XML model uses a
late fusion design with a novel Convolutional Start-End detector (ConvSE),
surpassing baselines by a large margin and with better efficiency, providing a
strong starting point for future work. We have also collected additional
descriptions for each annotated moment in TVR to form a new multimodal
captioning dataset with 262K captions, named TV show Caption (TVC). Both
datasets are publicly available. TVR: https://tvr.cs.unc.edu, TVC:
https://tvr.cs.unc.edu/tvc.html.
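The abstract describes XML as a late-fusion design topped by a Convolutional Start-End (ConvSE) detector. Below is a minimal PyTorch sketch of that general idea, assuming fused per-clip query-similarity scores as input: 1D convolutions produce start and end scores, and valid spans are ranked by the product of boundary probabilities. The kernel size, span-length cap, and span-scoring rule are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSE(nn.Module):
    """Convolutional Start-End detector sketch: 1D filters slide over
    per-clip query-similarity scores and respond to rising/falling
    edges, yielding start and end scores for moment localization."""
    def __init__(self, kernel_size: int = 5):
        super().__init__()
        pad = kernel_size // 2
        self.start_conv = nn.Conv1d(1, 1, kernel_size, padding=pad)
        self.end_conv = nn.Conv1d(1, 1, kernel_size, padding=pad)

    def forward(self, sim: torch.Tensor):
        # sim: (batch, num_clips) fused query-video/subtitle similarities
        x = sim.unsqueeze(1)  # (batch, 1, num_clips)
        return (self.start_conv(x).squeeze(1),
                self.end_conv(x).squeeze(1))

def top_moment(start_scores, end_scores, max_len: int = 16):
    """Score each valid span (start <= end, length capped) as
    p_start * p_end and return the best span per batch element."""
    p_s = F.softmax(start_scores, dim=-1)       # (batch, T)
    p_e = F.softmax(end_scores, dim=-1)
    span = p_s.unsqueeze(2) * p_e.unsqueeze(1)  # (batch, T, T)
    span = torch.triu(span)                     # enforce start <= end
    span = span - torch.triu(span, diagonal=max_len)  # cap span length
    T = span.size(1)
    best = span.flatten(1).argmax(dim=1)
    return torch.div(best, T, rounding_mode="floor"), best % T

# Usage: 2 videos, 32 clips each, random similarity scores
detector = ConvSE()
s, e = detector(torch.randn(2, 32))
print(top_moment(s, e))  # (start indices, end indices)
```

The late-fusion property is what makes this style efficient for retrieval: video and subtitle representations can be precomputed per clip, so only the lightweight similarity-plus-convolution step runs per query.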
Related papers
- OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos [58.5538620720541]
The dataset, OVR, contains annotations for over 72K videos.
OVR is almost an order of magnitude larger than previous datasets for video repetition.
We propose a baseline transformer-based counting model, OVRCounter, that can count repetitions in videos up to 320 frames long.
arXiv Detail & Related papers (2024-07-24T08:22:49Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel tasks: moment segmentation and step captioning.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip and the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, our proposed network architecture is trained following a multiple space learning procedure (a minimal sketch appears after this list).
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- Partially Relevant Video Retrieval [39.747235541498135]
We propose a novel text-to-video retrieval (T2VR) subtask termed Partially Relevant Video Retrieval (PRVR).
PRVR aims to retrieve partially relevant videos from a large collection of untrimmed videos.
We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames (a minimal sketch appears after this list).
arXiv Detail & Related papers (2022-08-26T09:07:16Z)
- AssistSR: Affordance-centric Question-driven Video Segment Retrieval [4.047098915826058]
We present a new task called Affordance-centric Question-driven Video Segment Retrieval (AQVSR).
arXiv Detail & Related papers (2021-11-30T01:14:10Z)
- MTVR: Multilingual Moment Retrieval in Videos [89.24431389933703]
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips.
The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese queries and subtitles.
We propose mXML, a multilingual moment retrieval model that learns and operates on data from both languages.
arXiv Detail & Related papers (2021-07-30T20:01:03Z)
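Two of the entries above describe mechanisms concretely enough to sketch. First, for "Are All Combinations Equal?": a minimal sketch of a multiple space learning setup, assuming each (textual, visual) feature pair gets its own joint embedding space and per-space cosine similarities are summed into one retrieval score. Class names, dimensions, and the sum-fusion rule are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSpaceRetrieval(nn.Module):
    """Sketch: one joint embedding space per (text, video) feature
    pair; the final score sums per-space cosine similarities."""
    def __init__(self, text_dims, video_dims, joint_dim=256):
        super().__init__()
        assert len(text_dims) == len(video_dims), "features come in pairs"
        self.text_proj = nn.ModuleList(nn.Linear(d, joint_dim) for d in text_dims)
        self.video_proj = nn.ModuleList(nn.Linear(d, joint_dim) for d in video_dims)

    def forward(self, text_feats, video_feats):
        # text_feats[i]: (num_queries, text_dims[i])
        # video_feats[i]: (num_videos, video_dims[i])
        score = 0.0
        for tp, vp, t, v in zip(self.text_proj, self.video_proj,
                                text_feats, video_feats):
            t_emb = F.normalize(tp(t), dim=-1)
            v_emb = F.normalize(vp(v), dim=-1)
            score = score + t_emb @ v_emb.t()  # cosine similarity in this space
        return score                           # (num_queries, num_videos)

# Usage: two hypothetical feature pairs of different dimensionalities
model = MultiSpaceRetrieval(text_dims=[768, 512], video_dims=[2048, 512])
texts = [torch.randn(4, 768), torch.randn(4, 512)]
videos = [torch.randn(10, 2048), torch.randn(10, 512)]
print(model(texts, videos).shape)  # torch.Size([4, 10])
```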
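Second, for the Partially Relevant Video Retrieval entry: a minimal sketch of MIL-style scoring in which a video is treated as a bag of clips and a bag of frames, and its relevance to a query is driven by the best-matching instance in each bag. The cosine similarity and the weighting between the two bags are assumptions for illustration, not the paper's exact model.

```python
import torch
import torch.nn.functional as F

def prvr_score(query_emb, clip_embs, frame_embs, alpha=0.5):
    """MIL sketch: a video's relevance is the (weighted) max
    similarity over its clip bag and its frame bag."""
    q = F.normalize(query_emb, dim=-1)               # (d,)
    clip_sim = F.normalize(clip_embs, dim=-1) @ q    # (num_clips,)
    frame_sim = F.normalize(frame_embs, dim=-1) @ q  # (num_frames,)
    return alpha * clip_sim.max() + (1 - alpha) * frame_sim.max()

# Usage: rank three untrimmed videos for one query embedding
query = torch.randn(128)
videos = [(torch.randn(8, 128), torch.randn(64, 128)) for _ in range(3)]
scores = torch.stack([prvr_score(query, c, f) for c, f in videos])
print(scores.argsort(descending=True))  # best-matching video first
```

Because only the best instance in each bag drives the score, a video can rank highly even when most of it is irrelevant to the query, which is exactly the partially-relevant setting the entry describes.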
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.