QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
- URL: http://arxiv.org/abs/2107.09609v1
- Date: Tue, 20 Jul 2021 16:42:58 GMT
- Title: QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
- Authors: Jie Lei, Tamara L. Berg, Mohit Bansal
- Abstract summary: We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
- Score: 89.24431389933703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting customized moments and highlights from videos given natural
language (NL) user queries is an important but under-studied topic. One of the
challenges in pursuing this direction is the lack of annotated data. To address
this issue, we present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics,
from everyday activities and travel in lifestyle vlog videos to social and
political activities in news videos. Each video in the dataset is annotated
with: (1) a human-written free-form NL query, (2) relevant moments in the video
w.r.t. the query, and (3) five-point scale saliency scores for all
query-relevant clips. This comprehensive annotation enables us to develop and
evaluate systems that detect relevant moments as well as salient highlights for
diverse, flexible user queries. We also present a strong baseline for this
task, Moment-DETR, a transformer encoder-decoder model that views moment
retrieval as a direct set prediction problem, taking extracted video and query
representations as inputs and predicting moment coordinates and saliency scores
end-to-end. While our model does not utilize any human prior, we show that it
performs competitively when compared to well-engineered architectures. With
weakly supervised pretraining using ASR captions, Moment-DETR substantially
outperforms previous methods. Lastly, we present several ablations and
visualizations of Moment-DETR. Data and code are publicly available at
https://github.com/jayleicn/moment_detr
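As a rough illustration of the architecture described above, the sketch below shows how a DETR-style encoder-decoder can take pre-extracted video clip and query token features and output a fixed-size set of moment spans plus clip-wise saliency scores. This is a minimal sketch, not the released Moment-DETR implementation: the class and head names, feature dimensions, and layer counts are assumptions, and positional encodings and the set-prediction (Hungarian matching) loss are omitted.

```python
import torch
import torch.nn as nn

class MomentDETRSketch(nn.Module):
    """Minimal sketch of a DETR-style moment retrieval / highlight detection model.
    Not the released Moment-DETR code; dimensions and names are illustrative."""

    def __init__(self, video_dim=2304, text_dim=512, d_model=256, num_moment_queries=10):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)  # pre-extracted clip features
        self.query_proj = nn.Linear(text_dim, d_model)   # pre-extracted query token features
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        # Learned "moment queries" are decoded into a fixed-size set of predictions.
        self.moment_queries = nn.Embedding(num_moment_queries, d_model)
        self.span_head = nn.Linear(d_model, 2)      # (center, width), normalized to [0, 1]
        self.class_head = nn.Linear(d_model, 2)     # foreground vs. background moment
        self.saliency_head = nn.Linear(d_model, 1)  # clip-wise saliency score

    def forward(self, video_feats, query_feats):
        # video_feats: (B, num_clips, video_dim); query_feats: (B, num_tokens, text_dim)
        src = torch.cat([self.video_proj(video_feats),
                         self.query_proj(query_feats)], dim=1)
        memory = self.transformer.encoder(src)      # joint video + query context
        tgt = self.moment_queries.weight.unsqueeze(0).expand(src.size(0), -1, -1)
        hs = self.transformer.decoder(tgt, memory)  # one slot per candidate moment
        num_clips = video_feats.size(1)
        return {
            "spans": self.span_head(hs).sigmoid(),  # (B, num_queries, 2)
            "logits": self.class_head(hs),          # (B, num_queries, 2)
            "saliency": self.saliency_head(memory[:, :num_clips]).squeeze(-1),  # (B, num_clips)
        }
```

In such a setup, predicted spans would be ranked by their foreground scores for moment retrieval, while the clip-level saliency output would serve highlight detection; the actual model and training code are in the linked repository.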
Related papers
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework that uses synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z) - DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New
Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos via cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Background-aware Moment Detection for Video Moment Retrieval [19.11524416308641]
Video moment retrieval (VMR) identifies a specific moment in an untrimmed video for a given natural language query.
Due to the ambiguity of natural language, a query may not fully cover the relevant details of the corresponding moment.
We propose a background-aware moment detection transformer (BM-DETR).
Our model learns to predict the target moment from the joint probability of each frame given the positive query and the complement of negative queries.
arXiv Detail & Related papers (2023-06-05T09:26:33Z) - Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel tasks: moment segmentation and step captioning.
arXiv Detail & Related papers (2023-03-29T02:33:54Z) - Query-Dependent Video Representation for Moment Retrieval and Highlight
Detection [8.74967598360817]
The key objective of MR/HD is to localize the moment and estimate the clip-wise accordance level, i.e., the saliency score, with respect to a given text query.
Recent transformer-based models do not fully exploit the information of a given query.
We introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD.
arXiv Detail & Related papers (2023-03-24T09:32:50Z) - AssistSR: Affordance-centric Question-driven Video Segment Retrieval [4.047098915826058]
We present a new task called Affordance-centric Question-driven Video Segment Retrieval (AQVSR).
arXiv Detail & Related papers (2021-11-30T01:14:10Z) - Uncovering Hidden Challenges in Query-Based Video Moment Retrieval [29.90001703587512]
We present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task.
Our results indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models.
We suggest possible directions for improving temporal sentence grounding in the future.
arXiv Detail & Related papers (2020-09-01T10:07:23Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.