AssistSR: Affordance-centric Question-driven Video Segment Retrieval
- URL: http://arxiv.org/abs/2111.15050v1
- Date: Tue, 30 Nov 2021 01:14:10 GMT
- Title: AssistSR: Affordance-centric Question-driven Video Segment Retrieval
- Authors: Stan Weixian Lei, Yuxuan Wang, Dongxing Mao, Difei Gao, Mike Zheng Shou
- Abstract summary: We present a new task called Affordance-centric Question-driven Video Segment Retrieval (AQVSR).
- Score: 4.047098915826058
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: It is still a pipe dream that AI assistants on phones and AR glasses can
assist our daily life by answering questions like "how to adjust the date
for this watch?" and "how to set its heating duration?" (while pointing at an
oven). The queries used in conventional tasks (e.g. Video Question Answering,
Video Retrieval, Moment Localization) are often factoid and based on pure text.
In contrast, we present a new task called Affordance-centric Question-driven
Video Segment Retrieval (AQVSR). Each of our questions is an image-box-text
query that focuses on affordance of items in our daily life and expects
relevant answer segments to be retrieved from a corpus of instructional
video-transcript segments. To support the study of this AQVSR task, we
construct a new dataset called AssistSR. We design novel guidelines to create
high-quality samples. This dataset contains 1.4k multimodal questions on 1k
video segments from instructional videos on diverse daily-used items. To
address AQVSR, we develop a straightforward yet effective model called Dual
Multimodal Encoders (DME) that significantly outperforms several baseline
methods while still leaving large room for future improvement. Moreover, we
present detailed ablation analyses. Our code and data are available at
https://github.com/StanLei52/AQVSR.
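As an illustration only (not the authors' released implementation), the dual-encoder idea behind DME can be sketched as two independent towers: one embeds the image-box-text query, the other embeds each video-transcript segment, and segments are ranked by cosine similarity. The feature dimensions, additive fusion, and module names below are assumptions made for this sketch.

```python
# Minimal dual-encoder retrieval sketch in the spirit of DME (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryEncoder(nn.Module):
    """Fuses the textual question with the image-box (region) feature."""
    def __init__(self, text_dim=768, region_dim=2048, embed_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.region_proj = nn.Linear(region_dim, embed_dim)

    def forward(self, text_feat, region_feat):
        # Simple additive fusion of the two projected modalities (an assumption).
        q = self.text_proj(text_feat) + self.region_proj(region_feat)
        return F.normalize(q, dim=-1)

class SegmentEncoder(nn.Module):
    """Fuses frame-level video features with transcript features."""
    def __init__(self, video_dim=1024, transcript_dim=768, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.transcript_proj = nn.Linear(transcript_dim, embed_dim)

    def forward(self, video_feat, transcript_feat):
        # video_feat: (num_segments, num_frames, video_dim); mean-pool over frames.
        v = self.video_proj(video_feat.mean(dim=1))
        t = self.transcript_proj(transcript_feat)
        return F.normalize(v + t, dim=-1)

def rank_segments(query_emb, segment_embs):
    """Return segment indices sorted by cosine similarity to the query."""
    scores = segment_embs @ query_emb  # (num_segments,)
    return torch.argsort(scores, descending=True), scores

if __name__ == "__main__":
    q_enc, s_enc = QueryEncoder(), SegmentEncoder()
    # Dummy pre-extracted features standing in for a real corpus.
    query = q_enc(torch.randn(768), torch.randn(2048))
    segments = s_enc(torch.randn(100, 16, 1024), torch.randn(100, 768))
    order, _ = rank_segments(query, segments)
    print("top-5 segment ids:", order[:5].tolist())
```

In a full system the linear projections would be replaced by pretrained vision and language encoders trained with a contrastive ranking loss; the point of the sketch is only the separation into a query tower and a segment tower.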
Related papers
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework based on synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos via cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z) - QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form natural-language query, (2) relevant moments in the video with respect to the query, and (3) five-point-scale saliency scores for all query-relevant clips (a sketch of such a record appears after this list).
arXiv Detail & Related papers (2021-07-20T16:42:58Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross integration for different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z) - TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval [111.93601253692165]
TV show Retrieval (TVR) is a new multimodal retrieval dataset.
TVR requires systems to understand both videos and their associated subtitle (dialogue) texts.
The dataset contains 109K queries collected on 21.8K videos from 6 TV shows of diverse genres.
arXiv Detail & Related papers (2020-01-24T17:09:39Z)
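As referenced above in the QVHighlights entry, its annotations pair a query with relevant moments and clip-level saliency scores; the record type below is a hypothetical illustration of that structure, not the dataset's released schema.

```python
# Hypothetical record mirroring the QVHighlights annotation described above.
# Field names and types are illustrative, not the dataset's actual JSON format.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class HighlightAnnotation:
    video_id: str
    query: str                          # human-written free-form NL query
    moments: List[Tuple[float, float]]  # (start_sec, end_sec) spans relevant to the query
    saliency_scores: List[int]          # five-point-scale score for each query-relevant clip

example = HighlightAnnotation(
    video_id="demo_video",
    query="a person shows how to replace a laptop battery",
    moments=[(12.0, 34.0)],
    saliency_scores=[4, 5, 3],
)
```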