NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
- URL: http://arxiv.org/abs/2301.00746v2
- Date: Sat, 25 Mar 2023 04:46:18 GMT
- Title: NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
- Authors: Santhosh Kumar Ramakrishnan, Ziad Al-Halah, Kristen Grauman
- Abstract summary: Narrations-as-Queries (NaQ) is a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model.
NaQ improves multiple top models by substantial margins (even doubling their accuracy).
We also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
- Score: 92.98552727430483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Searching long egocentric videos with natural language queries (NLQ) has
compelling applications in augmented reality and robotics, where a fluid index
into everything that a person (agent) has seen before could augment human
memory and surface relevant information on demand. However, the structured
nature of the learning problem (free-form text query inputs, localized video
temporal window outputs) and its needle-in-a-haystack nature makes it both
technically challenging and expensive to supervise. We introduce
Narrations-as-Queries (NaQ), a data augmentation strategy that transforms
standard video-text narrations into training data for a video query
localization model. Validating our idea on the Ego4D benchmark, we find it has
tremendous impact in practice. NaQ improves multiple top models by substantial
margins (even doubling their accuracy), and yields the very best results to
date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in
the CVPR and ECCV 2022 competitions and topping the current public leaderboard.
Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique
properties of our approach such as the ability to perform zero-shot and
few-shot NLQ, and improved performance on queries about long-tail object
categories. Code and models:
http://vision.cs.utexas.edu/projects/naq
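To make the idea concrete, below is a minimal sketch (not the authors' released code) of the narrations-as-queries conversion: each timestamped narration becomes a free-form query paired with a temporal window around its timestamp, which is exactly the (query, window) supervision format that NLQ models train on. The fixed `half_width` used here is a hypothetical stand-in; the paper derives the temporal window around each narration in a data-driven way.

```python
# Sketch of converting timestamped narrations into NLQ-style training samples.
# Assumes each narration is a (text, timestamp) pair; window half-width is a
# hypothetical hyperparameter, not the paper's actual window derivation.
from dataclasses import dataclass
from typing import List


@dataclass
class Narration:
    text: str         # e.g. "C opens the drawer"
    timestamp: float   # seconds into the video


@dataclass
class NLQSample:
    query: str         # free-form text query input
    start: float       # localized temporal window output (seconds)
    end: float


def narrations_to_queries(narrations: List[Narration],
                          video_duration: float,
                          half_width: float = 2.0) -> List[NLQSample]:
    """Turn each narration into a query/temporal-window training sample."""
    samples = []
    for n in narrations:
        start = max(0.0, n.timestamp - half_width)
        end = min(video_duration, n.timestamp + half_width)
        samples.append(NLQSample(query=n.text, start=start, end=end))
    return samples


if __name__ == "__main__":
    narrations = [Narration("C opens the drawer", 12.4),
                  Narration("C picks up the knife", 35.1)]
    for s in narrations_to_queries(narrations, video_duration=120.0):
        print(s)
```

These synthetic samples are then mixed with the manually annotated NLQ training data as a form of data augmentation.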
Related papers
- Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting [15.161997580529075]
This paper explores the novel challenge of VideoQA within a continual learning framework.
We propose Collaborative Prompting (ColPro), which integrates specific question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting.
Experimental results on the NExT-QA and DramaQA datasets show that ColPro achieves superior performance compared to existing approaches.
arXiv Detail & Related papers (2024-10-01T15:07:07Z) - Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long, egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z) - Vamos: Versatile Action Models for Video Understanding [23.631145570126268]
We propose versatile action models (Vamos), a learning framework powered by a large language model as the 'reasoner'.
We evaluate Vamos on five benchmarks (Ego4D, NeXT-QA, IntentQA, Spacewalk-18, and EgoSchema) for its capability to model temporal dynamics, encode visual history, and perform reasoning.
arXiv Detail & Related papers (2023-11-22T17:44:24Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields absolute zero-shot accuracy gains of 3.85% on VQAv2, 6.41% on A-OKVQA, and 7.94% on VizWiz.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z) - HySTER: A Hybrid Spatio-Temporal Event Reasoner [75.41988728376081]
We present the HySTER: a Hybrid Spatio-Temporal Event Reasoner to reason over physical events in videos.
We define a method based on general temporal, causal and physics rules which can be transferred across tasks.
This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.
arXiv Detail & Related papers (2021-01-17T11:07:17Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z) - Co-attentional Transformers for Story-Based Video Understanding [24.211255523490692]
We propose a novel co-attentional transformer model to better capture long-term dependencies seen in visual stories such as dramas.
We evaluate our approach on the recently introduced DramaQA dataset which features character-centered video story understanding questions.
arXiv Detail & Related papers (2020-10-27T07:17:09Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)