Single-Stage Visual Query Localization in Egocentric Videos
- URL: http://arxiv.org/abs/2306.09324v1
- Date: Thu, 15 Jun 2023 17:57:28 GMT
- Title: Single-Stage Visual Query Localization in Egocentric Videos
- Authors: Hanwen Jiang, Santhosh Kumar Ramakrishnan, Kristen Grauman
- Abstract summary: We propose a single-stage VQL framework that is end-to-end trainable.
We establish the query-video relationship by considering query-to-frame correspondences between the query and each video frame.
Our experiments demonstrate that our approach outperforms prior VQL methods by 20% in accuracy while obtaining a 10x improvement in inference speed.
- Score: 79.71065005161566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Query Localization on long-form egocentric videos requires
spatio-temporal search and localization of visually specified objects and is
vital to build episodic memory systems. Prior work develops complex multi-stage
pipelines that leverage well-established object detection and tracking methods
to perform VQL. However, each stage is independently trained and the complexity
of the pipeline results in slow inference speeds. We propose VQLoC, a novel
single-stage VQL framework that is end-to-end trainable. Our key idea is to
first build a holistic understanding of the query-video relationship and then
perform spatio-temporal localization in a single-shot manner. Specifically, we
establish the query-video relationship by jointly considering query-to-frame
correspondences between the query and each video frame and frame-to-frame
correspondences between nearby video frames. Our experiments demonstrate that
our approach outperforms prior VQL methods by 20% in accuracy while obtaining a
10x improvement in inference speed. VQLoC is also the top entry on the Ego4D
VQ2D challenge leaderboard. Project page: https://hwjiang1510.github.io/VQLoC/
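For intuition, the sketch below shows one way the described query-to-frame and frame-to-frame correspondence idea could be wired into a single end-to-end module: frame tokens attend to query tokens, per-frame descriptors then attend across time, and lightweight heads emit a box and an occurrence score per frame. This is a hedged illustration only; the module choices, dimensions, and heads are assumptions, not VQLoC's actual architecture (see the project page for the real implementation).

```python
import torch
import torch.nn as nn

class QueryVideoRelationSketch(nn.Module):
    """Illustrative single-stage pipeline: query-to-frame attention, then
    frame-to-frame (temporal) attention, then per-frame box/occurrence heads.
    All hyperparameters and module names are assumptions, not the paper's."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        # query-to-frame: frame tokens attend to the visual-query tokens
        self.q2f = nn.MultiheadAttention(dim, heads, batch_first=True)
        # frame-to-frame: temporal self-attention over per-frame descriptors
        self.f2f = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.box_head = nn.Linear(dim, 4)    # (cx, cy, w, h), normalized
        self.score_head = nn.Linear(dim, 1)  # probability the query is visible

    def forward(self, frame_tokens, query_tokens):
        # frame_tokens: (T, N, D) patch tokens per frame; query_tokens: (M, D)
        T, N, D = frame_tokens.shape
        q = query_tokens.unsqueeze(0).expand(T, -1, -1)           # (T, M, D)
        fused, _ = self.q2f(frame_tokens, q, q)                   # (T, N, D)
        frame_desc = fused.mean(dim=1).unsqueeze(0)               # (1, T, D)
        temporal = self.f2f(frame_desc).squeeze(0)                # (T, D)
        boxes = self.box_head(temporal).sigmoid()                 # (T, 4)
        scores = self.score_head(temporal).sigmoid().squeeze(-1)  # (T,)
        return boxes, scores

# toy usage: 8 frames, 196 patch tokens each, 16 query tokens, dim 256
model = QueryVideoRelationSketch()
boxes, scores = model(torch.randn(8, 196, 256), torch.randn(16, 256))
```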
Related papers
- PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization [32.75411084716383]
Egocentric visual query localization (EgoVQL) focuses on localizing the target of interest in space and time from first-person videos.
We introduce PRVQL, a novel Progressive knowledge-guided Refinement framework for EgoVQL.
arXiv Detail & Related papers (2025-02-11T17:04:31Z)
- TimeLogic: A Temporal Logic Benchmark for Video QA [64.32208175236323]
We introduce the TimeLogic QA (TLQA) framework to automatically generate temporal logical questions.
We leverage 4 datasets, STAR, Breakfast, AGQA, and CrossTask, and generate 2k and 10k QA pairs for each category.
We assess the VideoQA model's temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
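As a rough illustration of templated question generation from interval annotations, here is a hedged toy generator for one "before/after" category; the template, event labels, and answer rule are assumptions for illustration, not the TLQA framework's.

```python
from dataclasses import dataclass
import random

@dataclass
class Event:
    label: str
    start: float  # seconds
    end: float

def make_before_after_question(events, rng=random):
    """Generate one toy temporal-logic QA pair from interval annotations.
    Illustrative only; not TLQA's actual templates or logic categories."""
    a, b = rng.sample(events, 2)
    question = f"Does '{a.label}' happen before '{b.label}'?"
    answer = "yes" if a.end <= b.start else "no"
    return question, answer

# toy usage with hypothetical annotations
events = [Event("crack egg", 2.0, 5.0), Event("whisk mixture", 6.0, 12.0)]
print(make_before_after_question(events))
```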
arXiv Detail & Related papers (2025-01-13T11:12:59Z)
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
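A minimal sketch of what CLIP-score-guided frame selection could look like, assuming frame and text embeddings have already been computed with some CLIP encoder; the top-k rule below is an illustrative stand-in, not VaQuitA's actual sampling method.

```python
import torch
import torch.nn.functional as F

def select_frames_by_clip_score(frame_feats, text_feat, k=16):
    """Rank frames by cosine similarity between per-frame CLIP embeddings and
    the CLIP text embedding of the query, then keep the top-k frames in
    temporal order. frame_feats: (T, D); text_feat: (D,), both precomputed."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    scores = frame_feats @ text_feat                   # (T,) cosine similarities
    topk = torch.topk(scores, k=min(k, len(scores))).indices
    return torch.sort(topk).values                     # keep temporal order

# toy usage: 120 frames with 512-dim CLIP features, keep 16
idx = select_frames_by_clip_score(torch.randn(120, 512), torch.randn(512), k=16)
```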
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
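The toy sketch below illustrates the general match-then-refine pattern: an independent per-frame argmax over a similarity map, followed by a local similarity-weighted refinement. It is an assumption-laden approximation of the two-stage idea, not the TAPIR model.

```python
import torch
import torch.nn.functional as F

def match_then_refine(query_feat, frame_feats, window=2):
    """Toy point tracking: (1) per-frame matching picks the feature-map location
    most similar to the query feature; (2) refinement nudges each estimate
    toward a similarity-weighted centroid in a local window."""
    T, D, H, W = frame_feats.shape
    q = F.normalize(query_feat, dim=0)                 # (D,)
    f = F.normalize(frame_feats, dim=1)                # (T, D, H, W)
    sim = torch.einsum("d,tdhw->thw", q, f)            # (T, H, W)

    # stage 1: independent per-frame argmax over the similarity map
    flat_idx = sim.flatten(1).argmax(dim=1)            # (T,)
    ys = (flat_idx // W).tolist()
    xs = (flat_idx % W).tolist()

    # stage 2: refine with a similarity-weighted centroid around each argmax
    coords = []
    for t in range(T):
        y0, y1 = max(ys[t] - window, 0), min(ys[t] + window + 1, H)
        x0, x1 = max(xs[t] - window, 0), min(xs[t] + window + 1, W)
        patch = sim[t, y0:y1, x0:x1].clamp(min=0) + 1e-6
        yy, xx = torch.meshgrid(torch.arange(y0, y1), torch.arange(x0, x1), indexing="ij")
        weights = patch / patch.sum()
        coords.append((float((weights * yy).sum()), float((weights * xx).sum())))
    return coords  # refined (y, x) per frame

# toy usage: 5 frames, 64-dim features on a 32x32 grid
traj = match_then_refine(torch.randn(64), torch.randn(5, 64, 32, 32))
```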
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
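To contrast cascaded selection with dense spatio-temporal attention, here is a hedged toy version that scores segments against the question, keeps the top ones, and then keeps the most relevant regions within them; the shapes and scoring rule are assumptions, not MIST's actual modules.

```python
import torch
import torch.nn.functional as F

def cascaded_selection(region_feats, question_feat, top_segments=2, top_regions=4):
    """Toy cascade: pick top segments by mean region-question similarity, then
    pick top regions inside the kept segments. region_feats: (S, R, D)."""
    q = F.normalize(question_feat, dim=-1)                 # (D,)
    r = F.normalize(region_feats, dim=-1)                  # (S, R, D)
    seg_scores = (r @ q).mean(dim=1)                       # (S,)
    seg_idx = seg_scores.topk(top_segments).indices        # chosen segments
    kept = r[seg_idx]                                      # (top_segments, R, D)
    reg_scores = kept @ q                                  # (top_segments, R)
    reg_idx = reg_scores.topk(top_regions, dim=1).indices  # regions per segment
    selected = torch.gather(kept, 1, reg_idx.unsqueeze(-1).expand(-1, -1, kept.size(-1)))
    return seg_idx, selected                               # features to attend over

# toy usage: 8 segments, 16 regions each, 256-dim features
seg_idx, selected = cascaded_selection(torch.randn(8, 16, 256), torch.randn(256))
```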
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model that predicts the temporal boundaries based on interaction modeling.
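A simplistic, hedged stand-in for frame-wise cross-modal matching: score each frame against a sentence embedding and take the longest above-threshold run as the predicted moment. The real model learns these scores and boundaries; this only illustrates the frame-wise matching idea.

```python
import torch
import torch.nn.functional as F

def predict_moment(frame_feats, query_feat, threshold=0.5):
    """Score frames against the query embedding, then return the longest
    contiguous run of frames whose relevance exceeds a threshold."""
    scores = torch.sigmoid(
        F.normalize(frame_feats, dim=-1) @ F.normalize(query_feat, dim=-1) * 5.0
    )
    relevant = (scores > threshold).tolist()
    best, cur_start, best_span = None, None, 0
    for t, flag in enumerate(relevant + [False]):    # sentinel closes a trailing run
        if flag and cur_start is None:
            cur_start = t
        elif not flag and cur_start is not None:
            if t - cur_start > best_span:
                best, best_span = (cur_start, t - 1), t - cur_start
            cur_start = None
    return best, scores  # (start_frame, end_frame) or None, plus per-frame scores

# toy usage: 50 frames, 512-dim visual and query features
moment, scores = predict_moment(torch.randn(50, 512), torch.randn(512))
```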
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
- VidCEP: Complex Event Processing Framework to Detect Spatiotemporal Patterns in Video Streams [5.53329677986653]
Middleware systems such as Complex Event Processing (CEP) mine patterns from data streams and send notifications to users in a timely fashion.
Current CEP systems have inherent limitations in querying video streams due to their unstructured data model and the lack of an expressive query language.
We propose VidCEP, an in-memory, near real-time complex event matching framework for video streams.
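As a hedged illustration of windowed pattern matching over a detection stream, the sketch below fires an event when two labels co-occur for several consecutive frames. VidCEP's actual data model and query language are far richer; the co-occurrence rule here is an invented example.

```python
from collections import deque

def detect_cooccurrence(frame_stream, labels=("person", "car"), window=5):
    """Emit an event when all labels co-occur for `window` consecutive frames.
    frame_stream yields (timestamp, set_of_detected_labels) tuples."""
    recent = deque(maxlen=window)
    for ts, detected in frame_stream:
        recent.append(all(lbl in detected for lbl in labels))
        if len(recent) == window and all(recent):
            yield ("co-occurrence", labels, ts)
            recent.clear()  # avoid re-firing on every subsequent frame

# toy usage with a synthetic detection stream
stream = [(t, {"person", "car"} if 3 <= t <= 9 else {"person"}) for t in range(12)]
print(list(detect_cooccurrence(stream)))
```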
arXiv Detail & Related papers (2020-07-15T16:43:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.