SpotEM: Efficient Video Search for Episodic Memory
- URL: http://arxiv.org/abs/2306.15850v1
- Date: Wed, 28 Jun 2023 00:52:49 GMT
- Title: SpotEM: Efficient Video Search for Episodic Memory
- Authors: Santhosh Kumar Ramakrishnan, Ziad Al-Halah, Kristen Grauman
- Abstract summary: Episodic memory (EM) aims to search a long egocentric video to answer a natural language query.
Existing methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer.
We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy.
- Score: 92.98552727430483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal in episodic memory (EM) is to search a long egocentric video to
answer a natural language query (e.g., "where did I leave my purse?"). Existing
EM methods exhaustively extract expensive fixed-length clip features to look
everywhere in the video for the answer, which is infeasible for long
wearable-camera videos that span hours or even days. We propose SpotEM, an
approach to achieve efficiency for a given EM method while maintaining good
accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that
learns to identify promising video regions to search conditioned on the
language query; 2) a set of low-cost semantic indexing features that capture
the context of rooms, objects, and interactions that suggest where to look; and
3) distillation losses that address the optimization issues arising from
end-to-end joint training of the clip selector and EM model. Our experiments on
200+ hours of video from the Ego4D EM Natural Language Queries benchmark and
three different EM models demonstrate the effectiveness of our approach:
computing only 10% - 25% of the clip features, we preserve 84% - 97% of the
original EM model's accuracy. Project page:
https://vision.cs.utexas.edu/projects/spotem
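The abstract above describes a selector-plus-EM pipeline: score clips cheaply, then spend expensive feature extraction only on promising regions. A minimal PyTorch-style sketch of that control flow is given below; the module names, dimensions, and the 25% budget are illustrative assumptions rather than the paper's implementation, and the distillation losses used for joint training are omitted.

```python
# Sketch of query-conditioned clip selection in the spirit of SpotEM.
# All names, dimensions, and the budget are illustrative assumptions.
import torch
import torch.nn as nn

class ClipSelector(nn.Module):
    """Scores clips from cheap semantic features, conditioned on the query."""
    def __init__(self, feat_dim=128, query_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + query_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, cheap_feats, query_emb):
        # cheap_feats: (num_clips, feat_dim) low-cost semantic indexing features
        # query_emb:   (query_dim,) language query embedding
        q = query_emb.unsqueeze(0).expand(cheap_feats.size(0), -1)
        return self.mlp(torch.cat([cheap_feats, q], dim=-1)).squeeze(-1)

def spot_then_search(video_clips, query_emb, cheap_encoder, selector,
                     expensive_encoder, em_model, budget=0.25):
    """Extract expensive clip features only for the top-scoring fraction."""
    cheap_feats = cheap_encoder(video_clips)           # cheap pass over all clips
    scores = selector(cheap_feats, query_emb)          # query-conditioned scores
    k = max(1, int(budget * len(video_clips)))
    keep = torch.topk(scores, k).indices               # promising regions only
    rich_feats = expensive_encoder(video_clips[keep])  # costly features for ~10-25%
    return em_model(rich_feats, keep, query_emb)       # predict the response window
```

Here `cheap_encoder`, `expensive_encoder`, and `em_model` are placeholder callables standing in for the semantic indexing features, the heavyweight clip encoder, and the downstream EM model, respectively.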
Related papers
- Unleashing Hour-Scale Video Training for Long Video-Language Understanding [61.717205915329664]
We present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. We propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling.
arXiv Detail & Related papers (2025-06-05T17:59:04Z) - MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos [62.01402470874109]
We present MomentSeeker, a benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval tasks.
It incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval.
It covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios.
We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark.
arXiv Detail & Related papers (2025-02-18T05:50:23Z) - Magic 1-For-1: Generating One Minute Video Clips within One Minute [53.07214657235465]
We present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency.
By applying a test time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics.
arXiv Detail & Related papers (2025-02-11T16:58:15Z) - TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment [10.567291051485194]
We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner.
Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
arXiv Detail & Related papers (2023-07-05T23:01:26Z) - Multi-video Moment Ranking with Multimodal Clue [69.81533127815884]
State-of-the-art work for video corpus moment retrieval (VCMR) is based on a two-stage method.
MINUTE outperforms the baselines on the TVR and DiDeMo datasets.
arXiv Detail & Related papers (2023-01-29T18:38:13Z) - MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that represents videos jointly over time.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
arXiv Detail & Related papers (2022-01-07T19:00:21Z) - Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z) - Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
A canonical approach to video-and-language learning dictates a neural model to learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z) - Temporal Stochastic Softmax for 3D CNNs: An Application in Facial Expression Recognition [11.517316695930596]
We present a strategy for efficient video-based training of 3D CNNs.
It relies on softmax temporal pooling and a weighted sampling mechanism to select the most relevant training clips.
arXiv Detail & Related papers (2020-11-10T16:40:00Z)
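The softmax-weighted clip sampling described in the last entry can be sketched in a few lines. The score source, temperature, and sampling call below are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of softmax-weighted training-clip sampling, loosely following the
# temporal stochastic softmax idea above. Scores and temperature are assumed.
import torch

def sample_training_clips(clip_scores, num_samples, temperature=1.0):
    """Draw clips with probability proportional to softmax(scores / T)."""
    # clip_scores: (num_clips,) relevance scores kept per clip, e.g. from
    # per-clip activations recorded in an earlier training pass.
    probs = torch.softmax(clip_scores / temperature, dim=0)
    return torch.multinomial(probs, num_samples, replacement=False)

# Example: prefer the clips with the highest remembered activations.
scores = torch.tensor([0.1, 2.0, 0.5, 1.7])
picked = sample_training_clips(scores, num_samples=2)
```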
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.