AHA - Predicting What Matters Next: Online Highlight Detection Without Looking Ahead
- URL: http://arxiv.org/abs/2509.16421v2
- Date: Tue, 23 Sep 2025 00:52:32 GMT
- Title: AHA - Predicting What Matters Next: Online Highlight Detection Without Looking Ahead
- Authors: Aiden Chang, Celso De Melo, Stephanie M. Lukin
- Abstract summary: Aha is an autoregressive highlight detection framework that predicts the relevance of each video frame against a task described in natural language. Aha achieves state-of-the-art (SOTA) performance on highlight detection benchmarks. We explore Aha's potential for real-world robotics applications given a task-oriented natural language input and a continuous, robot-centric video.
- Score: 4.55107996328448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time understanding of continuous video streams is essential for intelligent agents operating in high-stakes environments, including autonomous vehicles, surveillance drones, and disaster response robots. Yet, most existing video understanding and highlight detection methods assume access to the entire video during inference, making them unsuitable for online or streaming scenarios. In particular, current models optimize for offline summarization, failing to support step-by-step reasoning needed for real-time decision-making. We introduce Aha, an autoregressive highlight detection framework that predicts the relevance of each video frame against a task described in natural language. Without accessing future video frames, Aha utilizes a multimodal vision-language model and lightweight, decoupled heads trained on a large, curated dataset of human-centric video labels. To enable scalability, we introduce the Dynamic SinkCache mechanism that achieves constant memory usage across infinite-length streams without degrading performance on standard benchmarks. This encourages the hidden representation to capture high-level task objectives, enabling effective frame-level rankings for informativeness, relevance, and uncertainty with respect to the natural language task. Aha achieves state-of-the-art (SOTA) performance on highlight detection benchmarks, surpassing even prior offline, full-context approaches and video-language models by +5.9% on TVSum and +8.3% on Mr. Hisum in mAP (mean Average Precision). We explore Aha's potential for real-world robotics applications given a task-oriented natural language input and a continuous, robot-centric video. Both experiments demonstrate Aha's potential effectiveness as a real-time reasoning module for downstream planning and long-horizon understanding.
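The constant-memory property claimed for the Dynamic SinkCache can be illustrated with a minimal sketch: retain a handful of initial "sink" entries plus a fixed-size window of the most recent entries, evicting everything in between. The class name, sizes, and eviction policy below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
from collections import deque

class SinkCacheSketch:
    """Toy sink-style cache: memory stays constant no matter how long
    the stream runs (hypothetical sizes; not the paper's design)."""

    def __init__(self, num_sink=4, window=8):
        self.num_sink = num_sink
        self.sinks = []                     # first few "attention sink" entries, kept forever
        self.recent = deque(maxlen=window)  # fixed-size sliding window of recent entries

    def append(self, entry):
        if len(self.sinks) < self.num_sink:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)       # deque evicts the oldest entry automatically

    def view(self):
        # Entries available to attend over at the current step.
        return self.sinks + list(self.recent)

cache = SinkCacheSketch(num_sink=4, window=8)
for t in range(1000):                       # simulate a long stream of frame entries
    cache.append(t)
print(len(cache.view()))                    # constant: 4 sinks + 8 recent = 12
```

Under this scheme the cache size is bounded by `num_sink + window` regardless of stream length, which is the property that makes infinite-length streaming feasible.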
Related papers
- VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning [42.22791607763693]
VideoVeritas is a framework for fine-grained perception and fact-based reasoning. It uses Joint Perception Preference and Perception Pretext Reinforcement Learning.
arXiv Detail & Related papers (2026-02-09T16:00:01Z) - Future Optical Flow Prediction Improves Robot Control & Video Generation [100.87884718953099]
We introduce FOFPred, a novel optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred.
arXiv Detail & Related papers (2026-01-15T18:49:48Z) - CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos [73.51386721543135]
We propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. CLAP maps video transitions onto a quantized, physically executable codebook. We introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation.
arXiv Detail & Related papers (2026-01-07T16:26:33Z) - mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs [5.109732854501585]
We introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. Our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
arXiv Detail & Related papers (2025-12-17T18:47:31Z) - StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios [33.70462645363648]
StreamEQA is the first benchmark for streaming video question answering in embodied scenarios. It is built upon 156 independent long videos and generates approximately 21K question-answer pairs with precise timestamps. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.
arXiv Detail & Related papers (2025-12-04T04:48:16Z) - StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding [52.55809460075286]
We propose a StreamAgent that anticipates the temporal intervals and spatial regions expected to contain future task-relevant information. We integrate question semantics and historical observations through prompting the anticipatory agent to anticipate the temporal progression of key events. Our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
arXiv Detail & Related papers (2025-08-03T18:15:42Z) - ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts [56.75723197779384]
ARC-Hunyuan-Video is a multimodal model that processes visual, audio, and textual signals end-to-end for structured comprehension. Our model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning.
arXiv Detail & Related papers (2025-07-28T15:52:36Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition [3.271109623410664]
We propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. Our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications.
arXiv Detail & Related papers (2025-03-10T12:43:51Z) - HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs. HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z) - Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z) - Object Aware Egocentric Online Action Detection [23.504280692701272]
We introduce an Object-Aware Module that integrates egocentric-specific priors into existing Online Action Detection frameworks.
Our work can be seamlessly integrated into existing models with minimal overhead and bring consistent performance enhancements.
arXiv Detail & Related papers (2024-06-03T07:58:40Z) - HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model [9.762722976833581]
Current models rely extensively on instance-level alignment between video and language modalities.
We take an inspiration from human perception and explore a compositional approach for ego video representation.
arXiv Detail & Related papers (2024-06-01T05:41:12Z) - How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES)
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z) - Look, Remember and Reason: Grounded reasoning in videos with language models [5.3445140425713245]
Multi-modal language models (LM) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.