LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
- URL: http://arxiv.org/abs/2503.03663v2
- Date: Thu, 06 Mar 2025 16:25:37 GMT
- Title: LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
- Authors: Wei Li, Bing Hu, Rui Shao, Leyang Shen, Liqiang Nie
- Abstract summary: "Fast & Slow Video-Language Thinker" is an online video assistant, LION-FS, that achieves real-time, proactive, temporally accurate, and contextually precise responses.
- Score: 49.541465732827504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features. To overcome the trade-off between efficacy and efficiency, we propose "Fast & Slow Video-Language Thinker" as an onLIne videO assistaNt, LION-FS, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: 1) Fast Path: Routing-Based Response Determination evaluates frame-by-frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing token numbers, while utilizing Token Dropping Routing to eliminate redundant features. 2) Slow Path: Multi-granularity Keyframe Augmentation optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. These features are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency.
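To make the two-path design concrete, the sketch below gives one plausible reading of the abstract in PyTorch: a Fast Path router that scores and drops redundant tokens, fuses the remainder into a fixed-size running state, and decides frame-by-frame whether to respond, plus a Slow Path that pools a keyframe at several granularities before filling a thinking-template prompt. Every name, dimension, pooling size, threshold, and template string here is an assumption made for illustration, not the released LION-FS implementation.

```python
# Minimal, illustrative sketch of the Fast/Slow two-path flow described in the
# abstract. All names, dimensions, thresholds, and template wording are
# hypothetical; they are NOT taken from the LION-FS codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FastPathRouter(nn.Module):
    """Fast Path: frame-by-frame decision on whether an immediate response is needed."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # Token Dropping Routing (assumed form): score tokens, keep the salient ones.
        self.keep_score = nn.Linear(dim, 1)
        # Token Aggregation Routing (assumed form): fuse the kept tokens of the new
        # frame into a running spatiotemporal state without growing the token count.
        self.aggregate = nn.GRUCell(dim, dim)
        # Binary head: respond now vs. stay silent.
        self.respond_head = nn.Linear(dim, 2)

    def forward(self, frame_tokens: torch.Tensor, state: torch.Tensor,
                keep_ratio: float = 0.5):
        # frame_tokens: (num_tokens, dim) visual tokens of the current frame
        scores = self.keep_score(frame_tokens).squeeze(-1)             # (num_tokens,)
        k = max(1, int(keep_ratio * frame_tokens.size(0)))
        kept = frame_tokens[scores.topk(k).indices]                    # drop redundant tokens
        state = self.aggregate(kept.mean(dim=0, keepdim=True), state)  # fuse, token count unchanged
        respond = self.respond_head(state).softmax(-1)[0, 1] > 0.5
        return bool(respond), state


def slow_path_response(keyframe_tokens: torch.Tensor, llm_generate):
    """Slow Path (assumed form): multi-granularity keyframe augmentation + template."""
    # Pool the keyframe tokens at several granularities (sizes are illustrative):
    # fine-grained spatial detail, mid-level human-environment interaction cues,
    # and a coarse global summary.
    fine = keyframe_tokens                                             # (N, dim)
    mid = F.adaptive_avg_pool1d(keyframe_tokens.t().unsqueeze(0), 16).squeeze(0).t()
    coarse = keyframe_tokens.mean(dim=0, keepdim=True)                 # (1, dim)
    # A multimodal "Thinking Template" interleaving pooled features with text
    # before generation; the exact structure and wording are hypothetical.
    prompt = ["<spatial>", fine, "<interaction>", mid, "<global>", coarse,
              "Given the scene above, respond to the user's ongoing activity."]
    return llm_generate(prompt)


# Usage sketch: route every incoming frame; only invoke the slow path on frames
# the router flags, so high-frame-rate streams stay real-time.
if __name__ == "__main__":
    router = FastPathRouter(dim=768)
    state = torch.zeros(1, 768)
    for _ in range(8):                                  # stand-in for a frame stream
        frame_tokens = torch.randn(196, 768)            # e.g. 14x14 patch tokens
        respond, state = router(frame_tokens, state)
        if respond:
            reply = slow_path_response(frame_tokens, lambda p: "<generated reply>")
            print(reply)
```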
Related papers
- ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts [56.75723197779384]
ARC-Hunyuan-Video is a multimodal model that processes visual, audio, and textual signals end-to-end for structured comprehension. Our model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning.
arXiv Detail & Related papers (2025-07-28T15:52:36Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z) - VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning [33.37714717781103]
VideoMind is a novel video-language agent designed for temporal-grounded video understanding.
We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow.
We propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors.
arXiv Detail & Related papers (2025-03-17T17:59:33Z) - Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning [29.89820310679906]
We propose an agent-based approach to enhance both the efficiency and effectiveness of long-form video understanding.
A key aspect of our method is query-adaptive frame sampling, which leverages the reasoning capabilities of LLMs to process only the most relevant frames in real-time.
We evaluate our method across several video understanding benchmarks and demonstrate that it not only enhances state-of-the-art performance but also improves efficiency by reducing the number of frames sampled.
arXiv Detail & Related papers (2024-10-26T19:01:06Z) - Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z) - Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving [9.900979396513687]
Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems.
One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information.
We propose Video Token Sparsification (VTS) to significantly reduce the total number of visual tokens while preserving the most salient information.
arXiv Detail & Related papers (2024-09-16T05:31:01Z) - SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models [51.712700398020075]
We propose a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context.
This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way.
Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
arXiv Detail & Related papers (2024-07-22T17:58:04Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapped cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z) - Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, which combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework; among its advantages over previous works, it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z) - CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding [70.7882058229772]
This paper tackles the emerging and challenging problem of long video temporal grounding (VTG).
Compared with short videos, long videos are equally in demand but far less explored.
We propose CONE, an efficient COarse-to-fiNE alignment framework.
arXiv Detail & Related papers (2022-09-22T10:58:42Z) - Enhanced Spatio-Temporal Interaction Learning for Video Deraining: A Faster and Better Framework [93.37833982180538]
Video deraining is an important task in computer vision as the unwanted rain hampers the visibility of videos and deteriorates the robustness of most outdoor vision systems.
We present a new end-to-end deraining framework, named Enhanced Spatio-Temporal Interaction Network (ESTINet).
ESTINet considerably boosts current state-of-the-art video deraining quality and speed.
arXiv Detail & Related papers (2021-03-23T05:19:35Z)