EgoVLM: Policy Optimization for Egocentric Video Understanding
- URL: http://arxiv.org/abs/2506.03097v1
- Date: Tue, 03 Jun 2025 17:28:00 GMT
- Title: EgoVLM: Policy Optimization for Egocentric Video Understanding
- Authors: Ashwin Vinod, Shrey Pandit, Aditya Vavre, Linshen Liu,
- Abstract summary: We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the EgoSchema benchmark, respectively.
- Score: 2.397572703240721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emerging embodied AI applications, such as wearable cameras and autonomous agents, have underscored the need for robust reasoning from first person video streams. We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Following DeepSeek R1-Zero's approach, we directly tune using RL without any supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the EgoSchema benchmark, respectively. By explicitly generating reasoning traces, EgoVLM enhances interpretability, making it well-suited for downstream applications. Furthermore, we introduce a novel keyframe-based reward that incorporates salient frame selection to guide reinforcement learning optimization. This reward formulation opens a promising avenue for future exploration in temporally grounded egocentric reasoning.
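To make the GRPO training signal described in the abstract concrete, the following is a minimal, self-contained sketch of the group-relative advantage computation that GRPO uses in place of a learned critic, paired with a purely hypothetical reward that mixes answer correctness with keyframe coverage. The `toy_reward` helper, its weighting, and the example data are illustrative assumptions, not the paper's keyframe-based reward formulation or training code.

```python
# Minimal sketch (not the paper's code) of GRPO's group-relative advantages:
# for one question, a group of candidate answers is sampled from the policy and
# each reward is normalized by the group's mean and std, avoiding a critic network.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: per-answer rewards for one prompt's sampled group of size G."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def toy_reward(answer, reference, used_keyframes, salient_keyframes):
    """Hypothetical reward mixing answer correctness with keyframe coverage.
    The paper's actual keyframe-based reward is not reproduced here."""
    correctness = 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0
    coverage = len(set(used_keyframes) & set(salient_keyframes)) / max(len(salient_keyframes), 1)
    return correctness + 0.5 * coverage

# Example: a group of 4 sampled answers for one egocentric question.
group = [("open the fridge", [12, 48]),      # correct answer, partial keyframe overlap
         ("close the door", [3]),            # wrong answer, no overlap
         ("open the fridge", [12, 48, 80]),  # correct answer, full overlap
         ("pick up the cup", [48, 80])]      # wrong answer, partial overlap
rewards = [toy_reward(ans, "open the fridge", kf, [12, 48, 80]) for ans, kf in group]
print(group_relative_advantages(rewards))    # positive for above-average answers
```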
Related papers
- Zero-Shot Temporal Interaction Localization for Egocentric Videos [13.70694228506315]
We propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos.
By absorbing both 2D and 3D observations, EgoLoc directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI.
EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-04T07:52:46Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding.
ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error (a toy sketch of this accuracy-as-reward loop follows this entry).
ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
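The accuracy-as-reward idea summarized in the ViaRL entry above can be illustrated with a toy, self-contained loop: a frame selector samples frames, a stand-in QA model answers, and a binary correctness reward nudges the selector's preferences. Everything below (the `toy_downstream_qa` stub, the score-update rule, the numbers) is an assumption for illustration and not ViaRL's rule-based RL recipe.

```python
# Toy illustration of "downstream answer accuracy as a reward" for frame selection;
# all components are placeholders, not ViaRL's actual method.
import random

def toy_downstream_qa(chosen_frames, salient_frames):
    """Stand-in for a video QA model: it answers correctly only when at least
    half of the salient frames were among the selected ones."""
    hits = len(set(chosen_frames) & set(salient_frames))
    return hits >= len(salient_frames) / 2

def sample_frames(scores, k):
    """Sample k frame indices with probability proportional to selector scores."""
    idx = random.choices(range(len(scores)), weights=scores, k=k)
    return sorted(set(idx))

def update_scores(scores, chosen, reward, lr=0.2):
    """Trial-and-error update: raise scores of chosen frames on success, lower on failure."""
    for i in chosen:
        scores[i] = max(scores[i] + lr * (reward - 0.5), 1e-3)
    return scores

scores = [1.0] * 64                 # uniform preference over 64 candidate frames
salient = [10, 11, 40]              # frames that actually answer the question (toy label)
for _ in range(200):                # trial-and-error loop
    chosen = sample_frames(scores, k=8)
    reward = 1.0 if toy_downstream_qa(chosen, salient) else 0.0
    scores = update_scores(scores, chosen, reward)
print(sorted(range(64), key=lambda i: -scores[i])[:8])  # frames the selector now prefers
```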
- Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding [69.96199605596138]
Current MLLMs primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos.
We propose learning the mapping between exocentric and egocentric domains to enhance egocentric video understanding.
We introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs.
arXiv Detail & Related papers (2025-03-12T08:10:33Z)
- VideoSAVi: Self-Aligned Video Language Models without Human Supervision [0.6854849895338531]
VideoSAVi is a self-training pipeline that enables Video-LLMs to reason over video content without external supervision.
VideoSAVi achieves state-of-the-art performance on MVBench (74.0%) and delivers significant improvements.
Our model-agnostic approach is computationally efficient, requiring only 32 frames.
arXiv Detail & Related papers (2024-12-01T00:33:05Z)
- MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding.
We automatically generate 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long in Ego4D based on human-annotated data.
We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z)
- Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation [57.38965505987893]
Ego-VPA is a parameter-efficient adaptation for egocentric video tasks.
Ego-VPA excels in lightweight adaptation with only 0.84% learnable parameters.
arXiv Detail & Related papers (2024-07-28T16:01:32Z)
- AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding [44.79843213164787]
Embodied AI personal assistants require embodied understanding to collaborate with humans effectively.
Current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric experience.
We introduce the Egocentric Video Understanding dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos.
We present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD.
arXiv Detail & Related papers (2024-06-19T20:14:14Z)
- EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding [99.904140768186]
This paper proposes a new framework as an infrastructure to advance Ego-HOI recognition by Probing, Curation and Adaption (EgoPCA).
We contribute comprehensive pre-train sets, balanced test sets and a new baseline, which are complete with a training-finetuning strategy.
We believe our data and the findings will pave a new way for Ego-HOI understanding.
arXiv Detail & Related papers (2023-09-05T17:51:16Z)
- EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding [90.9111678470214]
We propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features (a toy sketch of this feature-reconstruction idea follows this entry).
Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models.
We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
arXiv Detail & Related papers (2023-01-05T18:39:23Z)
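The EgoDistill entry above describes reconstructing heavy video clip features from much lighter inputs. Below is a minimal PyTorch sketch of that feature-reconstruction distillation idea under assumed inputs: a single-frame feature and a head-motion feature are regressed onto a frozen teacher's clip feature with an MSE loss. Module names, dimensions, and the loss choice are illustrative assumptions, not the paper's implementation.

```python
# Toy feature-reconstruction distillation: a lightweight student maps
# (single-frame feature, head-motion feature) to the heavy teacher's clip feature.
import torch
import torch.nn as nn

class ToyStudent(nn.Module):
    def __init__(self, frame_dim=512, imu_dim=64, clip_feat_dim=768):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(frame_dim + imu_dim, 512), nn.ReLU(),
            nn.Linear(512, clip_feat_dim),
        )

    def forward(self, frame_feat, imu_feat):
        return self.fuse(torch.cat([frame_feat, imu_feat], dim=-1))

student = ToyStudent()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# Pretend batch: precomputed frame features, IMU features, and frozen teacher clip features.
frame_feat = torch.randn(8, 512)
imu_feat = torch.randn(8, 64)
teacher_clip_feat = torch.randn(8, 768)

opt.zero_grad()
pred = student(frame_feat, imu_feat)
loss = nn.functional.mse_loss(pred, teacher_clip_feat)  # reconstruct the teacher's features
loss.backward()
opt.step()
```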