EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent
- URL: http://arxiv.org/abs/2507.15428v1
- Date: Mon, 21 Jul 2025 09:27:45 GMT
- Title: EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent
- Authors: Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen
- Abstract summary: EgoPrune is a training-free token pruning method tailored for egomotion video reasoning. EgoPrune consistently outperforms prior training-free methods across various pruning ratios. We deploy EgoPrune on an embodied agent equipped with a Jetson Orin NX 16GB edge device.
- Score: 41.11532785015233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Egomotion videos are first-person recordings in which the view changes continuously due to the agent's movement. Because they serve as the primary visual input for embodied AI agents, making egomotion video reasoning more efficient is essential for real-world deployment. Recent advances in vision-language models have enabled strong multimodal reasoning capabilities, but their computational cost remains prohibitive for long, redundant video inputs. Existing token pruning methods, typically designed for third-person videos, fail to leverage the spatiotemporal continuity and motion constraints inherent in egomotion settings. To address this, we propose EgoPrune, a training-free token pruning method tailored for egomotion video reasoning. EgoPrune comprises three components: a keyframe selector adapted from EmbodiedR for temporally efficient sampling; Perspective-Aware Redundancy Filtering (PARF), which aligns visual tokens using perspective transformations and removes redundant tokens; and a Maximal Marginal Relevance (MMR)-based token selector that jointly considers visual-text relevance and intra-frame diversity. Experiments on two egomotion video benchmarks show that EgoPrune consistently outperforms prior training-free methods across various pruning ratios while significantly reducing FLOPs, memory usage, and latency. Moreover, we deploy EgoPrune on an embodied agent equipped with a Jetson Orin NX 16GB edge device, demonstrating its real-world efficiency and suitability for on-device egomotion video reasoning.
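Of the three components, the MMR-based token selector is the most self-contained, so a brief sketch may help readers unfamiliar with Maximal Marginal Relevance. The snippet below is a minimal, hypothetical illustration of MMR applied to visual tokens, not the authors' implementation: the function name, the cosine-similarity scoring, and the `keep_ratio`/`lambda_` parameters are assumptions. At each greedy step it keeps the token that best trades off relevance to the text embedding against similarity to tokens already kept.

```python
import torch
import torch.nn.functional as F

def mmr_token_selection(tokens: torch.Tensor,
                        text_emb: torch.Tensor,
                        keep_ratio: float = 0.25,
                        lambda_: float = 0.6) -> torch.Tensor:
    """Greedy MMR selection over visual tokens (illustrative sketch).

    tokens:   (N, D) visual token embeddings for one frame
    text_emb: (D,)   pooled text/query embedding
    Returns a 1-D tensor with the indices of the kept tokens.
    """
    n_keep = max(1, int(tokens.shape[0] * keep_ratio))

    # Cosine similarities: relevance (token vs. text) and redundancy (token vs. token).
    tok = F.normalize(tokens, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    relevance = tok @ txt          # (N,)   visual-text relevance
    pairwise = tok @ tok.T         # (N, N) intra-frame similarity

    selected = [int(relevance.argmax())]   # seed with the most relevant token
    remaining = set(range(tokens.shape[0])) - set(selected)

    while len(selected) < n_keep:
        cand = torch.tensor(sorted(remaining))
        # Redundancy of each candidate: max similarity to any already-kept token.
        redundancy = pairwise[cand][:, selected].max(dim=1).values
        score = lambda_ * relevance[cand] - (1.0 - lambda_) * redundancy
        best = int(cand[score.argmax()])
        selected.append(best)
        remaining.remove(best)

    return torch.tensor(selected)

# Example: prune 576 patch tokens of dimension 1024 down to ~25%.
kept = mmr_token_selection(torch.randn(576, 1024), torch.randn(1024))
```

Varying `lambda_` moves the selector between pure visual-text relevance (`lambda_ = 1`) and pure intra-frame diversity (`lambda_ = 0`); the paper's selector jointly weighs both, which this sketch approximates.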
Related papers
- EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges. EgoM2P is a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
arXiv Detail & Related papers (2025-06-09T15:59:25Z) - EgoVLM: Policy Optimization for Egocentric Video Understanding [2.397572703240721]
We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Our EgoVLMB, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the Ego benchmark, respectively.
arXiv Detail & Related papers (2025-06-03T17:28:00Z) - SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video [11.198924693073353]
We pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification.
Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large.
arXiv Detail & Related papers (2024-06-13T03:57:38Z) - Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions? [48.702973928321946]
Egocentric video-language pretraining is a crucial step in advancing the understanding of hand-object interactions in first-person scenarios. Despite successes on existing testbeds, we find that current EgoVLMs can be easily misled by simple modifications. This raises the question: Do EgoVLMs truly understand hand-object interactions?
arXiv Detail & Related papers (2024-05-28T00:27:29Z) - Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions.
arXiv Detail & Related papers (2024-01-01T15:31:06Z) - EgoVSR: Towards High-Quality Egocentric Video Super-Resolution [23.50915512118989]
EgoVSR is a Video Super-Resolution framework specifically designed for egocentric videos.
We explicitly tackle motion blurs in egocentric videos using a Dual Branch Deblur Network (DB$^2$Net) in the VSR framework.
An online motion blur synthesis model for common VSR training data is proposed to simulate motion blurs as in egocentric videos.
arXiv Detail & Related papers (2023-05-24T04:25:51Z) - Egocentric Audio-Visual Object Localization [51.434212424829525]
We propose a geometry-aware temporal aggregation module to handle the egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations.
It improves cross-modal localization robustness by disentangling visually-indicated audio representation.
arXiv Detail & Related papers (2023-03-23T17:43:11Z) - EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding [90.9111678470214]
We propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features.
Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models.
We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
arXiv Detail & Related papers (2023-01-05T18:39:23Z)