Robust Ego-Exo Correspondence with Long-Term Memory
- URL: http://arxiv.org/abs/2510.11417v1
- Date: Mon, 13 Oct 2025 13:54:12 GMT
- Title: Robust Ego-Exo Correspondence with Long-Term Memory
- Authors: Yijun Hu, Bing Fan, Xin Gu, Haiqing Ren, Dongfang Liu, Heng Fan, Libo Zhang
- Abstract summary: We present a novel framework for establishing object-level correspondence between egocentric and exocentric views. Our approach features a dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE). In experiments on the challenging EgoExo4D benchmark, our method, dubbed LM-EEC, achieves new state-of-the-art results.
- Score: 34.992180181705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Establishing object-level correspondence between egocentric and exocentric views is essential for intelligent assistants to deliver precise and intuitive visual guidance. However, this task faces numerous challenges, including extreme viewpoint variations, occlusions, and the presence of small objects. Existing approaches usually borrow solutions from video object segmentation models, but still suffer from the aforementioned challenges. Recently, the Segment Anything Model 2 (SAM 2) has shown strong generalization capabilities and excellent performance in video object segmentation. Yet, when simply applied to the ego-exo correspondence (EEC) task, SAM 2 encounters severe difficulties due to ineffective ego-exo feature fusion and limited long-term memory capacity, especially for long videos. Addressing these problems, we propose a novel EEC framework based on SAM 2 with long-term memories by presenting a dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE). Compared to SAM 2, our approach features (i) a Memory-View MoE module which consists of a dual-branch routing mechanism to adaptively assign contribution weights to each expert feature along both channel and spatial dimensions, and (ii) a dual-memory bank system with a simple yet effective compression strategy to retain critical long-term information while eliminating redundancy. In the extensive experiments on the challenging EgoExo4D benchmark, our method, dubbed LM-EEC, achieves new state-of-the-art results and significantly outperforms existing methods and the SAM 2 baseline, showcasing its strong generalization across diverse scenarios. Our code and model are available at https://github.com/juneyeeHu/LM-EEC.
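The abstract describes the dual-memory bank only at a high level: a short-term bank of recent frames plus a long-term bank maintained with a "simple yet effective compression strategy" that keeps critical information while eliminating redundancy. A minimal sketch of that idea is below; the class name, the capacities, and the every-`stride`-th subsampling rule are all illustrative assumptions, not the paper's published scheme:

```python
# Hypothetical sketch of a dual-memory bank: recent frame features are kept
# verbatim in a short-term bank; frames evicted from it are subsampled
# (every `stride`-th kept) into a bounded long-term bank, dropping
# near-duplicate neighbours. Names and the compression rule are assumptions.

class DualMemoryBank:
    def __init__(self, short_capacity=8, long_capacity=16, stride=4):
        self.short_capacity = short_capacity
        self.long_capacity = long_capacity
        self.stride = stride
        self.short_term = []   # (frame_index, feature), most recent frames
        self.long_term = []    # (frame_index, feature), compressed history
        self._seen = 0         # frames observed so far

    def add(self, frame_feature):
        self.short_term.append((self._seen, frame_feature))
        self._seen += 1
        # Evict the oldest short-term entries once capacity is exceeded.
        while len(self.short_term) > self.short_capacity:
            idx, feat = self.short_term.pop(0)
            # Compression: retain only every `stride`-th evicted frame.
            if idx % self.stride == 0:
                self.long_term.append((idx, feat))
        # Bound the long-term bank by discarding its oldest entries.
        while len(self.long_term) > self.long_capacity:
            self.long_term.pop(0)

    def memory(self):
        # Memory read for attention: long-term context, then recent frames.
        return self.long_term + self.short_term
```

With `short_capacity=3` and `stride=2`, feeding ten frames leaves frames 7-9 in the short-term bank and the even-indexed evicted frames (0, 2, 4, 6) in the long-term bank, so the memory grows far more slowly than the video length, which is the property the paper targets for long videos.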
Related papers
- OFL-SAM2: Prompt SAM2 with Online Few-shot Learner for Efficient Medical Image Segmentation [45.521771044784195]
OFL-SAM2 is a prompt-free framework for label-efficient medical image segmentation. Our core idea is to leverage limited annotated samples to train a lightweight mapping network. Experiments on three diverse MIS datasets demonstrate that OFL-SAM2 achieves state-of-the-art performance with limited training data.
arXiv Detail & Related papers (2025-12-31T13:41:16Z)
- V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence [90.92892171307055]
V2-SAM is a unified cross-view object correspondence framework. It adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. V2-SAM achieves new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).
arXiv Detail & Related papers (2025-11-25T22:06:30Z)
- SAMSON: 3rd Place Solution of LSVOS 2025 VOS Challenge [9.131199997701282]
The Large-scale Video Object Segmentation (LSVOS) challenge addresses accurately tracking and segmenting objects in long video sequences. Our method achieved a final performance of 0.8427 in terms of J&F on the test-set leaderboard.
arXiv Detail & Related papers (2025-09-22T08:30:34Z)
- SAM2-UNeXT: An Improved High-Resolution Baseline for Adapting Foundation Models to Downstream Segmentation Tasks [50.97089872043121]
We propose SAM2-UNeXT, an advanced framework that builds upon the core principles of SAM2-UNet. We extend the representational capacity of SAM2 through the integration of an auxiliary DINOv2 encoder. Our approach enables more accurate segmentation with a simple architecture, relaxing the need for complex decoder designs.
arXiv Detail & Related papers (2025-08-05T15:36:13Z)
- HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback [0.0]
We introduce HQ-SMem, for High Quality video segmentation and tracking using Smart Memory. Our approach incorporates three key innovations: (i) leveraging SAM with High-Quality masks (SAM-HQ) alongside appearance-based candidate selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) implementing a dynamic smart memory mechanism that selectively stores relevant key frames while discarding redundant ones; and (iii) dynamically updating the appearance model to effectively handle complex topological object variations and reduce drift throughout the video.
arXiv Detail & Related papers (2025-07-25T03:28:05Z)
- Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation [18.71772979219666]
We introduce Memory-Augmented (MA-)SAM2, a training-free video object segmentation strategy. MA-SAM2 exhibits strong robustness against occlusions and interactions arising from complex instrument movements. Without introducing any additional parameters or requiring further training, MA-SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis 2017 and EndoVis 2018 datasets.
arXiv Detail & Related papers (2025-07-13T11:05:25Z)
- MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents [84.62985963113245]
We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. We show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task.
arXiv Detail & Related papers (2025-06-18T19:44:46Z)
- MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection [21.22536962888316]
We present MoSAM, incorporating two key strategies to integrate object motion cues into the model and establish more reliable feature memory. MoSAM achieves state-of-the-art results compared to other competitors.
arXiv Detail & Related papers (2025-04-30T02:19:31Z)
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. It simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z)
- Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information listed and is not responsible for any consequences arising from its use.