V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
- URL: http://arxiv.org/abs/2511.20886v1
- Date: Tue, 25 Nov 2025 22:06:30 GMT
- Title: V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
- Authors: Jiancheng Pan, Runze Wang, Tianwen Qian, Mohammad Mahdi, Yanwei Fu, Xiangyang Xue, Xiaomeng Huang, Luc Van Gool, Danda Pani Paudel, Yuqian Fu
- Abstract summary: V2-SAM is a unified cross-view object correspondence framework. It adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. V2-SAM achieves new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).
- Score: 90.92892171307055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V^2-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V^2-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V^2-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V^2-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).
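The abstract's two central mechanisms lend themselves to a compact illustration: cross-view feature matching that yields a coordinate prompt (the V^2-Anchor idea), and cyclic-consistency scoring that arbitrates among several candidate masks (the PCCS idea). The sketch below is not the authors' implementation; it assumes pre-extracted, DINOv3-like patch features for each view, and all function names (`match_query_to_target`, `cyclic_consistency_select`) are hypothetical.

```python
# Minimal sketch (not the paper's code) of two ideas from the abstract:
# (1) turning cross-view feature matching into a coordinate prompt, and
# (2) post-hoc cyclic-consistency selection among candidate masks.
# Assumes L2-normalizable patch features of shape (H*W, C) per view,
# e.g., from a DINOv3-like backbone, and a non-empty query mask.
import numpy as np

def match_query_to_target(query_feats, query_mask, target_feats, target_hw):
    """Return an (x, y) patch coordinate in the target view.

    query_feats:  (Hq*Wq, C) patch features of the query (e.g., ego) view.
    query_mask:   (Hq*Wq,) boolean mask of the query object on the patch grid.
    target_feats: (Ht*Wt, C) patch features of the target (e.g., exo) view.
    target_hw:    (Ht, Wt) patch-grid size of the target view.
    """
    q = query_feats[query_mask]                        # features inside the object
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sim = t @ q.mean(axis=0)                           # cosine sim to mean object feature
    idx = int(sim.argmax())                            # best-matching target patch
    _, wt = target_hw
    y, x = divmod(idx, wt)
    return (x, y)

def cyclic_consistency_select(candidate_masks, target_feats, query_feats, query_mask):
    """Pick the candidate target mask whose back-projection best matches the query.

    candidate_masks: list of (Ht*Wt,) boolean masks from different "experts".
    Returns the index of the most cyclically consistent candidate.
    """
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    scores = []
    for m in candidate_masks:
        if not m.any():
            scores.append(0.0)
            continue
        # Map the candidate back to the query view by nearest-feature assignment.
        t = target_feats[m]
        t = t / np.linalg.norm(t, axis=1, keepdims=True)
        back = np.zeros(query_feats.shape[0], dtype=bool)
        back[(q @ t.T).argmax(axis=0)] = True          # query patches hit by back-matching
        inter = np.logical_and(back, query_mask).sum()
        union = np.logical_or(back, query_mask).sum()
        scores.append(inter / max(union, 1))           # IoU as the consistency score
    return int(np.argmax(scores))
```

In practice the returned (x, y) patch coordinate would be rescaled to pixel space and supplied to SAM2 as a positive point prompt, and the consistency score would decide between, say, the anchor-prompt and visual-prompt experts; the paper's actual prompt generators and selector are considerably richer than this sketch.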
Related papers
- Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction [47.01100029571904]
We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance.
arXiv Detail & Related papers (2026-02-22T00:53:03Z) - Evaluating SAM2 for Video Semantic Segmentation [60.157605818225186]
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos. This paper explores the extension of SAM2 to dense Video Semantic Segmentation (VSS). Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
arXiv Detail & Related papers (2025-12-01T15:15:16Z) - Robust Ego-Exo Correspondence with Long-Term Memory [34.992180181705]
We present a novel framework for establishing object-level correspondence between egocentric and exocentric views. Our approach features a dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE). In the experiments on the challenging EgoExo4D benchmark, our method, dubbed LM-EEC, achieves new state-of-the-art results.
arXiv Detail & Related papers (2025-10-13T13:54:12Z) - Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025 [93.36604217487526]
Given object queries from one perspective, the goal is to predict the corresponding object masks in another perspective. To tackle this task, we propose a multimodal condition fusion module that enhances object localization. Our proposed method ranked second on the leaderboard of the large-scale Ego-Exo4D object correspondence benchmark.
arXiv Detail & Related papers (2025-06-06T08:23:39Z) - ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models [10.858627659431928]
Service robots must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learning-based segmentation techniques require extensive annotated datasets. This paper proposes a novel approach (ZISVFM) for solving UOIS by leveraging the powerful zero-shot capability of the Segment Anything Model (SAM) and explicit visual representations from a self-supervised vision transformer (ViT).
arXiv Detail & Related papers (2025-02-05T15:22:20Z) - ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives [109.11714588441511]
The Ego-Exo object correspondence task aims to understand object relations across ego-exo perspectives through segmentation. PSALM, a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task. We propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion and SSL-based Cross-View Object Alignment.
arXiv Detail & Related papers (2024-11-28T12:01:03Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z) - Scalable Video Object Segmentation with Identification Mechanism [125.4229430216776]
This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS).
We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST).
Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
arXiv Detail & Related papers (2022-03-22T03:33:27Z)