O-MaMa @ EgoExo4D Correspondence Challenge: Learning Object Mask Matching between Egocentric and Exocentric Views
- URL: http://arxiv.org/abs/2506.06026v1
- Date: Fri, 06 Jun 2025 12:19:08 GMT
- Title: O-MaMa @ EgoExo4D Correspondence Challenge: Learning Object Mask Matching between Egocentric and Exocentric Views
- Authors: Lorenzo Mur-Labadia, Maria Santos-Villafranca, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Ruben Martinez-Cantin, Jose J. Guerrero,
- Abstract summary: This report re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) a Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego$\leftrightarrow$Exo Cross-Attention that fuses multi-perspective observations, and (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space.
- Score: 41.98584849589569
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of the correspondence task is to segment specific objects across different views. This technical report re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) A Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego$\leftrightarrow$Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects.
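To make the pipeline concrete, below is a minimal PyTorch sketch of the three learned components; the feature dimension, the residual fusion, and the batching layout are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_pool(features, masks):
    """Mask-Context pooling: average dense patch features (e.g. DINOv2
    tokens) inside each candidate mask to get one descriptor per mask.

    features: (P, D) dense patch features for one image
    masks:    (N, P) float 0/1 FastSAM candidate masks, flattened to patches
    returns:  (N, D) object-level descriptors
    """
    weights = masks / masks.sum(dim=1, keepdim=True).clamp(min=1.0)
    return weights @ features

class CrossViewAttention(nn.Module):
    """Ego<->Exo fusion: descriptors of one view attend to the other view."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, context):
        fused, _ = self.attn(queries.unsqueeze(0),
                             context.unsqueeze(0),
                             context.unsqueeze(0))
        return queries + fused.squeeze(0)  # residual fusion (an assumption)

def mask_matching_loss(ego_desc, exo_desc, pos_idx, temperature=0.07):
    """Contrastive mask matching: each ego descriptor should be most similar
    to its ground-truth exo mask (pos_idx). All other candidates act as
    negatives; O-MaMa additionally mines masks *adjacent* to the target as
    hard negatives, which this sketch omits.
    """
    sim = F.normalize(ego_desc, dim=-1) @ F.normalize(exo_desc, dim=-1).T
    return F.cross_entropy(sim / temperature, pos_idx)
```

At inference, the predicted correspondence is then simply the exo candidate whose descriptor is most similar to the query ego mask.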
Related papers
- HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model [6.641903410779405]
We propose the Hierarchical Mask Tokenizer (HiMTok), which represents segmentation masks with up to 32 tokens. HiMTok allows for compact and coarse-to-fine mask representations, aligning well with the next-token-prediction paradigm. We develop a 3-stage training recipe for progressive learning of segmentation and visual capabilities, featuring a hierarchical mask loss for effective coarse-to-fine learning.
arXiv Detail & Related papers (2025-03-17T10:29:08Z)
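HiMTok's tokenizer is learned end-to-end with the multimodal model; purely as a toy illustration of the coarse-to-fine token idea, one can quantize a mask's occupancy over progressively finer grids (the function name, grid sizes, and quantization levels here are hypothetical):

```python
import numpy as np

def mask_to_coarse_to_fine_tokens(mask, grid_sizes=(2, 4), levels=4):
    """Encode a binary mask as a coarse-to-fine sequence of discrete tokens.

    Each grid contributes grid*grid tokens, one quantized foreground
    occupancy per cell; grids (2, 4) give 4 + 16 = 20 tokens, within the
    32-token budget mentioned in the abstract. Early tokens capture coarse
    layout, later tokens refine it.
    """
    h, w = mask.shape
    tokens = []
    for g in grid_sizes:
        ch, cw = h // g, w // g
        for i in range(g):
            for j in range(g):
                cell = mask[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
                occ = float(cell.mean())              # foreground fraction
                tokens.append(min(int(occ * levels), levels - 1))
    return tokens
```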
- SketchYourSeg: Mask-Free Subjective Image Segmentation via Freehand Sketches [116.1810651297801]
SketchYourSeg establishes freehand sketches as a powerful query modality for subjective image segmentation. Our evaluations demonstrate superior performance over existing approaches across diverse benchmarks.
arXiv Detail & Related papers (2025-01-27T13:07:51Z)
- ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives [109.11714588441511]
The Ego-Exo object correspondence task aims to understand object relations across ego-exo perspectives through segmentation. PSALM, a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task. We propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion and SSL-based Cross-View Object Alignment.
arXiv Detail & Related papers (2024-11-28T12:01:03Z)
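The summary names ObjectRelator's modules without detailing them; as a hedged guess at what a multimodal condition fusion module can look like, the sketch below gates between a text-prompt embedding and a visual mask-query embedding (the class name, gating design, and dimensions are all hypothetical, not the paper's design):

```python
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Hypothetical stand-in for a multimodal condition fusion module:
    blend a text embedding and a mask-query embedding with a learned gate."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj_text = nn.Linear(dim, dim)
        self.proj_mask = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_emb, mask_emb):
        t, m = self.proj_text(text_emb), self.proj_mask(mask_emb)
        g = self.gate(torch.cat([t, m], dim=-1))  # per-channel mixing weight
        return g * t + (1.0 - g) * m              # gated blend of modalities
```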
- Pluralistic Salient Object Detection [108.74650817891984]
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image.
We present two new SOD datasets, "DUTS-MM" and "DUTS-MQ", along with newly designed evaluation metrics.
arXiv Detail & Related papers (2024-09-04T01:38:37Z)
- Cross-Task Multi-Branch Vision Transformer for Facial Expression and Mask Wearing Classification [13.995453649985732]
We propose a unified multi-branch vision transformer for facial expression recognition and mask wearing classification tasks.
Our approach extracts shared features for both tasks using a dual-branch architecture.
Our proposed framework reduces the overall complexity compared with using separate networks for both tasks.
arXiv Detail & Related papers (2024-04-22T22:02:19Z)
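A minimal sketch of the shared-backbone, two-head skeleton such a model builds on (the paper's dual-branch design and cross-task interaction are more elaborate; the head sizes here are assumptions):

```python
import torch
import torch.nn as nn
import torchvision

class DualTaskViT(nn.Module):
    """One shared ViT backbone, two lightweight task heads."""
    def __init__(self, num_expressions=7):
        super().__init__()
        vit = torchvision.models.vit_b_16(weights=None)
        vit.heads = nn.Identity()            # expose the 768-d CLS feature
        self.backbone = vit
        self.expression_head = nn.Linear(768, num_expressions)
        self.mask_head = nn.Linear(768, 2)   # mask worn / not worn

    def forward(self, x):                    # x: (B, 3, 224, 224)
        feat = self.backbone(x)
        return self.expression_head(feat), self.mask_head(feat)

expr_logits, mask_logits = DualTaskViT()(torch.randn(2, 3, 224, 224))
```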
- Variance-insensitive and Target-preserving Mask Refinement for Interactive Image Segmentation [68.16510297109872]
Point-based interactive image segmentation can ease the burden of mask annotation in applications such as semantic segmentation and image editing.
We introduce a novel method, Variance-Insensitive and Target-Preserving Mask Refinement, to enhance segmentation quality with fewer user inputs.
Experiments on GrabCut, Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art performance in interactive image segmentation.
arXiv Detail & Related papers (2023-12-22T02:31:31Z)
- Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation [29.43462426812185]
We propose a paradigm change by shifting from a per-pixel classification to a mask classification.
Our mask-based method, Mask2Anomaly, demonstrates the feasibility of integrating a mask-classification architecture.
Through comprehensive qualitative and quantitative evaluation, we show that Mask2Anomaly achieves new state-of-the-art results.
arXiv Detail & Related papers (2023-09-08T20:07:18Z)
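One common way to score anomalies from mask-classification outputs, in the spirit of (though not identical to) Mask2Anomaly: aggregate per-mask class confidences into per-pixel scores and flag pixels that no confidently classified mask claims.

```python
import torch

def anomaly_from_mask_classification(mask_logits, class_logits):
    """mask_logits:  (N, H, W) logits for N predicted masks
    class_logits: (N, K+1) per-mask class logits, last index = 'no object'
    returns:      (H, W) anomaly score in [0, 1]
    """
    mask_probs = mask_logits.sigmoid()                    # (N, H, W)
    class_probs = class_logits.softmax(-1)[..., :-1]      # (N, K)
    # per-pixel, per-class evidence aggregated over all masks
    scores = torch.einsum('nk,nhw->khw', class_probs, mask_probs)
    return (1.0 - scores.amax(dim=0)).clamp(min=0.0)
```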
- Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions [0.0]
We develop a model that comprehends a natural language instruction and generates a segmentation mask for the target everyday object.
We build a new dataset based on the well-known Matterport3D and REVERIE datasets.
The performance of MDSM surpassed that of the baseline method by a large margin of +10.13 mean IoU.
arXiv Detail & Related papers (2023-07-17T16:07:07Z)
- Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation.
Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z)
- What You See is What You Classify: Black Box Attributions [61.998683569022006]
We train a deep network, the Explainer, to predict attributions for a pre-trained black-box classifier, the Explanandum.
Unlike most existing approaches, ours is capable of directly generating very distinct class-specific masks.
We show that our attributions are superior to established methods both visually and quantitatively.
arXiv Detail & Related papers (2022-05-23T12:30:04Z)
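The generic recipe behind such explainer training, as a hedged sketch (the paper's exact objective and architecture differ): the Explainer predicts a soft mask, the frozen Explanandum must still classify the masked image correctly, and a sparsity term keeps the mask small.

```python
import torch
import torch.nn.functional as F

def explainer_step(explainer, classifier, images, labels, sparsity=1e-3):
    """One training step of a masking-based attribution objective."""
    masks = torch.sigmoid(explainer(images))   # (B, 1, H, W) soft attribution
    logits = classifier(images * masks)        # classify the masked input
    keep_class = F.cross_entropy(logits, labels)   # preserve the prediction
    return keep_class + sparsity * masks.mean()    # penalize large masks
```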