Related papers: ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos

ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos

URL: http://arxiv.org/abs/2411.19083v1
Date: Thu, 28 Nov 2024 12:01:03 GMT
Title: ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos
Authors: Yuqian Fu, Runze Wang, Yanwei Fu, Danda Pani Paudel, Xuanjing Huang, Luc Van Gool,
Abstract summary: Ego-Exo Object Correspondence task aims to map objects across ego-centric and exo-centric views.<n>We introduce ObjectRelator, a novel method designed to tackle this task.
Score: 105.40690994956667
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we focus on the Ego-Exo Object Correspondence task, an emerging challenge in the field of computer vision that aims to map objects across ego-centric and exo-centric views. We introduce ObjectRelator, a novel method designed to tackle this task, featuring two new modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse effectively fuses language and visual conditions to enhance target object localization, while XObjAlign enforces consistency in object representations across views through a self-supervised alignment strategy. Extensive experiments demonstrate the effectiveness of ObjectRelator, achieving state-of-the-art performance on Ego2Exo and Exo2Ego tasks with minimal additional parameters. This work provides a foundation for future research in comprehensive cross-view object relation understanding highlighting the potential of leveraging multimodal guidance and cross-view alignment. Codes and models will be released to advance further research in this direction.

Related papers

V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence [90.92892171307055]
V2-SAM is a unified cross-view object correspondence framework.<n>It adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators.<n>V2-SAM achieves new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence)
arXiv Detail & Related papers (2025-11-25T22:06:30Z)
DOMR: Establishing Cross-View Segmentation via Dense Object Matching [18.965608205456338]
Cross-view object correspondence involves matching objects between egocentric (first-person) and exocentric (third-person) views.<n>We propose the Dense Object Matching and Refinement (DOMR) framework to establish dense object correspondences across views.
arXiv Detail & Related papers (2025-08-06T03:23:37Z)
Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025 [93.36604217487526]
Given object queries from one perspective, the goal is to predict the corresponding object masks in another perspective.<n>To tackle this task, we propose a multimodal condition fusion module that enhances object localization.<n>Our proposed method ranked second on the leaderboard of the large-scale Ego-Exo4D object correspondence benchmark.
arXiv Detail & Related papers (2025-06-06T08:23:39Z)
v-CLR: View-Consistent Learning for Open-World Instance Segmentation [24.32192108470939]
Vanilla visual networks are biased toward learning appearance information, eg texture, to recognize objects.<n>This implicit bias causes the model to fail in detecting novel objects with textures unseen in the open-world setting.<n>We propose view-Consistent LeaRning (v-CLR), which aims to enforce the model to learn appearance-invariant representations for robust instance segmentation.
arXiv Detail & Related papers (2025-04-02T05:52:30Z)
ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models [10.858627659431928]
Service robots must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learningbased segmentation techniques require extensive annotated datasets. This paper proposes a novel approach (ZISVFM) for solving UOIS by leveraging the powerful zero-shot capability of the segment anything model (SAM) and explicit visual representations from a selfsupervised vision transformer (ViT)
arXiv Detail & Related papers (2025-02-05T15:22:20Z)
ORMNet: Object-centric Relationship Modeling for Egocentric Hand-object Segmentation [14.765419467710812]
Egocentric hand-object segmentation (EgoHOS) is a promising new task aiming at segmenting hands and interacting objects in egocentric images. This paper proposes a novel Object-centric Relationship Modeling Network (ORMNet) to fulfill end-to-end and effective EgoHOS.
arXiv Detail & Related papers (2024-07-08T03:17:10Z)
Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images. We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features. Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model. Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder. We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets. We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z)
Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics. We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention. We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve the object's inner consistency by modeling the global context, or refine objects detail along their boundaries by multi-scale feature fusion. In this paper, a new paradigm for semantic segmentation is proposed. Our insight is that appealing performance of semantic segmentation requires textitexplicitly modeling the object textitbody and textitedge, which correspond to the high and low frequency of the image. We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
arXiv Detail & Related papers (2020-07-20T12:11:22Z)
Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects. Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity. We introduce SceneFID, an object-centric adaptation of the popular Fr'echet Inception Distance metric, that is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.