InterRVOS: Interaction-aware Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2506.02356v3
- Date: Mon, 18 Aug 2025 07:41:54 GMT
- Title: InterRVOS: Interaction-aware Referring Video Object Segmentation
- Authors: Woojeong Jin, Seongchan Kim, Jaeho Lee, Seungryong Kim
- Abstract summary: We introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions.
It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction.
We present InterRVOS-127K, a large-scale dataset with over 127K automatically annotated expressions, including interaction expressions annotated with distinct masks for actor and target objects.
- Score: 44.55538737075162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression. However, most existing approaches focus on segmenting only the referred object (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. For instance, "A throwing B" implies a directional interaction, but standard RVOS segments only the actor (A), neglecting other involved target objects (B). In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. This task formulation enables fine-grained understanding of object relationships, as many video events are defined by such relationships rather than individual objects. To support this task, we propose a new evaluation protocol that separately evaluates actor and target segmentation, enabling more accurate assessment of the model's ability to distinguish and segment actor and target roles. We also present InterRVOS-127K, a large-scale dataset with over 127K automatically annotated expressions, including interaction expressions annotated with distinct masks for actor and target objects. Furthermore, we develop ReVIOSa, an MLLM-based architecture that introduces interaction-aware special tokens and leverages an attention mask loss to enhance role-specific segmentation. Extensive experiments show that ReVIOSa not only outperforms existing baselines on our proposed InterRVOS-127K evaluation set, but also achieves strong performance on standard RVOS benchmarks. Our project page is available at: https://cvlab-kaist.github.io/InterRVOS.
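The role-separated protocol is easy to make concrete. Below is a minimal sketch in Python, assuming a plain per-frame region IoU as the score; the function names and the reduction to per-role means are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

def region_j(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J (IoU) for one pair of binary masks."""
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def evaluate_roles(pred_actor, gt_actor, pred_target, gt_target):
    """Score actor and target mask tracks separately (one HxW bool mask per frame).

    Returning per-role means, instead of one pooled number, is the point of the
    protocol: it shows whether a model can distinguish actor from target roles.
    """
    actor_j = float(np.mean([region_j(p, g) for p, g in zip(pred_actor, gt_actor)]))
    target_j = float(np.mean([region_j(p, g) for p, g in zip(pred_target, gt_target)]))
    return {"actor_J": actor_j, "target_J": target_j}
```

A pooled RVOS score would hide a model that always segments the actor and ignores the target; reporting actor_J and target_J separately exposes exactly that failure mode.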
Related papers
- rt-RISeg: Real-Time Model-Free Robot Interactive Segmentation for Active Instance-Level Object Understanding [7.264443471771696]
We propose a novel real-time interactive perception framework, rt-RISeg, that continuously segments unseen objects through robot interactions.
We demonstrate that the relative rotational and linear velocities of randomly sampled body frames, resulting from selected robot interactions, can be used to identify objects without any learned segmentation model.
We showcase the effectiveness of our proposed interactive perception method by achieving an average object segmentation accuracy 27.5% greater than state-of-the-art UOIS methods.
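The geometric cue described here (rigidly attached points move together under an interaction) can be illustrated with a toy sketch. The greedy grouping and the tolerance below are assumptions for illustration, not rt-RISeg's actual algorithm, and a fuller version would compare rotational velocities as well.

```python
import numpy as np

def group_by_velocity(points: np.ndarray, velocities: np.ndarray, tol: float = 0.05):
    """Greedily group sampled body frames whose linear velocities agree.

    points: (N, 2) frame positions; velocities: (N, 2) measured velocities.
    Frames that move together (within tol) are assumed to lie on one rigid object.
    """
    labels = -np.ones(len(points), dtype=int)
    next_label = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        for j in range(i + 1, len(points)):
            if labels[j] == -1 and np.linalg.norm(velocities[i] - velocities[j]) < tol:
                labels[j] = next_label
        next_label += 1
    return labels  # one integer label per sampled frame = one hypothesized object
```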
arXiv Detail & Related papers (2025-07-14T20:02:52Z)
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation.
Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction.
Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
- RefCut: Interactive Segmentation with Reference Guidance [44.872055134890864]
RefCut is a reference-based interactive segmentation framework that addresses part ambiguity and object ambiguity.
Our code will be publicly available, and a demo video is available at https://www.lin-zheng.com/refcut.
arXiv Detail & Related papers (2025-03-22T17:14:20Z)
- MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation [14.144097766150397]
We present a dataset called Multi-target and Multi-granularity Reasoning (MMR).
MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects.
We propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation.
arXiv Detail & Related papers (2025-03-18T04:23:09Z)
- ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives [109.11714588441511]
The Ego-Exo object correspondence task aims to understand object relations across ego-exo perspectives through segmentation.
PSALM, a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task.
We propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion and SSL-based Cross-View Object Alignment.
arXiv Detail & Related papers (2024-11-28T12:01:03Z)
- CaRe-Ego: Contact-aware Relationship Modeling for Egocentric Interactive Hand-object Segmentation [14.765419467710812]
Egocentric interactive hand-object segmentation (EgoIHOS) is crucial for understanding human behavior in assistive systems.
Previous methods recognize hands and interacting objects as distinct semantic categories based solely on visual features.
We propose CaRe-Ego, which emphasizes the contact between hands and objects from two aspects.
arXiv Detail & Related papers (2024-07-08T03:17:10Z)
- Learning from Exemplars for Interactive Image Segmentation [15.37506525730218]
We introduce novel interactive segmentation frameworks for both a single object and multiple objects in the same category.
Our model reduces users' labor by around 15%, requiring two fewer clicks to achieve target IoUs of 85% and 90%.
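The click-count comparison refers to the standard number-of-clicks (NoC@IoU) metric: how many simulated user clicks a model needs before its mask reaches a target IoU. A minimal sketch, where segment and next_click are stand-ins for the model under test and the usual click simulator:

```python
import numpy as np

def iou(pred, gt):
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def noc(segment, image, gt_mask, next_click, target_iou=0.85, max_clicks=20):
    """Number of Clicks: clicks needed until IoU reaches target_iou.

    segment(image, clicks) -> bool mask is the model; next_click(pred, gt) -> click
    simulates a user clicking the largest error region. Both are placeholders.
    """
    clicks, pred = [], np.zeros_like(gt_mask, dtype=bool)
    for n in range(1, max_clicks + 1):
        clicks.append(next_click(pred, gt_mask))
        pred = segment(image, clicks)
        if iou(pred, gt_mask) >= target_iou:
            return n
    return max_clicks  # convention: charge the full budget if the target is never reached
```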
arXiv Detail & Related papers (2024-06-17T12:38:01Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
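One common way to realize the "filtering out unstable points" step is a forward-backward consistency check: track each point to the end of the clip, track the result back, and drop points that fail to return near their start. Below is a generic sketch of that test; the track interface is an assumption, not I-PT's specific tracker.

```python
import numpy as np

def stable_points(track, frames, points, thresh=2.0):
    """Keep points that pass a forward-backward tracking consistency check.

    track(frames, pts) -> pts propagates (N, 2) points across frames; it is a
    placeholder for any point tracker. Drifting (unstable) points are dropped.
    """
    fwd = track(frames, points)        # track forward to the last frame
    back = track(frames[::-1], fwd)    # track the results back to the first frame
    err = np.linalg.norm(back - points, axis=1)
    return points[err < thresh]
```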
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Segmentation challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z)
- Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition [21.655278000690686]
We propose an end-to-end object-centric action recognition framework.
It simultaneously performs Detection And Interaction Reasoning in one stage.
We conduct experiments on two datasets, Something-Else and Ikea-Assembly.
arXiv Detail & Related papers (2024-04-18T05:06:12Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions [93.35942025232943]
We propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments.
The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms.
arXiv Detail & Related papers (2023-08-16T17:58:34Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
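Contrastive alignment of this kind is typically an InfoNCE-style objective: the referred object's visual feature should match the sentence embedding more closely than any other object's. The sketch below is a generic formulation of that idea, not PCAN's exact CLUM loss.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(obj_feats, text_feat, pos_idx, tau=0.07):
    """InfoNCE alignment: object pos_idx should best match the text embedding.

    obj_feats: (N, D) per-object features; text_feat: (D,) sentence feature.
    A generic multi-modal contrastive loss, assumed for illustration.
    """
    obj = F.normalize(obj_feats, dim=-1)
    txt = F.normalize(text_feat, dim=-1)
    logits = (obj @ txt) / tau  # similarity of each object to the text
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([pos_idx]))
```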
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
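The two-stage design reads naturally as a pipeline: build candidate tracklets once, then let a grounding module pick the one matching the expression. A structural sketch with all three components left abstract; the callables are placeholders, not the paper's implementations.

```python
def top_down_rvos(video, expression, detect, propagate, ground):
    """Two-stage, top-down RVOS.

    detect(frame) -> masks; propagate(video, mask, t) -> tracklet;
    ground(tracklets, expression) -> index of the best-matching tracklet.
    """
    # Stage 1: exhaustive tracklet construction from a few sampled frames.
    sampled = range(0, len(video), max(1, len(video) // 4))
    tracklets = [propagate(video, m, t) for t in sampled for m in detect(video[t])]
    # Stage 2: tracklet-language grounding selects the referred object.
    return tracklets[ground(tracklets, expression)]
```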
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
- A Deep Learning Approach to Object Affordance Segmentation [31.221897360610114]
We design an autoencoder that infers pixel-wise affordance labels in both videos and static images.
Our model removes the need for object labels and bounding boxes by using a soft-attention mechanism.
We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF.
arXiv Detail & Related papers (2020-04-18T15:34:41Z)