Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence
- URL: http://arxiv.org/abs/2509.16677v1
- Date: Sat, 20 Sep 2025 13:03:43 GMT
- Title: Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence
- Authors: Wenxin Li, Kunyu Peng, Di Wen, Ruiping Liu, Mengfei Duan, Kai Luo, Kailun Yang
- Abstract summary: Action-based video object segmentation links segmentation with action semantics, but its annotations and prompts are prone to noise. We take the first step by studying action-based video object segmentation under label noise. We adapt six label-noise learning strategies to this setting and establish protocols for evaluating them.
- Score: 22.45673628231233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noise for the action-based video object segmentation task. Second, we build ActiSeg-NL, the first benchmark for action-based video object segmentation under label noise, adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off: some achieve balanced performance, while others prioritize foreground accuracy at the cost of background precision. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL.
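The two noise sources named in the abstract can be illustrated with a small sketch. This is not the ActiSeg-NL protocol, whose details the abstract does not give. The vocabulary `VOCAB`, the function names, and the choice of 4-neighbour dilation and erosion as the boundary perturbation are all assumptions made for illustration, and masks are represented as sets of `(row, col)` pixels to keep the example dependency-free.

```python
import random

# Hypothetical category -> nouns vocabulary (illustrative, not from the paper).
VOCAB = {
    "tool": ["knife", "spoon", "scissors"],
    "container": ["cup", "bowl", "bottle"],
}

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def perturb_prompt(noun, mode, rng):
    """Swap a prompt noun: 'within' stays in its category, 'flip' leaves it."""
    category = next(c for c, nouns in VOCAB.items() if noun in nouns)
    if mode == "within":
        candidates = [n for n in VOCAB[category] if n != noun]
    elif mode == "flip":
        candidates = [n for c, nouns in VOCAB.items() if c != category for n in nouns]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return rng.choice(candidates)


def dilate(mask):
    """Grow the mask by one pixel (4-neighbourhood, no image-bounds clipping)."""
    return mask | {(r + dr, c + dc) for r, c in mask for dr, dc in NEIGHBORS}


def erode(mask):
    """Keep only pixels whose four neighbours all lie inside the mask."""
    return {(r, c) for r, c in mask
            if all((r + dr, c + dc) in mask for dr, dc in NEIGHBORS)}


def perturb_boundary(mask, steps, rng):
    """Mimic an imprecise annotation by growing or shrinking the boundary."""
    op = rng.choice([dilate, erode])
    noisy = set(mask)
    for _ in range(steps):
        noisy = op(noisy)
    return noisy


def iou(a, b):
    """Intersection over union of two pixel sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Under these assumptions, a mixed-noise sample would apply `perturb_prompt` to the prompt noun and `perturb_boundary` to the mask of the same training example, and `iou` between the clean and noisy masks gives a rough measure of how severe the boundary perturbation is.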
Related papers
- Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation [79.13636675697096]
Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS) is a task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations. We propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information.
arXiv Detail & Related papers (2026-02-03T07:47:59Z)
- LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance [54.683384204063934]
Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension. We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation. LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
arXiv Detail & Related papers (2025-07-08T07:46:26Z)
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D is a visual-language task that segments all points of the specified object from a 3D point cloud described by a query sentence.
We propose a novel Referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
- Noisy Annotations in Semantic Segmentation [5.139071616097179]
This study sheds light on the quality of segmentation masks produced by various models. It challenges the efficacy of popular methods designed to address learning with label noise.
arXiv Detail & Related papers (2024-06-16T10:49:23Z)
- Exploratory Evaluation of Speech Content Masking [7.012446339121189]
We introduce a toy problem that explores an emerging type of privacy called "content masking".
We evaluate a baseline masking technique based on modifying sequences of discrete phone representations (phone codes).
We investigate three different masking locations and three types of masking strategies: noise substitution, word deletion, and phone sequence reversal.
arXiv Detail & Related papers (2024-01-08T14:56:03Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Progressively Dual Prior Guided Few-shot Semantic Segmentation [57.37506990980975]
The few-shot semantic segmentation task aims to segment query images using a few annotated support samples.
We propose a progressively dual prior guided few-shot semantic segmentation network.
arXiv Detail & Related papers (2022-11-20T16:19:47Z)
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve a new state of the art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
- Static object detection and segmentation in videos based on dual foregrounds difference with noise filtering [0.0]
This paper presents a static object detection and segmentation method for videos of cluttered scenes.
The proposed method was built for a rock breaker station application and validated with real, synthetic, and two public data sets.
arXiv Detail & Related papers (2020-12-19T15:01:59Z)
- Towards Noise-resistant Object Detection with Noisy Annotations [119.63458519946691]
Training deep object detectors requires a significant amount of human-annotated images with accurate object labels and bounding box coordinates.
Noisy annotations are much more easily accessible, but they could be detrimental for learning.
We address the challenging problem of training object detectors with noisy annotations, where the noise contains a mixture of label noise and bounding box noise.
arXiv Detail & Related papers (2020-03-03T01:32:16Z)