Fugu-MT Paper Translation (Abstract): Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence

Paper overview: Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence

arxiv url: http://arxiv.org/abs/2509.16677v1
Date: Sat, 20 Sep 2025 13:03:43 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:15.919921
Title: Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence
Title (reference translation): Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence
Authors: Wenxin Li, Kunyu Peng, Di Wen, Ruiping Liu, Mengfei Duan, Kai Luo, Kailun Yang
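Abstract summary: Action-based video object segmentation links segmentation with action semantics in order to segment the objects involved in interactions. This work takes a first step in studying action-based video object segmentation under label noise. It adapts six label-noise learning strategies to this setting and establishes protocols for evaluating them.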
Reference score (independently computed attention score): 22.45673628231233
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noise for the action-based video object segmentation task. Second, we build ActiSeg-NL, the first benchmark for action-based video object segmentation under label noise, adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off: some achieve balanced performance, while others prioritize foreground accuracy at the cost of background precision. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL.
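Abstract (reference translation): Embodied intelligence depends on accurately segmenting the objects that actively take part in interactions. Action-based video object segmentation tackles this by linking segmentation with action semantics, but it relies on large-scale annotations and prompts that are costly, inconsistent, and susceptible to multimodal noise such as imprecise masks and referential ambiguity. So far this challenge has not been explored. This work studies action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (object boundaries perturbed to mimic imprecise supervision). The contributions are threefold. First, two types of label noise are introduced for the action-based video object segmentation task. Second, ActiSeg-NL is built as the first benchmark for action-based video object segmentation under label noise, six label-noise learning strategies are adapted to this setting, and protocols are established for evaluating them under textual, boundary, and mixed noise. Third, a comprehensive analysis links noise types to failure modes and robustness gains, and a Parallel Mask Head Mechanism (PMHM) is introduced to handle mask annotation noise. Qualitative evaluation further reveals characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations and occasional identity substitutions under textual flips. The comparative analysis shows that different learning strategies have distinct robustness profiles governed by a foreground-background trade-off: some achieve balanced performance, while others favor foreground accuracy at the expense of background precision. The benchmark and source code will be released at https://github.com/mylwx/ActiSeg-NL.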
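The two noise sources named in the abstract can be illustrated with a small sketch. The abstract does not specify how ActiSeg-NL generates its noise, so everything below is an assumption made for illustration: the toy `TAXONOMY`, the function names, and the choice of a one-pixel, 4-connected dilation or erosion as the boundary perturbation are all hypothetical and are not taken from the paper or its repository.

```python
import random

# Hypothetical toy taxonomy; the benchmark's real categories are not given here.
TAXONOMY = {
    "utensil": ["knife", "spoon", "fork"],
    "container": ["bowl", "cup"],
}


def category_of(noun):
    """Return the category a noun belongs to in the toy taxonomy."""
    for category, nouns in TAXONOMY.items():
        if noun in nouns:
            return category
    raise KeyError(noun)


def substitute_within_category(noun, rng):
    """Textual noise: swap the noun for a different noun of the same category."""
    candidates = [n for n in TAXONOMY[category_of(noun)] if n != noun]
    return rng.choice(candidates)


def flip_category(noun, rng):
    """Textual noise: swap the noun for a noun drawn from another category."""
    own = category_of(noun)
    candidates = [n for c, nouns in TAXONOMY.items() if c != own for n in nouns]
    return rng.choice(candidates)


def perturb_boundary(mask, grow):
    """Mask noise: dilate (grow=True) or erode (grow=False) a binary mask
    by one pixel using a 4-connected neighbourhood.
    Pixels outside the image are treated as background (0)."""
    h, w = len(mask), len(mask[0])

    def at(r, c):
        return mask[r][c] if 0 <= r < h and 0 <= c < w else 0

    out = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [at(r, c), at(r - 1, c), at(r + 1, c), at(r, c - 1), at(r, c + 1)]
            row.append(max(window) if grow else min(window))
        out.append(row)
    return out
```

Mixed noise, in this sketch, would simply apply one textual function to the prompt and `perturb_boundary` to the corresponding mask of the same training sample.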

Related papers