Fugu-MT Paper Translation (Summary): EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities

Paper summary: EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities

arxiv url: http://arxiv.org/abs/2603.18082v1
Date: Wed, 18 Mar 2026 07:55:24 GMT
Status: Translation complete
Last updated in system: 2026-03-20 17:19:05.749644
Title: EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities
Title (reference translation): EgoAdapt: Improving Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities
Authors: Xinyuan Qian, Xinjia Zhu, Alessio Brutti, Dong Liang
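Abstract summary: This study introduces EgoAdapt, an adaptive framework for speaker detection under missing modalities. EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module, (2) a Parallel Shared-weight Audio (PSA) encoder for audio feature extraction, and (3) a Visual Modality Missing Awareness (VMMA) module. EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%.
Reference score (independently computed attention score): 18.332508545927578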
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The TTM (Talking to Me) task is a pivotal component in understanding human social interactions, aiming to determine who is engaged in conversation with the camera-wearer. Traditional models often face challenges in real-world scenarios due to missing visual data, neglect of the role of head orientation, and background noise. This study addresses these limitations by introducing EgoAdapt, an adaptive framework designed for robust egocentric "Talking to Me" speaker detection under missing modalities. Specifically, EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, allowing a comprehensive interpretation of both verbal and non-verbal signals to address TTM, which sets it apart from tasks focused solely on detecting speaking status; (2) a Parallel Shared-weight Audio (PSA) encoder for enhanced audio feature extraction in noisy environments; and (3) a Visual Modality Missing Awareness (VMMA) module that estimates the presence or absence of each modality at each frame to adjust the system response dynamically. Comprehensive evaluations on the TTM benchmark of the Ego4D dataset demonstrate that EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%, significantly outperforming the state-of-the-art method by 4.96% in Accuracy and 1.56% in mAP.
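Abstract (reference translation): The TTM (Talking to Me) task is a key component in understanding human social interaction; it aims to determine who is engaged in conversation with the camera wearer. Conventional models often struggle in real-world scenarios because of missing visual data, neglect of the role of head orientation, and background noise. This study addresses these limitations by introducing EgoAdapt, an adaptive framework for egocentric "Talking to Me" speaker detection. Specifically, EgoAdapt incorporates three main modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, enabling a comprehensive interpretation of both verbal and non-verbal signals to address TTM; (2) a Parallel Shared-weight Audio (PSA) encoder for improved audio feature extraction in noisy environments; and (3) a Visual Modality Missing Awareness (VMMA) module that estimates the presence or absence of each modality at each frame and dynamically adjusts the system response.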
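
The abstract reports results as mean Average Precision (mAP) and Accuracy (Acc). As a rough illustration of what these two metrics measure for a binary "talking to me / not talking to me" decision, the sketch below computes non-interpolated average precision for a single positive class and thresholded accuracy. This is only a minimal sketch under stated assumptions: the function names, the 0.5 threshold, and the toy labels and scores are hypothetical, and the exact evaluation protocol of the Ego4D TTM benchmark is not described on this page. mAP would be the mean of such AP values over the classes being evaluated.

```python
def average_precision(labels, scores):
    """Non-interpolated average precision for one binary class.

    labels: list of 0/1 ground-truth values (1 = positive, e.g. "talking to me")
    scores: list of predicted confidences for the positive class
    """
    # Rank samples from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision measured at the rank of each true positive.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label.

    The 0.5 threshold is an assumption made for this illustration.
    """
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


# Toy data, invented purely for illustration (not from the paper).
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]

ap = average_precision(toy_labels, toy_scores)
acc = accuracy(toy_labels, toy_scores)
print(f"AP = {ap:.4f}, Acc = {acc:.2f}")
```

The two metrics can move independently: AP depends only on how the scores rank positives against negatives, while accuracy depends on where the decision threshold falls, which is one reason a paper may report different margins for each.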


Related papers