Fugu-MT Paper Translation (Abstract): Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition

Paper abstract: Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition

arxiv url: http://arxiv.org/abs/2603.15818v1
Date: Mon, 16 Mar 2026 18:49:46 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:06.951354
Title: Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition
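Title (reference translation): Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition
Authors: Salah Eddine Bekhouche, Hichem Telli, Azeddine Benlamoudi, Salah Eddine Herrouz, Abdelmalik Taleb-Ahmed, Abdenour Hadid
Abstract summary: Ambivalence and hesitancy (A/H) are subtle affective states in which a person displays conflicting signals through different channels. The paper presents ConflictAwareAH, a multimodal framework built for this problem.
Reference score (attention score computed independently by this site): 6.866068262311036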
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Ambivalence and hesitancy (A/H) are subtle affective states where a person shows conflicting signals through different channels -- saying one thing while their face or voice tells another story. Recognising these states automatically is valuable in clinical settings, but it is hard for machines because the key evidence lives in the disagreements between what is said, how it sounds, and what the face shows. We present ConflictAwareAH, a multimodal framework built for this problem. Three pre-trained encoders extract video, audio, and text representations. Pairwise conflict features -- element-wise absolute differences between modality embeddings -- serve as bidirectional cues: large cross-modal differences flag A/H, while small differences confirm behavioural consistency and anchor the negative class. This conflict-aware design addresses a key limitation of text-dominant approaches, which tend to over-detect A/H (high F1-AH) while struggling to confirm its absence: our multimodal model improves F1-NoAH by +4.6 points over text alone and halves the class-performance gap. A complementary text-guided late fusion strategy blends a text-only auxiliary head with the full model at inference, adding +4.1 Macro F1. On the BAH dataset from the ABAW10 Ambivalence/Hesitancy Challenge, our method reaches 0.694 Macro F1 on the labelled test split and 0.715 on the private leaderboard, outperforming published multimodal baselines by over 10 points -- all on a single GPU in under 25 minutes of training.
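The pairwise conflict features mentioned in the abstract can be illustrated with a minimal sketch. It assumes each modality has already been encoded into a vector of the same dimension. The function name `conflict_features`, the dictionary layout, and the toy values are hypothetical and are not taken from the paper's code.

```python
from itertools import combinations


def conflict_features(embeddings):
    """Return element-wise absolute differences for every pair of modalities.

    `embeddings` maps a modality name to a list of floats. All vectors are
    assumed to share one dimension (an assumption of this sketch).
    """
    dims = {len(vec) for vec in embeddings.values()}
    if len(dims) != 1:
        raise ValueError("all modality embeddings must have the same dimension")

    features = {}
    # Sort names so the pair order is deterministic.
    for a, b in combinations(sorted(embeddings), 2):
        features[(a, b)] = [abs(x - y) for x, y in zip(embeddings[a], embeddings[b])]
    return features


# Toy embeddings (made-up values, for illustration only).
example = {
    "video": [1.0, 0.5, -0.5],
    "audio": [0.5, 0.5, 0.5],
    "text": [1.0, -0.5, 0.0],
}

for pair, diff in conflict_features(example).items():
    print(pair, diff)
```

Large entries in a difference vector mark dimensions where two modalities disagree, and entries near zero mark agreement. This is the sense in which the abstract calls the cue bidirectional. How these vectors are combined with the raw embeddings downstream is not stated in the abstract and is left out here.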
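Abstract (reference translation): Ambivalence and hesitancy (A/H) are subtle affective states in which a person shows conflicting signals through different channels. Recognising these states automatically is useful in clinical settings, but it is hard for machines because the key evidence lies in the disagreements between what is said, how it sounds, and what the face shows. This paper presents ConflictAwareAH, a multimodal framework built for this problem. Three pre-trained encoders extract video, audio, and text representations. Pairwise conflict features (element-wise absolute differences between modality embeddings) serve as bidirectional cues: large cross-modal differences flag A/H, while small differences confirm behavioural consistency and anchor the negative class. This conflict-aware design addresses a key limitation of text-dominant approaches, which tend to over-detect A/H (high F1-AH) while struggling to confirm its absence: the multimodal model improves F1-NoAH by +4.6 points over text alone and halves the class-performance gap. A complementary text-guided late fusion strategy blends a text-only auxiliary head with the full model at inference, adding +4.1 Macro F1. On the BAH dataset from the ABAW10 Ambivalence/Hesitancy Challenge, the method reaches 0.694 Macro F1 on the labelled test split and 0.715 on the private leaderboard, outperforming published multimodal baselines by over 10 points.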
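The text-guided late fusion step, which blends a text-only auxiliary head with the full model at inference, could look roughly like the sketch below. The abstract does not give the blending rule, so the convex combination, the `weight` parameter, the 0.5 decision threshold, and the function names are all assumptions made for illustration.

```python
def late_fusion(p_full, p_text, weight=0.5):
    """Blend two A/H probabilities with a convex combination.

    p_full: probability of A/H from the full multimodal model.
    p_text: probability of A/H from the text-only auxiliary head.
    weight: share given to the text-only head (assumed, not from the paper).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * p_text + (1.0 - weight) * p_full


def predict(p_full, p_text, weight=0.5, threshold=0.5):
    """Return 1 (A/H) if the blended probability reaches the threshold, else 0."""
    return int(late_fusion(p_full, p_text, weight) >= threshold)


# Toy probabilities (made-up values, for illustration only).
print(late_fusion(0.25, 0.75))  # equal-weight blend of the two heads
print(predict(0.25, 0.5))       # blended score falls below the threshold
```

With `weight=0.0` the sketch reduces to the full model alone, and with `weight=1.0` to the text-only head alone. In practice such a weight would be tuned on validation data.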

Related papers

Divide and Refine: Enhancing Multimodal Representation and Explainability for Emotion Recognition in Conversation [2.5884126726585777]
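Multimodal emotion recognition in conversation requires representations that integrate signals from multiple modalities. Methods based on contrastive learning and augmentation have made progress, but they often overlook the role of data preparation in preserving these components. The authors propose a two-phase framework, Divide and Refine (DnR). The results highlight the effectiveness of explicitly dividing, refining, and recombining multimodal representations as a principled strategy for advancing emotion recognition.
Paper reference translation (metadata) (2026-01-10T07:30:20Z)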
Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations [1.1391158217994781]
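The authors tackle the diacritic restoration (DR) task for Arabic dialectal sentences using a multimodal approach. They propose a model that represents the text modality with an encoder extracted from a pre-trained model named CATT. Experiments show that the proposed approach achieves a word error rate of 0.25 and a character error rate of 0.9.
Paper reference translation (metadata) (2025-10-28T09:58:18Z)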
LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection [6.949059287049708]
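This paper introduces LadderSym, a new method for music error detection. LadderSym is guided by two key observations about state-of-the-art approaches. The authors evaluate on the MAESTRO-E and CocoChorales-E datasets, measuring the F1 score for each note category.
Paper reference translation (metadata) (2025-09-16T02:15:06Z)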
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers [79.94246924019984]
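Multimodal Diffusion Transformers (MM-DiT) have made remarkable progress in text-driven visual generation. The authors propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions. The work highlights the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models.
Paper reference translation (metadata) (2025-06-09T17:54:04Z)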
Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition [28.93482989766411]
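The authors propose a method that enriches character features and improves the discriminability of characters. CACE introduces a decay matrix in each block to explicitly guide the attention region of each token. I2CL improves the discriminability of features by learning a long-term memory unit for each character category.
Paper reference translation (metadata) (2024-07-08T02:33:29Z)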
Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
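The authors propose an approach that mines cross-modal semantics to guide the fusion and decoding of multimodal features. Specifically, they propose a new network, XMSNet, consisting of (1) all-round attentive fusion (AF), (2) a coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
Paper reference translation (metadata) (2023-05-17T14:30:11Z)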
Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
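A key problem in zero-shot open-vocabulary detection is how to align visual and textual features. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pre-training. This paper proposes three methods to alleviate these issues. First, a simple scheme augments the text embeddings, which prevents overfitting to the small number of classes seen during training. Second, the feature pyramid network and the detection head are modified to include trainable shortcuts. Finally, a self-training approach is used to leverage a larger corpus.
Paper reference translation (metadata) (2023-03-23T17:59:53Z)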
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix [59.25846149124199]
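This paper proposes a data augmentation method, namely cross-modal CutMix (CMC). CMC transforms natural sentences from the textual view into a multimodal view. By attaching cross-modal noise to unimodal data, it guides the model to learn token-level interactions across modalities for better denoising.
Paper reference translation (metadata) (2022-06-17T17:56:47Z)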

The related papers list is generated automatically from the titles and abstracts of papers on this site.

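This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use.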