Fugu-MT paper translation (abstract): Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge

Paper summary: Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge

arxiv url: http://arxiv.org/abs/2603.23650v1
Date: Tue, 24 Mar 2026 18:49:49 GMT
Status: translation complete
Last updated in system: 2026-03-26 21:06:10.994406
Title: Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge
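Title (reference translation): A Multimodal Fusion Approach for the BLEMORE Challenge
Authors: Masoumeh Chapariniya, Aref Farhadipour, Sarah Ebling, Volker Dellwo, Teodora Vukovic
Abstract summary: This paper presents our system for the BLEMORE Challenge at FG 2026 on blended emotion recognition with relative salience prediction. Our 12-encoder system achieves Score = 0.279 (ACCP = 0.391, ACCS = 0.168) on the test set.
Reference score (attention score computed by this site): 5.518749541105996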
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present our system for the BLEMORE Challenge at FG 2026 on blended emotion recognition with relative salience prediction. Our approach combines six encoder families through late probability fusion: an S4D-ViTMoE face encoder adapted with soft-label KL training, frozen layer-selective Wav2Vec2 audio features, finetuned body-language encoders (TimeSformer, VideoMAE), and -- for the first time in emotion recognition -- Gemini Embedding 2.0, a large multimodal model whose video embeddings produce competitive presence accuracy (ACCP = 0.320) from only 2 seconds of input. Three key findings emerge from our experiments: selecting prosody-encoding layers (6--12) from frozen Wav2Vec2 outperforms end-to-end finetuning (Score 0.207 vs. 0.161), as the non-verbal nature of BLEMORE audio makes phonetic layers irrelevant; the post-processing salience threshold $β$ varies from 0.05 to 0.43 across folds, revealing that personalized expression styles are the primary bottleneck; and task-adapted encoders collectively receive 62\% of ensemble weight over general-purpose baselines. Our 12-encoder system achieves Score = 0.279 (ACCP = 0.391, ACCS = 0.168) on the test set, placing 6th.
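Abstract (reference translation): This paper presents our system for the BLEMORE Challenge at FG 2026 on blended emotion recognition with relative salience prediction. The approach combines six encoder families, including an S4D-ViTMoE face encoder adapted with soft-label KL training, frozen layer-selective Wav2Vec2 audio features, and finetuned body-language encoders (TimeSformer, VideoMAE). Three key findings emerge: selecting prosody-encoding layers (6-12) from frozen Wav2Vec2 outperforms end-to-end finetuning (Score 0.207 vs. 0.161), because the non-verbal nature of BLEMORE audio makes phonetic layers irrelevant; the post-processing salience threshold β varies from 0.05 to 0.43 across folds, indicating that personalized expression styles are the primary bottleneck; and task-adapted encoders collectively receive 62% of the ensemble weight. Our 12-encoder system achieves Score = 0.279 (ACCP = 0.391, ACCS = 0.168) on the test set.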
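The abstract names two mechanisms, late probability fusion and a post-processing salience threshold β, without giving their exact form. The sketch below is one plausible reading, not the authors' implementation: it assumes fusion is a weighted average of per-encoder class probabilities, and that β is the minimum gap in fused probability needed to call the top emotion dominant over the runner-up. The label set, the encoder outputs, the weights, and the function names `late_fusion` and `decode_top_two` are all hypothetical.

```python
import numpy as np

# Placeholder label set, assumed for illustration only.
LABELS = ["anger", "disgust", "fear", "happiness", "sadness"]


def late_fusion(encoder_probs, weights):
    """Weighted average of per-encoder class-probability vectors.

    encoder_probs: shape (n_encoders, n_classes), each row sums to 1.
    weights: shape (n_encoders,), non-negative ensemble weights.
    """
    probs = np.asarray(encoder_probs, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()              # normalize the ensemble weights
    fused = w @ probs            # shape (n_classes,)
    return fused / fused.sum()   # guard against rounding drift


def decode_top_two(fused, labels, beta):
    """Assumed decoding rule: compare the two most probable emotions.

    If their probability gap is below beta they are treated as equally
    salient; otherwise the first one is treated as dominant.
    """
    order = np.argsort(fused)[::-1]
    first, second = order[0], order[1]
    gap = fused[first] - fused[second]
    relation = "equal" if gap < beta else "dominant"
    return labels[first], labels[second], relation


# Made-up outputs of three encoders for a single clip.
encoder_probs = [
    [0.50, 0.10, 0.10, 0.20, 0.10],
    [0.40, 0.05, 0.05, 0.45, 0.05],
    [0.30, 0.10, 0.10, 0.40, 0.10],
]
weights = [0.5, 0.3, 0.2]

fused = late_fusion(encoder_probs, weights)
for beta in (0.05, 0.43):  # the two extremes of beta reported across folds
    print(beta, decode_top_two(fused, LABELS, beta))
```

With these made-up numbers the same fused prediction is decoded as a dominant/secondary pair at β = 0.05 and as an equal blend at β = 0.43, which illustrates why a threshold that varies this much across folds can limit salience accuracy.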

Related papers

Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR [9.626217175791572]
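This paper proposes a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack, without an external speech encoder or a pretrained large language model (LLM). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks. On Common Voice 16.1, a single multilingual model covering five languages reduces average WER from 12.2% to 10.6%.
Reference translation of paper (metadata) (2026-02-13T02:53:54Z)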
WavLink: Compact Audio-Text Embeddings with a Global Whisper Token [4.000493292896401]
We present WavLink, a compact audio-text embedding model that augments a Whisper encoder with a learnable global token. Combining a two-stage training recipe across three model sizes with Matryoshka-style supervision improves scalability and enables embeddings that are 8x smaller with minimal performance loss.
Reference translation of paper (metadata) (2026-01-21T15:55:58Z)
Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems [2.9034429823924865]
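This paper focuses on multilingual speech recognition and language modeling with large language models (LLMs) for the MLC-SLM Challenge 2025. The system achieves competitive performance, with a private-test average WER/CER of 16.63% using Gemma3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.
Reference translation of paper (metadata) (2025-06-16T15:23:07Z)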
MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
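We propose an AVSR method based on multi-layer cross-attention fusion, which strengthens the representation of each modality by fusing them at different levels of the audio and visual encoders. The proposed method surpasses the first-place system and establishes a new SOTA cpCER of 29.13% on this dataset.
Reference translation of paper (metadata) (2024-01-07T08:59:32Z)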
Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors [117.61449210940955]
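We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level. We introduce an approach that weights tokens based on motion gradients, shifting the focus from the static background scene to foreground objects. We generate synthetic abnormal events to augment the training videos and use the masked AE model to jointly reconstruct the original frames.
Reference translation of paper (metadata) (2023-06-21T06:18:05Z)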
Low-complexity deep learning frameworks for acoustic scene classification [64.22762153453175]
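We propose low-complexity deep learning frameworks for acoustic scene classification (ASC). The proposed frameworks can be divided into four main steps: front-end spectrogram extraction, online data augmentation, back-end classification, and late fusion of predicted probabilities. Experiments on the DCASE 2022 Task 1 Development dataset show that the frameworks fully meet the low-complexity requirement and achieve a best classification accuracy of 60.1%.
Reference translation of paper (metadata) (2022-06-13T11:41:39Z)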
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [81.53783563025084]
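This paper proposes an offline clustering step that provides aligned target labels for a BERT-like prediction loss. A key ingredient of the approach is applying the prediction loss only over the masked regions. HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
Reference translation of paper (metadata) (2021-06-14T14:14:28Z)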

The related papers list is generated automatically from the titles and abstracts of papers on this site.

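The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).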