Fugu-MT Paper Translation (Abstract): MTLLFM: Multimodal-Temporal Laughter Localization: UR-FUNNY-Temporal and SMILE-Temporal Benchmarks with an Adaptive Multimodal Fusion Model

Paper overview: MTLLFM: Multimodal-Temporal Laughter Localization: UR-FUNNY-Temporal and SMILE-Temporal Benchmarks with an Adaptive Multimodal Fusion Model

arxiv url: http://arxiv.org/abs/2605.25409v1
Date: Mon, 25 May 2026 04:21:37 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:19.282329
Title: MTLLFM: Multimodal-Temporal Laughter Localization: UR-FUNNY-Temporal and SMILE-Temporal Benchmarks with an Adaptive Multimodal Fusion Model
Title (reference translation): MTLLFM: Multimodal-Temporal Laughter Localization: UR-FUNNY-Temporal and SMILE-Temporal Benchmarks with an Adaptive Multimodal Fusion Model
Authors: Eyal Hanania, Nadav Kirsch, Daniel Arkushin, Jonathan Benvenisti, Amos Bercovich, Elie Zemmour, Sahar Froim
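Abstract summary: We introduce the UR-FUNNY-Temporal and SMILE-Temporal datasets, which extend two humor benchmarks. Our annotations cover 11,053 videos (78.8 hours) and provide precise onset/offset boundaries for each laughter event. The architecture combines HuBERT and MAE encoders with temporal softmax pooling and adaptive modality gating to learn fine-grained temporal grounding from clip-level labels.
Reference score (independently computed attention score): 0.0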
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Detecting laughter in video is essential for affective computing and narrative understanding, yet existing approaches treat it as coarse clip-level classification, failing to capture precise temporal boundaries of brief, transient laughter events. We address this gap with two complementary contributions. First, we introduce UR-FUNNY-Temporal and SMILE-Temporal, fully annotated temporal laughter datasets extending two widely used humor benchmarks. Our annotations cover over 11,053 videos (78.8 hours) and provide precise onset/offset boundaries for each laughter event, along with rich metadata distinguishing speaker vs. audience laughter, modality dominance (acoustic, visual, or both), and intensity levels. Second, we propose a lightweight weakly-supervised framework for temporal laughter localization. Our architecture combines fixed HuBERT and MAE encoders with temporal softmax pooling and adaptive modality gating, learning fine-grained temporal grounding from clip-level labels without requiring frame-level annotations during training. Experiments across three datasets demonstrate that our approach substantially outperforms multimodal foundation models including Gemini 3 Flash, achieving 99% F1 and 68.1% localization precision on sports broadcast data. Ablations validate each architectural component. Furthermore, our precise temporal tags improve downstream laughter reasoning by 227% on CIDEr, enabling GPT-3.5 to outperform GPT-4o. The code and the UR-FUNNY-Temporal and SMILE-Temporal datasets are publicly available at https://github.com/WSCSports/MTLLFM-temporal-laughter-localization.
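Abstract (reference translation): Detecting laughter in video is essential for affective computing and narrative understanding, but existing methods treat it as coarse clip-level classification and do not capture the precise temporal boundaries of brief, transient laughter events. We address this gap with two complementary contributions. First, we introduce UR-FUNNY-Temporal and SMILE-Temporal, temporal laughter datasets that extend two humor benchmarks. Our annotations cover 11,053 videos (78.8 hours) and provide precise onset/offset boundaries for each laughter event, together with rich metadata distinguishing speaker and audience laughter, modality dominance (acoustic, visual, or both), and intensity levels. Second, we propose a lightweight weakly supervised framework for temporal laughter localization. The architecture combines HuBERT and MAE encoders with temporal softmax pooling and adaptive modality gating, learning fine-grained temporal grounding from clip-level labels without requiring frame-level annotations. Experiments across three datasets show that our approach substantially outperforms multimodal foundation models, including Gemini 3 Flash, achieving 99% F1 and 68.1% localization precision on sports broadcast data. Ablations validate each architectural component. Furthermore, the precise temporal tags improve downstream laughter reasoning by 227% on CIDEr, enabling GPT-3.5 to outperform GPT-4o. The UR-FUNNY-Temporal and SMILE-Temporal datasets are available at https://github.com/WSCSports/MTLLFM-temporal-laughter-localization.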
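
Illustrative sketch (not from the paper): the abstract names temporal softmax pooling and adaptive modality gating but gives no equations, so the following Python is only one plausible reading. It assumes a per-frame sigmoid gate that mixes audio and visual laughter logits, and a softmax over time that pools frame logits into a single clip-level logit, which is what a clip-level label could supervise. All function names, the gate form, the temperature, and the threshold are hypothetical, and hand-written per-frame logits stand in for HuBERT and MAE encoder outputs.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_frame_scores(audio_scores, visual_scores, gate_logits):
    """Assumed gating: per-frame convex mix of audio and visual logits.

    A gate weight near 1 trusts audio, near 0 trusts vision.
    """
    fused = []
    for a, v, g in zip(audio_scores, visual_scores, gate_logits):
        w = sigmoid(g)
        fused.append(w * a + (1.0 - w) * v)
    return fused


def temporal_softmax_pool(frame_scores, temperature=1.0):
    """Pool frame logits into one clip logit using softmax-over-time weights.

    High-scoring frames dominate the clip score, so training on clip
    labels alone still pushes individual frame scores up or down.
    """
    weights = softmax([s / temperature for s in frame_scores])
    clip_score = sum(w * s for w, s in zip(weights, frame_scores))
    return clip_score, weights


def localize(frame_scores, threshold=0.0):
    """Return (start, end) frame-index pairs, end exclusive, above threshold."""
    segments = []
    start = None
    for i, s in enumerate(frame_scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(frame_scores)))
    return segments
```

At inference time, under these assumptions, the fused per-frame scores would be thresholded with `localize` to obtain onset/offset indices, while only the pooled clip score is compared against a label during training.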

Related papers