Fugu-MT Paper Translation (Abstract): Semantic Frame Aggregation-based Transformer for Live Video Comment Generation

Paper summary: Semantic Frame Aggregation-based Transformer for Live Video Comment Generation

arxiv url: http://arxiv.org/abs/2510.26978v1
Date: Thu, 30 Oct 2025 20:01:04 GMT
Status: Translation complete
System update date: 2025-11-03 17:52:15.90628
Title: Semantic Frame Aggregation-based Transformer for Live Video Comment Generation
Title (reference translation): Semantic Frame Aggregation-based Transformer for Video Comment Generation
Authors: Anam Fatima, Yi Yu, Janak Kapuriya, Julien Lalanne, Jainendra Shukla
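Abstract summary: This paper proposes a new model for generating contextually appropriate video comments on live video streams. It uses CLIP's visual-text multimodal knowledge to assign weights to video frames based on their semantic relevance to the ongoing viewer conversation. A comment decoder with a cross-attention mechanism ensures that generated comments reflect contextual cues from both chat and video.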
Reference score (independently computed attention score): 10.604889675520925
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect: prioritizing the video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to the ongoing viewer conversation. It employs an efficient weighted sum of frames to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder, with a cross-attention mechanism that attends to each modality, ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large-scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.
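Abstract (reference translation): Live commenting on video streams has become popular on platforms such as Twitch, raising viewer engagement through dynamic interaction. However, automatically generating contextually appropriate comments is a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook the important matter of prioritizing the video frames most relevant to ongoing viewer interactions. This prioritization is essential for generating contextually appropriate comments. To address this gap, we propose a new Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. The method not only uses CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to the viewer conversation. It uses an efficient weighted sum of frames to emphasize informative frames while placing less focus on irrelevant ones. Finally, a comment decoder with a cross-attention mechanism that attends to each modality ensures that the generated comment reflects contextual cues from both chat and video. Furthermore, to address the limitations of existing datasets, which focus mainly on Chinese-language content with limited video categories, we constructed a large-scale, diverse, multimodal English video comment dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of the SFAT model by comparing it with existing methods that generate comments from live video and ongoing conversation context.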
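
The frame-weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so everything specific here is an assumption made for illustration: the function name `aggregate_frames`, the use of cosine similarity between each frame embedding and a single chat-context embedding, the softmax normalization, and the `temperature` parameter. The embeddings are stand-in arrays, not real CLIP outputs, and this should not be read as the authors' SFAT implementation.

```python
import numpy as np


def aggregate_frames(frame_emb, chat_emb, temperature=1.0):
    """Combine frame embeddings into one vector, weighted by chat relevance.

    frame_emb: array of shape (num_frames, dim), e.g. image embeddings.
    chat_emb:  array of shape (dim,), e.g. a pooled text embedding of recent chat.
    Returns (video_repr, weights).

    Assumption: relevance = cosine similarity, weights = softmax(relevance).
    The paper's actual scoring function may differ.
    """
    frame_emb = np.asarray(frame_emb, dtype=float)
    chat_emb = np.asarray(chat_emb, dtype=float)
    eps = 1e-12  # guards against division by zero for all-zero vectors

    # Cosine similarity between every frame and the chat context.
    frame_unit = frame_emb / (np.linalg.norm(frame_emb, axis=1, keepdims=True) + eps)
    chat_unit = chat_emb / (np.linalg.norm(chat_emb) + eps)
    sims = frame_unit @ chat_unit

    # Numerically stable softmax turns similarities into weights that sum to 1.
    logits = sims / temperature
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()

    # Weighted sum of frames: relevant frames contribute more.
    video_repr = weights @ frame_emb
    return video_repr, weights


# Toy example with 3 two-dimensional "frames" and a chat vector aligned with frame 0.
frames = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
chat = np.array([1.0, 0.0])
video_repr, weights = aggregate_frames(frames, chat)
print(weights)     # frame 0 gets the largest weight, frame 1 the smallest
print(video_repr)  # a single 2-dimensional video representation
```

A lower `temperature` in this sketch would concentrate the weight on the most chat-relevant frames, while a higher one would move the result toward a plain average of all frames.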

Related papers