Fugu-MT Paper Translation (Abstract): MM-HSD: Multi-Modal Hate Speech Detection in Videos

Paper summary: MM-HSD: Multi-Modal Hate Speech Detection in Videos

arxiv url: http://arxiv.org/abs/2508.20546v1
Date: Thu, 28 Aug 2025 08:36:35 GMT
Status: Translation complete
Last updated in system: 2025-08-29 18:12:02.228803
Title: MM-HSD: Multi-Modal Hate Speech Detection in Videos
Title (reference translation): MM-HSD: Multi-Modal Hate Speech Detection in Videos
Authors: Berta Céspedes-Sarrias, Carlos Collado-Capell, Pablo Rodenas-Ruiz, Olena Hrynenko, Andrea Cavallaro
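Abstract summary: The paper presents MM-HSD, a multi-modal model for hate speech detection in videos. It integrates video frames, audio, and text derived from speech transcripts and from frames (i.e., on-screen text) with features extracted by Cross-Modal Attention (CMA). Performance improves when on-screen text is used as the query and the remaining modalities serve as the key.
Reference score (independently computed attention metric): 13.518681647462627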
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While hate speech detection (HSD) has been extensively studied in text, existing multi-modal approaches remain limited, particularly in videos. As modalities are not always individually informative, simple fusion methods fail to fully capture inter-modal dependencies. Moreover, previous work often omits relevant modalities such as on-screen text and audio, which may contain subtle hateful content and thus provide essential cues, both individually and in combination with others. In this paper, we present MM-HSD, a multi-modal model for HSD in videos that integrates video frames, audio, and text derived from speech transcripts and from frames (i.e., on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for HSD in videos, to systematically compare query/key configurations, and to evaluate the interactions between different modalities in the CMA block. Our approach leads to improved performance when on-screen text is used as a query and the rest of the modalities serve as a key. Experiments on the HateMM dataset show that MM-HSD outperforms state-of-the-art methods on M-F1 score (0.874), using concatenation of transcript, audio, video, on-screen text, and CMA for feature extraction on raw embeddings of the modalities. The code is available at https://github.com/idiap/mm-hsd
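Abstract (reference translation): Hate speech detection (HSD) has been studied extensively in text, but existing multi-modal approaches remain limited, particularly for videos. Because modalities are not always individually informative, simple fusion methods cannot fully capture inter-modal dependencies. In addition, previous work often omits relevant modalities such as on-screen text and audio, which may contain subtle hateful content and therefore provide essential cues, both individually and in combination with others. This paper presents MM-HSD, a multi-modal model for HSD in videos that integrates video frames, audio, and text derived from speech transcripts and from frames (i.e., on-screen text) with features extracted by Cross-Modal Attention (CMA). The authors are the first to use CMA as an early feature extractor for HSD in videos, to systematically compare query/key configurations, and to evaluate the interactions between different modalities in the CMA block. Their approach improves performance when on-screen text is used as the query and the remaining modalities serve as the key. Experiments on the HateMM dataset show that MM-HSD outperforms state-of-the-art methods on M-F1 score (0.874), using the concatenation of transcript, audio, video, on-screen text, and CMA features extracted from the raw embeddings of the modalities. The code is available at https://github.com/idiap/mm-hsd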
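
Illustrative sketch (not the authors' implementation; see the repository above for that): the abstract describes using on-screen text as the query and the remaining modalities as the key in a Cross-Modal Attention block, then concatenating the CMA features with the per-modality embeddings. The toy example below shows that idea with plain scaled dot-product attention. The function names, the 4-dimensional toy embeddings, the mean pooling, and the omission of learned projection layers are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query, key_value):
    """Scaled dot-product attention of each query vector over key_value.

    query:     n_q vectors of dimension d (here: on-screen text embeddings)
    key_value: n_k vectors of dimension d (here: the other modalities)
    Returns n_q attended vectors of dimension d.
    Assumption: learned query/key/value projections are omitted.
    """
    d = len(query[0])
    attended = []
    for q in query:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in key_value
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[j] for w, k in zip(weights, key_value)) for j in range(d)]
        )
    return attended


def mean_pool(vectors):
    """Average a list of vectors into a single vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


# Hypothetical toy embeddings, one list of vectors per modality.
ocr = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # on-screen text
transcript = [[0.5, 0.5, 0.0, 0.0]]
audio = [[0.0, 0.0, 1.0, 0.0]]
video = [[0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]

# On-screen text queries attend over the remaining modalities.
keys = transcript + audio + video
cma_features = mean_pool(cross_modal_attention(ocr, keys))

# Concatenate pooled per-modality embeddings with the CMA features.
fused = (
    mean_pool(transcript)
    + mean_pool(audio)
    + mean_pool(video)
    + mean_pool(ocr)
    + cma_features
)
print(len(fused))  # 5 blocks of dimension 4
```

In a real model the fused vector would feed a trained classifier, and the query, key, and value would pass through learned projections first; this sketch only makes the query/key roles concrete.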

Related papers