Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion
- URL: http://arxiv.org/abs/2505.12051v1
- Date: Sat, 17 May 2025 15:24:48 GMT
- Title: Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion
- Authors: Yinghui Zhang, Tailin Chen, Yuchen Zhang, Zeyu Fu
- Abstract summary: The rapid rise of video content on platforms such as TikTok and YouTube has transformed information dissemination. Despite significant efforts to combat hate speech, detecting these videos remains challenging due to their often implicit nature. We present CMFusion, an enhanced multimodal hate video detection model utilizing a novel Channel-wise and Modality-wise Fusion Mechanism.
- Score: 7.728348842555291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid rise of video content on platforms such as TikTok and YouTube has transformed information dissemination, but it has also facilitated the spread of harmful content, particularly hate videos. Despite significant efforts to combat hate speech, detecting these videos remains challenging due to their often implicit nature. Current detection methods primarily rely on unimodal approaches, which inadequately capture the complementary features across different modalities. While multimodal techniques offer a broader perspective, many fail to effectively integrate temporal dynamics and modality-wise interactions essential for identifying nuanced hate content. In this paper, we present CMFusion, an enhanced multimodal hate video detection model utilizing a novel Channel-wise and Modality-wise Fusion Mechanism. CMFusion first extracts features from text, audio, and video modalities using pre-trained models and then incorporates a temporal cross-attention mechanism to capture dependencies between video and audio streams. The learned features are then processed by channel-wise and modality-wise fusion modules to obtain informative representations of videos. Our extensive experiments on a real-world dataset demonstrate that CMFusion significantly outperforms five widely used baselines in terms of accuracy, precision, recall, and F1 score. Comprehensive ablation studies and parameter analyses further validate our design choices, highlighting the model's effectiveness in detecting hate videos. The source code will be made publicly available at https://github.com/EvelynZ10/cmfusion.
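To make the described pipeline concrete, here is a minimal PyTorch sketch of a CMFusion-style architecture: pre-extracted text, video, and audio features pass through temporal cross-attention between the video and audio streams, then through channel-wise and modality-wise fusion before classification. All module designs, dimensions, and names below are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# A minimal sketch of a CMFusion-style pipeline; all details are assumptions.
import torch
import torch.nn as nn


class TemporalCrossAttention(nn.Module):
    """One modality's time steps attend to another's (e.g. video -> audio)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_seq, kv_seq):
        # query_seq: (B, T_q, D), kv_seq: (B, T_kv, D)
        fused, _ = self.attn(query_seq, kv_seq, kv_seq)
        return fused


class ChannelWiseFusion(nn.Module):
    """Re-weights feature channels with a squeeze-and-excitation-style gate."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)  # x: (B, D)


class ModalityWiseFusion(nn.Module):
    """Learns a softmax weight per modality and sums the weighted features."""

    def __init__(self, num_modalities: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, feats):
        weights = torch.softmax(self.logits, dim=0)
        return sum(w * f for w, f in zip(weights, feats))


class CMFusionSketch(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.video_to_audio = TemporalCrossAttention(dim)
        self.audio_to_video = TemporalCrossAttention(dim)
        self.channel_fusion = ChannelWiseFusion(dim)
        self.modality_fusion = ModalityWiseFusion(num_modalities=3)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feat, video_seq, audio_seq):
        # text_feat: (B, D) pooled embedding from a pre-trained text encoder;
        # video_seq, audio_seq: (B, T, D) frame/segment-level features.
        v = self.video_to_audio(video_seq, audio_seq).mean(dim=1)
        a = self.audio_to_video(audio_seq, video_seq).mean(dim=1)
        feats = [self.channel_fusion(f) for f in (text_feat, v, a)]
        return self.classifier(self.modality_fusion(feats))


# Usage with random stand-in features (in practice these would come from
# pre-trained text, video, and audio encoders):
model = CMFusionSketch()
logits = model(torch.randn(2, 256), torch.randn(2, 32, 256), torch.randn(2, 48, 256))
```

The squeeze-and-excitation-style channel gate and the learned softmax weights over modalities are common ways to realize channel-wise and modality-wise fusion, respectively; the paper's exact formulations may differ.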
Related papers
- Consistency-aware Fake Videos Detection on Short Video Platforms [4.291448222735821]
This paper focuses on detecting fake news on short video platforms. Existing approaches typically combine raw video data with metadata inputs before applying a classification layer. Motivated by this insight, we propose a novel detection paradigm that explicitly identifies and leverages cross-modal contradictions.
arXiv Detail & Related papers (2025-04-30T10:26:04Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - CFSum: A Transformer-Based Multi-Modal Video Summarization Framework With Coarse-Fine Fusion [22.58710742780161]
CFSum is a transformer-based multi-modal video summarization framework with coarse-fine fusion. CFSum takes video, text, and audio features as input and incorporates a two-stage transformer-based feature fusion framework.
arXiv Detail & Related papers (2025-03-01T06:13:13Z) - Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content [7.5253808885104325]
Social media platforms enable the propagation of hateful content across different modalities. Recent approaches have shown promise in handling individual modalities, but their effectiveness across different modality combinations remains unexplored. This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content.
arXiv Detail & Related papers (2025-02-11T00:07:40Z) - Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation [54.21476271127356]
Divot is a Diffusion-Powered Video Tokenizer. We present Divot-Vicuna for video-to-text autoregression and text-to-video generation.
arXiv Detail & Related papers (2024-12-05T18:53:04Z) - On the Consistency of Video Large Language Models in Temporal Comprehension [57.985769348320616]
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. We conduct a study on prediction consistency, a key indicator of the robustness and trustworthiness of temporal grounding.
arXiv Detail & Related papers (2024-11-20T00:47:17Z) - On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection [44.55891118519547]
We propose an innovative algorithm named Multi-Modal Detection (MM-Det) for detecting diffusion-generated content. MM-Det utilizes the profound and comprehensive abilities of Large Multi-modal Models (LMMs) by generating a Multi-Modal Forgery Representation (MMFR). MM-Det achieves state-of-the-art performance on the Diffusion Video Forensics (DVF) dataset.
arXiv Detail & Related papers (2024-10-31T04:20:47Z) - A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input (a minimal modality-dropout sketch appears after this list).
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z) - AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos only utilizes the visual or audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z) - Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and the query. Evaluation on an existing multi-modal video summarization dataset shows that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z) - Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion [30.631733395175765]
Video multimodal fusion aims to integrate multimodal signals in videos.
Videos have longer multimodal sequences with more redundancy and noise in the visual and audio modalities.
We propose a denoising bottleneck fusion model for fine-grained video fusion.
arXiv Detail & Related papers (2023-05-24T02:39:43Z) - Predicting the Popularity of Micro-videos with Multimodal Variational Encoder-Decoder Framework [54.194340961353944]
We propose a multimodal variational encoder-decoder (MMVED) framework for micro-video popularity prediction. MMVED learns a prediction embedding of a micro-video that is informative about its popularity level.
Experiments conducted on a public dataset and a dataset we collect from Xigua demonstrate the effectiveness of the proposed MMVED framework.
arXiv Detail & Related papers (2020-03-28T06:08:16Z)
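As a side note on the dropout-based robustness technique discussed in the audio-visual speech recognition entry above, a minimal sketch of video-modality dropout might look as follows. The drop probability, feature shapes, and concatenation-based fusion are assumptions for illustration, not the MDA-KD implementation.

```python
# A minimal sketch of video-modality dropout for an audio-visual model.
# The drop probability and concatenation-based fusion are illustrative
# assumptions; this is not the MDA-KD implementation.
import torch
import torch.nn as nn


class ModalityDropoutFusion(nn.Module):
    """Randomly zeroes the whole video stream during training so the model
    also learns to work from audio alone (robustness to missing frames)."""

    def __init__(self, dim: int, p_drop_video: float = 0.3):
        super().__init__()
        self.p_drop_video = p_drop_video
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio_feat, video_feat):
        # audio_feat, video_feat: (B, T, D) frame-level features
        if self.training and torch.rand(()).item() < self.p_drop_video:
            video_feat = torch.zeros_like(video_feat)  # simulate missing frames
        return self.fuse(torch.cat([audio_feat, video_feat], dim=-1))
```

As that entry notes, training this way trades some full-modality accuracy for robustness to missing video, which is the modality bias that MDA-KD is designed to mitigate.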
This list is automatically generated from the titles and abstracts of the papers on this site.