Multimodal Hate Detection Using Dual-Stream Graph Neural Networks
- URL: http://arxiv.org/abs/2509.13515v1
- Date: Tue, 16 Sep 2025 20:20:05 GMT
- Title: Multimodal Hate Detection Using Dual-Stream Graph Neural Networks
- Authors: Jiangbei Yue, Shuonan Yang, Tailin Chen, Jianbo Jiao, Zeyu Fu
- Abstract summary: Hateful videos present serious risks to online safety and real-world well-being. Although multimodal classification approaches integrate information from several modalities, they typically neglect that even minimal hateful content defines a video's category. We propose a novel multimodal dual-stream graph neural network model that captures structured information in videos.
- Score: 20.082029756403976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hateful videos present serious risks to online safety and real-world well-being, necessitating effective detection methods. Although multimodal classification approaches that integrate information from several modalities outperform unimodal ones, they typically neglect that even minimal hateful content defines a video's category. Specifically, they generally treat all content uniformly instead of emphasizing the hateful components. Additionally, existing multimodal methods cannot systematically capture structured information in videos, limiting the effectiveness of multimodal fusion. To address these limitations, we propose a novel multimodal dual-stream graph neural network model. It constructs an instance graph by separating the given video into several instances to extract instance-level features. A complementary weight graph then assigns importance weights to these features, highlighting hateful instances. Importance weights and instance features are combined to generate video labels. Our model employs a graph-based framework to systematically model structured relationships within and across modalities. Extensive experiments on public datasets show that our model achieves state-of-the-art performance in hateful video classification and offers strong explainability. Code is available at: https://github.com/Multimodal-Intelligence-Lab-MIL/MultiHateGNN.
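The abstract describes the two streams only at a high level. Below is a minimal sketch of how an instance graph and a complementary weight graph could be combined; the class name, dimensions, fully-connected adjacency, and single linear message-passing layer per stream are illustrative assumptions rather than the authors' implementation (the linked repository contains the actual code).

```python
# Minimal sketch of the dual-stream idea, under stated assumptions:
# instance features are precomputed per-segment embeddings, and both
# streams use one round of GCN-style mean aggregation over a shared graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamSketch(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, num_classes: int = 2):
        super().__init__()
        self.instance_proj = nn.Linear(feat_dim, hidden_dim)  # instance-graph stream
        self.weight_proj = nn.Linear(feat_dim, hidden_dim)    # weight-graph stream
        self.weight_head = nn.Linear(hidden_dim, 1)           # per-instance importance
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_instances, feat_dim); adj: (num_instances, num_instances)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = (adj @ x) / deg                       # mean aggregation over neighbors
        h = F.relu(self.instance_proj(agg))         # refined instance features
        w = torch.sigmoid(self.weight_head(F.relu(self.weight_proj(agg))))
        # Weighted pooling so highly-weighted (hateful) instances dominate.
        video_repr = (w * h).sum(dim=0) / w.sum().clamp(min=1e-6)
        return self.classifier(video_repr)

# Usage: 8 instances (e.g., video segments) with a fully-connected graph.
logits = DualStreamSketch(256, 128)(torch.randn(8, 256), torch.ones(8, 8))
```

The key property the sketch preserves is that the classifier sees a weighted combination of instance features, so a few high-weight instances can determine the video label, matching the abstract's point that even minimal hateful content defines a video's category.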
Related papers
- Language-Guided Graph Representation Learning for Video Summarization [96.2763459348758]
We propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies. Our method outperforms existing approaches across multiple benchmarks.
arXiv Detail & Related papers (2025-11-14T04:35:48Z) - Show-o2: Improved Native Unified Multimodal Models [57.34173415412808]
Show-o2 is a native unified multimodal model that leverages autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion.
arXiv Detail & Related papers (2025-06-18T15:39:15Z) - Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion [7.728348842555291]
The rapid rise of video content on platforms such as TikTok and YouTube has transformed information dissemination. Despite significant efforts to combat hate speech, detecting these videos remains challenging due to their often implicit nature. We present CMFusion, an enhanced multimodal hate video detection model utilizing a novel Channel-wise and Modality-wise Fusion Mechanism.
arXiv Detail & Related papers (2025-05-17T15:24:48Z) - Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content [7.5253808885104325]
Social media platforms enable the propagation of hateful content across different modalities. Recent approaches have shown promise in handling individual modalities, but their effectiveness across different modality combinations remains unexplored. This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content.
arXiv Detail & Related papers (2025-02-11T00:07:40Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z) - Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos [0.40778318140713216]
This study introduces a graph-structured approach named Semantic2Graph to model long-term dependencies in videos.
We have designed positive and negative semantic edges, accompanied by corresponding edge weights, to capture both long-term and short-term semantic relationships in video actions.
arXiv Detail & Related papers (2022-09-13T00:01:23Z) - Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model that learns from narration supervision and utilizes multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z) - Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self- and cross-integration for different sources (video and dense captions), and gates that pass more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z) - Cross-modal Learning for Multi-modal Video Categorization [24.61762520189921]
Multi-modal machine learning (ML) models can process data in multiple modalities.
In this paper, we focus on the problem of video categorization using a multi-modal ML technique.
We show how our proposed multi-modal video categorization models with cross-modal learning outperform strong state-of-the-art baseline models.
arXiv Detail & Related papers (2020-03-07T03:21:15Z) - Exploiting Temporal Coherence for Multi-modal Video Categorization [24.61762520189921]
In this paper, we focus on the problem of video categorization by using a multimodal approach.
We have developed a novel temporal coherence-based regularization approach, which applies to different types of models (a generic form is sketched after this entry).
We demonstrate through experiments how our proposed multimodal video categorization models with temporal coherence outperform strong state-of-the-art baseline models.
arXiv Detail & Related papers (2020-02-07T06:42:12Z)
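The summary above names a temporal coherence-based regularizer without specifying its form. The following is a generic smoothness penalty of that flavor, assumed purely for illustration (the function name and the 0.1 weight are hypothetical, not the paper's formulation):

```python
# Hypothetical temporal-coherence regularizer: penalize abrupt changes
# between representations of adjacent frames.
import torch

def temporal_coherence_loss(frame_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats: (T, D) per-frame feature vectors in temporal order."""
    diffs = frame_feats[1:] - frame_feats[:-1]  # (T-1, D) adjacent deltas
    return diffs.pow(2).mean()

# Added to the task loss with a small weight, e.g.:
# loss = task_loss + 0.1 * temporal_coherence_loss(frame_feats)
```

Because the penalty only needs per-frame features, it can be attached to many architectures, consistent with the summary's claim that the approach applies to different types of models.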