MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework
- URL: http://arxiv.org/abs/2409.05136v2
- Date: Tue, 17 Sep 2024 09:50:45 GMT
- Title: MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework
- Authors: Anusha Chhabra, Dinesh Kumar Vishwakarma,
- Abstract summary: This article proposes a scalable architecture for multimodal hate content detection called transformer-based multilevel attention (STMA)
It consists of three main parts: a combined attention-based deep learning mechanism, a vision attention-mechanism encoder, and a caption attention-mechanism encoder.
Several studies employing multiple assessment criteria on three hate speech datasets such as Hateful memes, MultiOff, and MMHS150K, validate the suggested architecture's efficacy.
- Score: 15.647035299476894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media has a significant impact on people's lives. Hate speech on social media has emerged as one of society's most serious issues in recent years. Text and pictures are two forms of multimodal data that are distributed within articles. Unimodal analysis has been the primary emphasis of earlier approaches. Additionally, when doing multimodal analysis, researchers neglect to preserve the distinctive qualities associated with each modality. To address these shortcomings, the present article suggests a scalable architecture for multimodal hate content detection called transformer-based multilevel attention (STMA). This architecture consists of three main parts: a combined attention-based deep learning mechanism, a vision attention-mechanism encoder, and a caption attention-mechanism encoder. To identify hate content, each component uses various attention processes and handles multimodal data in a unique way. Several studies employing multiple assessment criteria on three hate speech datasets such as Hateful memes, MultiOff, and MMHS150K, validate the suggested architecture's efficacy. The outcomes demonstrate that on all three datasets, the suggested strategy performs better than the baseline approaches.
Related papers
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.
However, the limited labeled multimodal data often hinders embedding performance.
Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
arXiv Detail & Related papers (2025-02-12T15:03:33Z) - Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content [7.5253808885104325]
Social media platforms enable the propagation of hateful content across different modalities.
Recent approaches have shown promise in handling individual modalities, but their effectiveness across different modality combinations remains unexplored.
This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content.
arXiv Detail & Related papers (2025-02-11T00:07:40Z) - Multi-modal Stance Detection: New Datasets and Model [56.97470987479277]
We study multi-modal stance detection for tweets consisting of texts and images.
We propose a simple yet effective Targeted Multi-modal Prompt Tuning framework (TMPT)
TMPT achieves state-of-the-art performance in multi-modal stance detection.
arXiv Detail & Related papers (2024-02-22T05:24:19Z) - Detecting and Grounding Multi-Modal Media Manipulation and Beyond [93.08116982163804]
We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4)
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
arXiv Detail & Related papers (2023-09-25T15:05:46Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network $textbf(MIMIC)$ framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Multi-modal Fake News Detection on Social Media via Multi-grained
Information Fusion [21.042970740577648]
We present a Multi-grained Multi-modal Fusion Network (MMFN) for fake news detection.
Inspired by the multi-grained process of human assessment of news authenticity, we respectively employ two Transformer-based pre-trained models to encode token-level features from text and images.
The multi-modal module fuses fine-grained features, taking into account coarse-grained features encoded by the CLIP encoder.
arXiv Detail & Related papers (2023-04-03T09:13:59Z) - Video Sentiment Analysis with Bimodal Information-augmented Multi-Head
Attention [7.997124140597719]
This study focuses on the sentiment analysis of videos containing time series data of multiple modalities.
The key problem is how to fuse these heterogeneous data.
Based on bimodal interaction, more important bimodal features are assigned larger weights.
arXiv Detail & Related papers (2021-03-03T12:30:11Z) - Detecting Hate Speech in Multi-modal Memes [14.036769355498546]
We focus on hate speech detection in multi-modal memes wherein memes pose an interesting multi-modal fusion problem.
We aim to solve the Facebook Meme Challenge citekiela 2020hateful which aims to solve a binary classification problem of predicting whether a meme is hateful or not.
arXiv Detail & Related papers (2020-12-29T18:30:00Z) - A Multimodal Framework for the Detection of Hateful Memes [16.7604156703965]
We aim to develop a framework for the detection of hateful memes.
We show the effectiveness of upsampling of contrastive examples to encourage multimodality and ensemble learning.
Our best approach comprises an ensemble of UNITER-based models and achieves an AUROC score of 80.53, placing us 4th on phase 2 of the 2020 Hateful Memes Challenge organized by Facebook.
arXiv Detail & Related papers (2020-12-23T18:37:11Z) - Cross-Media Keyphrase Prediction: A Unified Framework with
Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z) - Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
arXiv Detail & Related papers (2020-04-10T06:31:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.