MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework
- URL: http://arxiv.org/abs/2409.05136v2
- Date: Tue, 17 Sep 2024 09:50:45 GMT
- Title: MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework
- Authors: Anusha Chhabra, Dinesh Kumar Vishwakarma
- Abstract summary: This article proposes a scalable architecture for multimodal hate content detection called transformer-based multilevel attention (STMA).
It consists of three main parts: a combined attention-based deep learning mechanism, a vision attention-mechanism encoder, and a caption attention-mechanism encoder.
Several experiments using multiple evaluation criteria on three hate speech datasets (Hateful Memes, MultiOff, and MMHS150K) validate the proposed architecture's efficacy.
- Score: 15.647035299476894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media has a significant impact on people's lives. Hate speech on social media has emerged as one of society's most serious issues in recent years. Text and pictures are two forms of multimodal data that are distributed within articles. Earlier approaches have focused primarily on unimodal analysis. Moreover, when performing multimodal analysis, researchers often fail to preserve the distinctive qualities of each modality. To address these shortcomings, the present article proposes a scalable architecture for multimodal hate content detection called transformer-based multilevel attention (STMA). This architecture consists of three main parts: a combined attention-based deep learning mechanism, a vision attention-mechanism encoder, and a caption attention-mechanism encoder. To identify hate content, each component uses different attention processes and handles multimodal data in its own way. Several experiments using multiple evaluation criteria on three hate speech datasets (Hateful Memes, MultiOff, and MMHS150K) validate the proposed architecture's efficacy. The results show that the proposed approach outperforms the baseline approaches on all three datasets.
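
The three-part structure named in the abstract (a vision attention encoder, a caption attention encoder, and a combined attention mechanism) can be illustrated with a minimal sketch. The abstract gives no layer-level details, so everything below is an assumption for illustration only: the function names, the use of plain scaled dot-product attention, the mean pooling, and the toy feature vectors are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Generic scaled dot-product attention on lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        width = len(values[0])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs


def mean_pool(vectors):
    """Average a list of vectors into a single vector."""
    count = len(vectors)
    return [sum(v[j] for v in vectors) / count for j in range(len(vectors[0]))]


def multilevel_attention_sketch(image_feats, caption_feats):
    """Hypothetical stand-in for the three components named in the abstract."""
    # Assumed "vision attention encoder": self-attention over image regions.
    vision = attention(image_feats, image_feats, image_feats)
    # Assumed "caption attention encoder": self-attention over caption tokens.
    caption = attention(caption_feats, caption_feats, caption_feats)
    # Assumed "combined attention": caption tokens attend to image regions.
    fused = attention(caption, vision, vision)
    # Concatenate pooled per-modality and fused features for a classifier head.
    return mean_pool(vision) + mean_pool(caption) + mean_pool(fused)


# Toy inputs (made up): three image-region vectors and two caption-token vectors.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
caption_feats = [[0.5, 0.5], [1.0, 0.0]]
features = multilevel_attention_sketch(image_feats, caption_feats)
print(len(features))
```

Keeping the per-modality encoders separate before the fusion step mirrors the abstract's stated goal of preserving the distinctive qualities of each modality; in a real model the concatenated vector would feed a trained classification layer.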
 
      
Related papers
- MGCR-Net: Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
 We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data.
Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
 arXiv  Detail & Related papers  (2025-08-03T02:50:08Z)
- Multimodal Referring Segmentation: A Survey [93.24051010753817]
 Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format.
Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models.
 arXiv  Detail & Related papers  (2025-08-01T02:14:00Z)
- METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
 We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content.
Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
 arXiv  Detail & Related papers  (2025-07-22T03:42:51Z)
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
 Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.
However, the limited labeled multimodal data often hinders embedding performance.
Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
 arXiv  Detail & Related papers  (2025-02-12T15:03:33Z)
- Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content [7.5253808885104325]
 Social media platforms enable the propagation of hateful content across different modalities.
Recent approaches have shown promise in handling individual modalities, but their effectiveness across different modality combinations remains unexplored.
This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content.
 arXiv  Detail & Related papers  (2025-02-11T00:07:40Z)
- Multi-modal Stance Detection: New Datasets and Model [56.97470987479277]
 We study multi-modal stance detection for tweets consisting of texts and images.
We propose a simple yet effective Targeted Multi-modal Prompt Tuning framework (TMPT).
TMPT achieves state-of-the-art performance in multi-modal stance detection.
 arXiv  Detail & Related papers  (2024-02-22T05:24:19Z)
- Detecting and Grounding Multi-Modal Media Manipulation and Beyond [93.08116982163804]
 We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4).
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
 arXiv  Detail & Related papers  (2023-09-25T15:05:46Z)
- Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023 [51.95161901441527]
 In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions.
Deep features extracted from foundation models are used as robust acoustic and visual representations of raw video.
Our final system achieves state-of-the-art performance and ranks third on the leaderboard on MER-MULTI sub-challenge.
 arXiv  Detail & Related papers  (2023-09-11T03:19:10Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
 Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
 arXiv  Detail & Related papers  (2023-07-19T02:11:19Z)
- Detecting and Grounding Multi-Modal Media Manipulation [32.34908534582532]
 We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4).
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
 arXiv  Detail & Related papers  (2023-04-05T16:20:40Z)
- Multi-modal Fake News Detection on Social Media via Multi-grained Information Fusion [21.042970740577648]
 We present a Multi-grained Multi-modal Fusion Network (MMFN) for fake news detection.
Inspired by the multi-grained process of human assessment of news authenticity, we respectively employ two Transformer-based pre-trained models to encode token-level features from text and images.
The multi-modal module fuses fine-grained features, taking into account coarse-grained features encoded by the CLIP encoder.
 arXiv  Detail & Related papers  (2023-04-03T09:13:59Z)
- Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention [7.997124140597719]
 This study focuses on the sentiment analysis of videos containing time series data of multiple modalities.
The key problem is how to fuse these heterogeneous data.
Based on bimodal interaction, more important bimodal features are assigned larger weights.
 arXiv  Detail & Related papers  (2021-03-03T12:30:11Z)
- Detecting Hate Speech in Multi-modal Memes [14.036769355498546]
 We focus on hate speech detection in multi-modal memes, which pose an interesting multi-modal fusion problem.
We aim to solve the Facebook Meme Challenge (Kiela et al., 2020), a binary classification problem of predicting whether a meme is hateful or not.
 arXiv  Detail & Related papers  (2020-12-29T18:30:00Z)
- A Multimodal Framework for the Detection of Hateful Memes [16.7604156703965]
 We aim to develop a framework for the detection of hateful memes.
We show the effectiveness of upsampling of contrastive examples to encourage multimodality and ensemble learning.
Our best approach comprises an ensemble of UNITER-based models and achieves an AUROC score of 80.53, placing us 4th on phase 2 of the 2020 Hateful Memes Challenge organized by Facebook.
 arXiv  Detail & Related papers  (2020-12-23T18:37:11Z)
- Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
 We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
 arXiv  Detail & Related papers  (2020-11-03T08:44:18Z)
- Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
 We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
 arXiv  Detail & Related papers  (2020-04-10T06:31:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     