Multi-Modal Discussion Transformer: Integrating Text, Images and Graph
Transformers to Detect Hate Speech on Social Media
- URL: http://arxiv.org/abs/2307.09312v4
- Date: Thu, 22 Feb 2024 06:17:39 GMT
- Title: Multi-Modal Discussion Transformer: Integrating Text, Images and Graph
Transformers to Detect Hate Speech on Social Media
- Authors: Liam Hebert, Gaurav Sahu, Yuxuan Guo, Nanda Kishore Sreenivas, Lukasz
Golab, Robin Cohen
- Abstract summary: We present the Multi-Modal Discussion Transformer (mDT), a novel method for detecting hate speech in online social networks such as Reddit discussions.
In contrast to traditional comment-only methods, our approach to labelling a comment as hate speech involves a holistic analysis of text and images grounded in the discussion context.
This is done by leveraging graph transformers to capture the contextual relationships in the discussion surrounding a comment and grounding the interwoven fusion layers that combine text and image embeddings instead of processing modalities separately.
- Score: 6.3756400508728515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the Multi-Modal Discussion Transformer (mDT), a novel method for
detecting hate speech in online social networks such as Reddit discussions. In
contrast to traditional comment-only methods, our approach to labelling a
comment as hate speech involves a holistic analysis of text and images grounded
in the discussion context. This is done by leveraging graph transformers to
capture the contextual relationships in the discussion surrounding a comment
and grounding the interwoven fusion layers that combine text and image
embeddings instead of processing modalities separately. To evaluate our work,
we present a new dataset, HatefulDiscussions, comprising complete multi-modal
discussions from multiple online communities on Reddit. We compare the
performance of our model to baselines that process only individual comments, and
we conduct extensive ablation studies.
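To make the pipeline described in the abstract concrete, the following is a minimal illustrative sketch in PyTorch: per-comment text and image embeddings are fused, and a transformer then attends across all comments in a discussion so that each comment's prediction is informed by its context. The class name, embedding dimensions, and the use of plain self-attention over the thread (rather than a structure-aware graph transformer) are assumptions for illustration only, not the authors' released implementation.

import torch
import torch.nn as nn

class DiscussionHateClassifierSketch(nn.Module):
    # Hypothetical sketch of context-aware multi-modal hate speech classification;
    # not the mDT architecture itself.
    def __init__(self, text_dim=768, image_dim=512, hidden_dim=256, heads=4, layers=2):
        super().__init__()
        # Fuse modalities: project text and image embeddings into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.fuse = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())
        # Contextualize: self-attention over every comment in the discussion
        # (a simplification of a graph transformer that would also use reply edges).
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=heads, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=layers)
        self.classify = nn.Linear(hidden_dim, 2)  # hate speech vs. not hate speech

    def forward(self, text_emb, image_emb):
        # text_emb: (num_comments, text_dim); image_emb: (num_comments, image_dim)
        fused = self.fuse(torch.cat([self.text_proj(text_emb), self.image_proj(image_emb)], dim=-1))
        contextual = self.context(fused.unsqueeze(0)).squeeze(0)  # attend across the thread
        return self.classify(contextual)  # per-comment logits

# Toy usage on a discussion of 5 comments with random embeddings.
model = DiscussionHateClassifierSketch()
logits = model(torch.randn(5, 768), torch.randn(5, 512))
print(logits.shape)  # torch.Size([5, 2])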
Related papers
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- Lexical Squad@Multimodal Hate Speech Event Detection 2023: Multimodal Hate Speech Detection using Fused Ensemble Approach [0.23020018305241333]
We present our novel ensemble learning approach for detecting hate speech by classifying text-embedded images into two labels, namely "Hate Speech" and "No Hate Speech".
Our proposed ensemble model yielded promising results, with an accuracy of 75.21 and an F1 score of 74.96.
arXiv Detail & Related papers (2023-09-23T12:06:05Z)
- Multi-turn Dialogue Comprehension from a Topic-aware Perspective [70.37126956655985]
This paper proposes to model multi-turn dialogues from a topic-aware perspective.
We use a dialogue segmentation algorithm to split a dialogue passage into topic-concentrated fragments in an unsupervised way.
We also present a novel model, Topic-Aware Dual-Attention Matching (TADAM) Network, which takes topic segments as processing elements.
arXiv Detail & Related papers (2023-09-18T11:03:55Z)
- Composition and Deformance: Measuring Imageability with a Text-to-Image Model [8.008504325316327]
We propose methods that use generated images to measure the imageability of single English words and connected text.
We find high correlation between the proposed computational measures of imageability and human judgments of individual words.
We discuss possible effects of model training and implications for the study of compositionality in text-to-image models.
arXiv Detail & Related papers (2023-06-05T18:22:23Z)
- CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech, with absolute improvements ranging from 1.24% to 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z)
- Predicting Hateful Discussions on Reddit using Graph Transformer Networks and Communal Context [9.4337569682766]
We propose a system to predict harmful discussions on social media platforms.
Our solution uses contextual deep language models and integrates state-of-the-art Graph Transformer Networks.
We evaluate our approach on 333,487 Reddit discussions from various communities.
arXiv Detail & Related papers (2023-01-10T23:47:13Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Affective Feedback Synthesis Towards Multimodal Text and Image Data [12.768277167508208]
We have defined a novel task of affective feedback synthesis that deals with generating feedback for an input text and its corresponding image.
A feedback synthesis system has been proposed and trained using ground-truth human comments along with image-text input.
The generated feedback has been analyzed using automatic and human evaluation.
arXiv Detail & Related papers (2022-03-23T19:28:20Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Diversifying Dialogue Generation with Non-Conversational Text [38.03510529185192]
We propose a new perspective to diversify dialogue generation by leveraging non-conversational text.
We collect a large-scale non-conversational corpus from multiple sources, including forum comments, idioms and book snippets.
The resulting model is tested on two conversational datasets and is shown to produce significantly more diverse responses without sacrificing the relevance with context.
arXiv Detail & Related papers (2020-05-09T02:16:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.