Multilingual and Multimodal Abuse Detection
- URL: http://arxiv.org/abs/2204.02263v1
- Date: Sun, 3 Apr 2022 13:28:58 GMT
- Title: Multilingual and Multimodal Abuse Detection
- Authors: Rini Sharon, Heet Shah, Debdoot Mukherjee, Vikram Gupta
- Abstract summary: This paper attempts abuse detection in conversational audio from a multimodal perspective in a multilingual social media setting.
Our proposed method, MADA, explicitly focuses on two modalities other than the audio itself.
We test the proposed approach on 10 different languages and observe consistent gains of 0.6%-5.2% by leveraging multiple modalities.
- Score: 3.4352862428120123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The presence of abusive content on social media platforms is undesirable as it severely impedes healthy and safe social media interactions. While automatic abuse detection has been widely explored in the textual domain, audio abuse detection remains unexplored. In this paper, we attempt abuse detection in conversational audio from a multimodal perspective in a multilingual social media setting. Our key hypothesis is that, along with modelling the audio itself, incorporating discriminative information from other modalities can be highly beneficial for this task. Our proposed method, MADA, explicitly focuses on two modalities besides the audio, namely, the underlying emotions expressed in the abusive audio and the semantic information encapsulated in the corresponding textual form. Experiments show that MADA achieves gains over audio-only approaches on the ADIMA dataset. We test the proposed approach on 10 different languages and observe consistent gains of 0.6%-5.2% by leveraging multiple modalities. We also perform extensive ablation experiments to study the contribution of each modality and observe the best results when all modalities are leveraged together. Additionally, we perform experiments that empirically confirm a strong correlation between the underlying emotions and abusive behaviour.
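To make the late-fusion idea described in the abstract concrete, the sketch below combines pre-computed audio, emotion, and text embeddings in a single classifier. It is only a minimal illustration under assumed dimensions and module choices, not the actual MADA architecture, whose internals the abstract does not specify.

```python
# Illustrative late-fusion classifier combining audio, emotion, and text
# embeddings for abuse detection. This is NOT the published MADA model;
# all module choices and dimensions here are assumptions for illustration.
import torch
import torch.nn as nn

class LateFusionAbuseClassifier(nn.Module):
    def __init__(self, audio_dim=512, emotion_dim=128, text_dim=768,
                 hidden_dim=256, num_classes=2):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.emotion_proj = nn.Linear(emotion_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Simple MLP head over the concatenated modality representations.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio_emb, emotion_emb, text_emb):
        # Each input is a pre-computed utterance-level embedding: audio from a
        # speech encoder, emotion from an emotion recognizer, text from an ASR
        # transcript encoder (all hypothetical choices).
        fused = torch.cat([
            torch.relu(self.audio_proj(audio_emb)),
            torch.relu(self.emotion_proj(emotion_emb)),
            torch.relu(self.text_proj(text_emb)),
        ], dim=-1)
        return self.classifier(fused)

# Example usage with random embeddings for a batch of 4 utterances.
model = LateFusionAbuseClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 128), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```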
Related papers
- PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis [74.41260927676747]
This paper bridges the gaps by introducing a multimodal conversational Aspect-Based Sentiment Analysis (ABSA) task.
To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements.
To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism.
arXiv Detail & Related papers (2024-08-18T13:51:01Z)
- Missingness-resilient Video-enhanced Multimodal Disfluency Detection [3.3281516035025285]
We present a practical multimodal disfluency detection approach that leverages available video data together with audio.
Our resilient design accommodates real-world scenarios where the video modality may sometimes be missing during inference.
In experiments across five disfluency-detection tasks, our unified multimodal approach significantly outperforms audio-only unimodal methods.
arXiv Detail & Related papers (2024-06-11T05:47:16Z)
- Double Mixture: Towards Continual Event Detection from Speech [60.33088725100812]
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- Detecting and Grounding Multi-Modal Media Manipulation and Beyond [93.08116982163804]
We highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4)
DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content.
We propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities.
arXiv Detail & Related papers (2023-09-25T15:05:46Z)
- Dynamic Causal Disentanglement Model for Dialogue Emotion Detection [77.96255121683011]
We propose a Dynamic Causal Disentanglement Model based on hidden variable separation.
This model effectively decomposes the content of dialogues and investigates the temporal accumulation of emotions.
Specifically, we propose a dynamic temporal disentanglement model to infer the propagation of utterances and hidden variables.
arXiv Detail & Related papers (2023-09-13T12:58:09Z)
- An Empirical Study and Improvement for Speech Emotion Recognition [22.250228893114066]
Multimodal speech emotion recognition aims to detect speakers' emotions from audio and text.
In this work, we consider a simple yet important problem: how to fuse audio and text modality information.
Empirical results show that our method obtains new state-of-the-art results on the IEMOCAP dataset.
arXiv Detail & Related papers (2023-04-08T03:24:06Z)
- Hate Speech and Offensive Language Detection using an Emotion-aware Shared Encoder [1.8734449181723825]
Existing works on hate speech and offensive language detection produce promising results based on pre-trained transformer models.
This paper proposes a multi-task joint learning approach that combines external emotional features extracted from other corpora.
Our findings demonstrate that emotional knowledge helps to more reliably identify hate speech and offensive language across datasets.
arXiv Detail & Related papers (2023-02-17T09:31:06Z)
- DeepSafety: Multi-level Audio-Text Feature Extraction and Fusion Approach for Violence Detection in Conversations [2.8038382295783943]
The choice of words and the vocal cues in conversations present an underexplored, rich source of natural language data for personal safety and crime prevention.
We introduce a new information fusion approach that extracts and fuses multi-level features including verbal, vocal, and text as heterogeneous sources of information to detect the extent of violent behaviours in conversations.
arXiv Detail & Related papers (2022-06-23T16:45:50Z)
- Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods.
Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z)
- MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech [4.384576489684272]
We propose a novel approach to real-time sequence labeling in speech.
Our model treats speech and its own textual representation as two separate modalities or views.
We show significant gains from jointly learning from the two modalities when compared to text or audio only.
arXiv Detail & Related papers (2020-05-02T12:16:14Z)
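As an illustration of the two-view idea in the MultiQT entry directly above (speech and its textual representation treated as separate modalities for sequence labelling), here is a minimal sketch. The encoders, dimensions, and frame-level alignment of tokens to audio are assumptions made for illustration, not the published MultiQT model.

```python
# Illustrative two-view sequence labeller: speech features and their textual
# (ASR) representation are encoded separately and fused per time step before
# frame-level labelling. Architecture and sizes are assumptions only.
import torch
import torch.nn as nn

class TwoViewSequenceLabeller(nn.Module):
    def __init__(self, audio_dim=80, vocab_size=1000, hidden_dim=128, num_labels=3):
        super().__init__()
        self.audio_rnn = nn.GRU(audio_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.text_emb = nn.Embedding(vocab_size, hidden_dim)
        self.text_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Per-frame classifier over the concatenated audio and text states.
        self.head = nn.Linear(4 * hidden_dim, num_labels)

    def forward(self, audio_frames, token_ids):
        # Assumes the text tokens have been aligned to audio frames upstream
        # (e.g. by repeating each token over the frames it spans).
        a, _ = self.audio_rnn(audio_frames)             # (B, T, 2H)
        t, _ = self.text_rnn(self.text_emb(token_ids))  # (B, T, 2H)
        return self.head(torch.cat([a, t], dim=-1))     # per-frame label logits

# Example usage: batch of 2 sequences, 50 frames each.
model = TwoViewSequenceLabeller()
logits = model(torch.randn(2, 50, 80), torch.randint(0, 1000, (2, 50)))
print(logits.shape)  # torch.Size([2, 50, 3])
```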