Multi-modal Hate Speech Detection using Machine Learning
- URL: http://arxiv.org/abs/2307.11519v1
- Date: Thu, 15 Jun 2023 06:46:52 GMT
- Title: Multi-modal Hate Speech Detection using Machine Learning
- Authors: Fariha Tahosin Boishakhi, Ponkoj Chandra Shill, Md. Golam Rabiul Alam
- Abstract summary: A combined multimodal approach is proposed to detect hate speech in video content by extracting image features, audio feature values, and text, and applying machine learning and natural language processing.
- Score: 0.6793286055326242
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: With the continuous growth of internet users and media content, it is very
hard to track down hateful speech in audio and video. Converting video or audio
into text alone does not detect hate speech accurately, because people sometimes
use hateful words in a humorous or pleasant sense, and meaning is also carried by
voice tone and by the actions shown in the video. State-of-the-art hate speech
detection models have mostly been developed for a single modality. In this research,
a combined multimodal approach is proposed to detect hate speech in video content
by extracting image features, audio features, and text, and applying machine
learning and natural language processing.
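
The abstract describes a late-fusion design: features from the video frames, the audio track, and the transcribed text are combined and fed to a classical machine learning classifier. Below is a minimal illustrative sketch of such a pipeline, assuming MFCC audio features, TF-IDF text features, simple frame statistics, and an SVM; the paper's exact features and classifier are not reproduced here, and the helper names are placeholders.

```python
# Minimal sketch of a late-fusion pipeline: audio, text and image features are
# extracted separately, concatenated, and fed to a classical ML classifier.
# Feature choices (MFCCs, TF-IDF, frame statistics) are illustrative, not the
# authors' exact configuration.
import numpy as np
import librosa
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def audio_features(wav_path: str) -> np.ndarray:
    """Mean MFCC vector of the audio track."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

def frame_features(frames: np.ndarray) -> np.ndarray:
    """Very coarse visual descriptor: per-channel mean and std over all frames."""
    # frames: (num_frames, height, width, 3) array sampled from the video
    return np.concatenate([frames.mean(axis=(0, 1, 2)), frames.std(axis=(0, 1, 2))])

# Transcripts would come from a speech-to-text step (not shown here).
transcripts = ["example transcript one", "example transcript two"]
labels = [0, 1]  # 0 = non-hate, 1 = hate

tfidf = TfidfVectorizer(max_features=500)
text_matrix = tfidf.fit_transform(transcripts).toarray()

# audio_matrix / visual_matrix would be the stacked audio_features / frame_features
# outputs for each video; random placeholders keep the sketch self-contained.
rng = np.random.default_rng(0)
audio_matrix = rng.normal(size=(len(labels), 20))
visual_matrix = rng.normal(size=(len(labels), 6))

fused = np.hstack([text_matrix, audio_matrix, visual_matrix])
clf = SVC(kernel="rbf").fit(fused, labels)
print(clf.predict(fused))
```

Late fusion of this kind keeps each modality's feature extractor independent, which is convenient when one modality is missing or noisy.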
Related papers
- Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning [4.136573141724715]
Hate speech on the internet poses a significant challenge to digital platform safety.
Recent research has developed detection models tailored to specific modalities.
This study conducts extensive experiments using few-shot in-context learning with large language models.
arXiv Detail & Related papers (2024-10-08T01:27:12Z)
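
For the few-shot in-context learning setup mentioned in the entry above, a prompt is typically assembled from a handful of labelled examples followed by the query. The sketch below is a generic illustration, assuming a hypothetical `query_llm` callable and placeholder examples; the study's actual prompt template, examples, and model are not specified here.

```python
# Sketch of few-shot in-context prompting for hate speech classification.
# The examples and the LLM client are hypothetical placeholders; the cited
# study's actual prompt format and model are not reproduced here.
FEW_SHOT_EXAMPLES = [
    ("<transcript of a video containing hateful content>", "hate"),
    ("<transcript of a benign video>", "not hate"),
]

def build_prompt(text: str) -> str:
    lines = ["Classify each message as 'hate' or 'not hate'.", ""]
    for example, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Message: {example}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Message: {text}")
    lines.append("Label:")
    return "\n".join(lines)

def classify(text: str, query_llm) -> str:
    """query_llm is any callable that sends a prompt string to an LLM and returns its reply."""
    answer = query_llm(build_prompt(text)).strip().lower()
    return "hate" if answer.startswith("hate") else "not hate"
```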
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- HateMM: A Multi-Modal Dataset for Hate Video Classification [8.758311170297192]
We build deep learning multi-modal models to classify the hate videos and observe that using all the modalities improves the overall hate speech detection performance.
Our work takes the first step toward understanding and modeling hateful videos on video hosting platforms such as BitChute.
arXiv Detail & Related papers (2023-05-06T03:39:00Z)
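
A generic late-fusion classifier of the kind the HateMM entry alludes to is sketched below: per-modality feature vectors are concatenated and passed through a small MLP. The dimensions and architecture are illustrative assumptions, not the HateMM models themselves.

```python
# Sketch of a late-fusion multi-modal classifier: per-modality feature vectors
# (e.g. from pretrained text, audio and vision encoders) are concatenated and
# passed through a small MLP. Dimensions are illustrative, not HateMM's.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, video_dim=512, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + audio_dim + video_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 2),  # hate vs. non-hate logits
        )

    def forward(self, text_feat, audio_feat, video_feat):
        fused = torch.cat([text_feat, audio_feat, video_feat], dim=-1)
        return self.head(fused)

# Zeroing out one input (e.g. audio_feat) is a crude way to probe the
# "all modalities help" observation reported in the entry above.
model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 2])
```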
- Emotion Based Hate Speech Detection using Multimodal Learning [0.0]
This paper proposes the first multimodal deep learning framework to combine the auditory features representing emotion and the semantic features to detect hateful content.
Our results demonstrate that incorporating emotional attributes leads to significant improvement over text-based models in detecting hateful multimedia content.
arXiv Detail & Related papers (2022-02-13T05:39:47Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
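
The label-imbalance issue raised in the cross-lingual entry is commonly mitigated with class weighting. The sketch below shows one standard way to do this with scikit-learn; it is a generic illustration, not the paper's training setup.

```python
# Generic illustration of handling label imbalance with class weights;
# the cited paper's actual training setup is not reproduced here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a strong non-hate (0) majority, mimicking typical hate speech datasets.
labels = np.array([0] * 90 + [1] * 10)
features = np.random.default_rng(0).normal(size=(100, 16))

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=labels)
print(dict(zip([0, 1], weights)))  # the minority (hate) class receives the larger weight

# Equivalent shortcut: pass class_weight="balanced" directly to the estimator.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(features, labels)
```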
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- Speech2Video: Cross-Modal Distillation for Speech to Video Generation [21.757776580641902]
Speech-to-video generation techniques can spark interesting applications in entertainment, customer service, and human-computer interaction.
The challenge mainly lies in disentangling the distinct visual attributes from audio signals.
We propose a light-weight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs.
arXiv Detail & Related papers (2021-07-10T10:27:26Z)
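
Cross-modal distillation, as mentioned in the Speech2Video entry, generically means training a student network on one modality to match a teacher's representations from another. The sketch below follows that generic reading; the encoders, dimensions, and plain MSE objective are illustrative assumptions, not the paper's method.

```python
# Generic cross-modal distillation sketch: an audio "student" is trained to
# match embeddings produced by a frozen video "teacher". Encoders, dimensions
# and the MSE objective are illustrative assumptions only.
import torch
import torch.nn as nn

audio_student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
video_teacher = nn.Sequential(nn.Linear(512, 64))  # stands in for a pretrained, frozen encoder
for p in video_teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(audio_student.parameters(), lr=1e-3)
audio_batch, video_batch = torch.randn(8, 128), torch.randn(8, 512)

with torch.no_grad():
    target = video_teacher(video_batch)           # teacher embeddings from the video modality
loss = nn.functional.mse_loss(audio_student(audio_batch), target)
loss.backward()
optimizer.step()
```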
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
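
A shared audio-visual embedding space of the kind the AVLnet entry describes is often learned with a contrastive objective. The InfoNCE-style sketch below illustrates that general idea only; AVLnet's actual loss, encoders, and training data handling are not reproduced here.

```python
# InfoNCE-style contrastive sketch for aligning audio and visual clip embeddings
# in a shared space. This illustrates the general idea only; AVLnet's actual
# objective and encoders are not reproduced here.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (batch, dim) embeddings of matching clips."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = audio_emb @ video_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(audio_emb.size(0))          # i-th audio matches i-th video clip
    # Symmetric cross-entropy: audio-to-video and video-to-audio retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
print(float(loss))
```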
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)