Contextual Hate Speech Detection in Code Mixed Text using Transformer
Based Approaches
- URL: http://arxiv.org/abs/2110.09338v1
- Date: Mon, 18 Oct 2021 14:05:36 GMT
- Title: Contextual Hate Speech Detection in Code Mixed Text using Transformer
Based Approaches
- Authors: Ravindra Nayak and Raviraj Joshi
- Abstract summary: We propose automated techniques for hate speech detection in code mixed text from Twitter.
While regular approaches analyze the text independently, we also make use of context text in the form of parent tweets.
We show that the dual-encoder approach using independent representations yields better performance.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the recent past, social media platforms have helped people in connecting
and communicating to a wider audience. But this has also led to a drastic
increase in cyberbullying. It is essential to detect and curb hate speech to
keep the sanity of social media platforms. Also, code mixed text containing
more than one language is frequently used on these platforms. We therefore
propose automated techniques for hate speech detection in code-mixed text
scraped from Twitter. We specifically focus on code-mixed English-Hindi text
and transformer-based approaches. While regular approaches analyze the text
independently, we also make use of context text in the form of parent tweets.
We evaluate the performance of multilingual BERT and Indic-BERT in
single-encoder and dual-encoder settings. The first approach is to concatenate
the target text and context text using a separator token and get a single
representation from the BERT model. The second approach encodes the two texts
independently using a dual BERT encoder and the corresponding representations
are averaged. We show that the dual-encoder approach using independent
representations yields better performance. We also employ simple ensemble
methods to further improve the performance. Using these methods we were able to
achieve the best F1 score of 73.07% on the HASOC 2021 ICHCL code mixed data
set.
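The two context-handling strategies and the ensembling step described above can be sketched in code. The snippet below uses a toy bag-of-words encoder as a stand-in for mBERT / Indic-BERT, so the approaches can be seen end to end; all names here (`toy_encode`, `SEP`, `DIM`, `ensemble_probs`) are illustrative assumptions, not from the paper's implementation.

```python
# Sketch of the single-encoder vs. dual-encoder context strategies.
# toy_encode is a hashing bag-of-words stand-in for a BERT pooled output.

SEP = "[SEP]"
DIM = 8  # toy embedding size; BERT-base would use 768


def toy_encode(text: str) -> list[float]:
    """Hash each token into a small vector and mean-pool (BERT stand-in)."""
    vec = [0.0] * DIM
    tokens = text.lower().split()
    for tok in tokens:
        vec[hash(tok) % DIM] += 1.0
    n = max(len(tokens), 1)
    return [v / n for v in vec]


def single_encoder(target: str, context: str) -> list[float]:
    """Approach 1: concatenate context and target with a separator token
    and obtain a single representation from one encoder."""
    return toy_encode(f"{context} {SEP} {target}")


def dual_encoder(target: str, context: str) -> list[float]:
    """Approach 2: encode the two texts independently and average the
    representations (the setting the paper reports works best)."""
    a, b = toy_encode(target), toy_encode(context)
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def ensemble_probs(model_probs: list[list[float]]) -> list[float]:
    """Simple ensembling: average per-class probabilities across models."""
    n = len(model_probs)
    return [sum(p) / n for p in zip(*model_probs)]
```

In a real setup the averaged representation would feed a classification head, and `ensemble_probs` would combine the softmax outputs of the mBERT and Indic-BERT variants before taking the argmax.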
Related papers
- Code-Mixed Text to Speech Synthesis under Low-Resource Constraints [6.544954579068865]
We describe our approaches for production quality code-mixed Hindi-English TTS systems built for e-commerce applications.
We propose a data-oriented approach by utilizing monolingual data sets in individual languages.
We show that such single script bi-lingual training without any code-mixing works well for pure code-mixed test sets.
arXiv Detail & Related papers (2023-12-02T10:40:38Z) - Augmenting text for spoken language understanding with Large Language
Models [13.240782495441275]
We show how to use transcript-semantic parse data (unpaired text) without corresponding speech.
Experiments show that unpaired text from existing and new domains improves performance by 2% and 30% in absolute Exact Match (EM) respectively.
We propose to prompt Large Language Models (LLMs) to generate unpaired text for existing and new domains.
arXiv Detail & Related papers (2023-09-17T22:25:34Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a commonly used industrial streaming model, Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation.
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
arXiv Detail & Related papers (2022-07-03T04:26:53Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z) - Role of Artificial Intelligence in Detection of Hateful Speech for
Hinglish Data on Social Media [1.8899300124593648]
The prevalence of Hindi-English code-mixed data (Hinglish) is on the rise among urban populations all over the world.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient detection of hateful speech in unstructured code-mixed Hinglish text.
arXiv Detail & Related papers (2021-05-11T10:02:28Z) - CMSAOne@Dravidian-CodeMix-FIRE2020: A Meta Embedding and Transformer
model for Code-Mixed Sentiment Analysis on Social Media Text [9.23545668304066]
Code-mixing (CM) is a frequently observed phenomenon that uses multiple languages in an utterance or sentence.
Sentiment analysis (SA) is a fundamental step in NLP and is well studied for monolingual text.
This paper proposes a meta embedding with a transformer method for sentiment analysis on the Dravidian code-mixed dataset.
arXiv Detail & Related papers (2021-01-22T08:48:27Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z) - Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z) - IIT Gandhinagar at SemEval-2020 Task 9: Code-Mixed Sentiment
Classification Using Candidate Sentence Generation and Selection [1.2301855531996841]
Code-mixing adds to the challenge of analyzing the sentiment of the text due to the non-standard writing style.
We present a candidate sentence generation and selection based approach on top of the Bi-LSTM based neural classifier.
The proposed approach shows an improvement in the system performance as compared to the Bi-LSTM based neural classifier.
arXiv Detail & Related papers (2020-06-25T14:59:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.