AtteSTNet -- An attention and subword tokenization based approach for
code-switched text hate speech detection
- URL: http://arxiv.org/abs/2112.11479v3
- Date: Tue, 28 Mar 2023 08:38:01 GMT
- Authors: Geet Shingi, Vedangi Wagh, Kishor Wagh, Sharmila Wagh
- Abstract summary: Language used in social media is often a combination of English and the native language in the region.
In India, Hindi is used predominantly and is often code-switched with English, giving rise to the Hinglish (Hindi+English) language.
- Score: 1.3190581566723918
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in technology have boosted social media usage,
which has in turn produced large amounts of user-generated data, including
hateful and offensive speech. The language used in social media is
often a combination of English and the native language in the region. In India,
Hindi is used predominantly and is often code-switched with English, giving
rise to the Hinglish (Hindi+English) language. Various approaches have been
proposed in the past to classify code-mixed Hinglish hate speech using
different machine learning and deep learning-based techniques. However, these
techniques rely on recurrence or convolution mechanisms, which are
computationally expensive and have high memory requirements. Past techniques
also rely on complex data processing, which makes them complicated and hard
to adapt to changes in the data. The proposed work offers a much
simpler approach that not only matches these complex networks but also
exceeds their performance by using subword tokenization algorithms such as BPE
and Unigram together with multi-head attention-based techniques, achieving an
accuracy of 87.41% and an F1 score of 0.851 on standard datasets. Efficient use
of the BPE and Unigram algorithms helps handle the nonconventional Hinglish
vocabulary, making the proposed technique simple, efficient and sustainable to
use in the real world.
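To illustrate why subword tokenization helps with nonconventional Hinglish spellings, the following is a minimal sketch of byte-pair encoding (BPE) merge learning in plain Python. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, the tie-breaking rule, and the toy romanized word list are all assumptions made for illustration. A real system would more likely train BPE or Unigram models with an established tokenizer library.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy version)."""
    # Start from characters, with an assumed end-of-word marker "</w>".
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic (an arbitrary choice here).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical spelling variants of one romanized Hindi word.
    corpus = ["bahut", "bahot", "bohot", "bahut"]
    merges = learn_bpe(corpus, num_merges=3)
    print(merges)
    print(segment("bhot", merges))  # an unseen spelling still maps to known pieces
```

Because every word is ultimately built from characters, an unseen spelling variant never falls outside the vocabulary; it is simply split into smaller learned pieces, which is the property the abstract relies on for Hinglish text.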
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- A simple language-agnostic yet very strong baseline system for hate speech and offensive content identification [0.0]
A system based on a classical supervised algorithm only fed with character n-grams, and thus completely language-agnostic, is proposed.
It reached a medium performance level in English, the language for which it is easy to develop deep learning approaches.
It even ranks first when performance is averaged over the three tasks in these languages, outperforming many deep learning approaches.
arXiv Detail & Related papers (2022-02-05T08:09:09Z)
- Reducing language context confusion for end-to-end code-switching automatic speech recognition [50.89821865949395]
We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model.
By calculating the respective attention of multiple languages, our method can efficiently transfer language knowledge from rich monolingual data.
arXiv Detail & Related papers (2022-01-28T14:39:29Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching [41.88097793717185]
Code-Switching (CS) is a common linguistic phenomenon in multilingual communities.
This paper presents our investigations on end-to-end speech recognition for Mandarin-English CS speech.
arXiv Detail & Related papers (2021-12-19T17:31:15Z)
- Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages [2.927129789938848]
The HASOC track intends to develop and optimize Hate Speech detection algorithms for Hindi, German and English.
The dataset is collected from a Twitter archive and pre-classified by a machine learning system.
Overall, 252 runs were submitted by 40 teams. The best classification algorithms for task A achieve F1 measures of 0.51, 0.53 and 0.52 for English, Hindi, and German, respectively.
arXiv Detail & Related papers (2021-08-12T19:02:53Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media [1.8899300124593648]
The prevalence of Hindi-English code-mixed data (Hinglish) is on the rise among most of the urban population all over the world.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient detection of hate speech in unstructured code-mixed Hinglish text.
arXiv Detail & Related papers (2021-05-11T10:02:28Z)
- TopicBERT: A Transformer transfer learning based memory-graph approach for multimodal streaming social media topic detection [8.338441212378587]
Social networks, with their bursty short messages and large-scale data spread across a vast variety of topics, are of interest to many researchers.
These properties of social networks, known as the 5 Vs of big data, have led to many unique and enlightening algorithms and techniques being applied to large social networking datasets and data streams.
arXiv Detail & Related papers (2020-08-16T10:39:50Z)
- Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting.
Our model learns to recognize individual languages, and transfer them so as to better recognize mixed-language speech by conditioning the optimization on the code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.