Low-Resource Counterspeech Generation for Indic Languages: The Case of
Bengali and Hindi
- URL: http://arxiv.org/abs/2402.07262v1
- Date: Sun, 11 Feb 2024 18:09:50 GMT
- Title: Low-Resource Counterspeech Generation for Indic Languages: The Case of
Bengali and Hindi
- Authors: Mithun Das, Saurabh Kumar Pandey, Shivansh Sethi, Punyajoy Saha,
Animesh Mukherjee
- Abstract summary: We bridge the gap for low-resource languages such as Bengali and Hindi.
We create a benchmark dataset of 5,062 abusive speech/counterspeech pairs.
We observe that the monolingual setup yields the best performance.
- Score: 11.117463901375602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rise of online abuse, the NLP community has begun investigating the
use of neural architectures to generate counterspeech that can "counter" the
vicious tone of such abusive speech and dilute/ameliorate its rippling effect
over the social network. However, most of the efforts so far have been
primarily focused on English. To bridge the gap for low-resource languages such
as Bengali and Hindi, we create a benchmark dataset of 5,062 abusive
speech/counterspeech pairs, of which 2,460 pairs are in Bengali and 2,602 pairs
are in Hindi. We implement several baseline models covering various
interlingual transfer mechanisms and configurations to generate suitable
counterspeech, thereby establishing an effective benchmark. We observe that the
monolingual setup yields the best performance. Further, using synthetic
transfer, language models can generate counterspeech to some extent;
specifically, we notice that transferability is better when languages belong to
the same language family.
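As a rough illustration of the monolingual setup described above, the sketch below fine-tunes a multilingual seq2seq model on abusive-speech/counterspeech pairs from a single language. The model choice (mT5), hyperparameters, and toy data are assumptions for illustration, not the paper's exact configuration.

```python
# A minimal sketch, assuming a multilingual seq2seq model (mT5) and a tiny
# in-memory dataset; the real benchmark has 2,460 Bengali pairs.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/mt5-small"  # assumption: any multilingual seq2seq model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical placeholder pairs (source = abusive post, target = counterspeech).
pairs = Dataset.from_dict({
    "abusive": ["<abusive post 1>", "<abusive post 2>"],
    "counter": ["<counterspeech 1>", "<counterspeech 2>"],
})

def tokenize(batch):
    # Encode the abusive post as input and the counterspeech as labels.
    enc = tokenizer(batch["abusive"], truncation=True, max_length=256)
    enc["labels"] = tokenizer(text_target=batch["counter"],
                              truncation=True, max_length=256)["input_ids"]
    return enc

train = pairs.map(tokenize, batched=True, remove_columns=pairs.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="counterspeech-mono",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

In the monolingual setup both the training pairs and the evaluation pairs come from the same language; the interlingual-transfer variants would instead train on one language (or synthetic translations) and evaluate on another.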
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of speech translation (ST) of code-switched speech in Indian languages to English text.
We present CoSTA, a new end-to-end model architecture that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens [31.83988006684616]
We show that supplementing the target language with data from a similar, higher-resource 'donor' language can help.
For example, continued pre-training on only 10 hours of low-resource Punjabi supplemented with 60 hours of donor Hindi is almost as good as continued pre-training on 70 hours of Punjabi.
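The data recipe behind this result, mixing a small target corpus with a larger related donor corpus before continued pre-training, might look roughly like the sketch below; the dataset identifiers are hypothetical placeholders, not the paper's actual corpora.

```python
# A hedged sketch of donor-language supplementation, assuming two hypothetical
# speech corpora hosted on the Hugging Face Hub.
from datasets import Audio, concatenate_datasets, load_dataset

punjabi = load_dataset("your-org/punjabi-speech", split="train")  # placeholder, ~10 h
hindi = load_dataset("your-org/hindi-speech", split="train")      # placeholder, ~60 h

# Resample both corpora to a common rate so one acoustic model can consume
# them, then interleave target and donor utterances.
common = [ds.cast_column("audio", Audio(sampling_rate=16_000))
          for ds in (punjabi, hindi)]
mixed = concatenate_datasets(common).shuffle(seed=0)

# `mixed` would feed continued self-supervised pre-training of the acoustic
# model, before any supervised fine-tuning on the Punjabi data alone.
```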
arXiv Detail & Related papers (2024-02-03T23:54:03Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Fine-tuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, fine-tuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
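One plausible building block for such a chain is an orthogonal Procrustes map learned from anchor word pairs between adjacent languages. The sketch below shows that generic alignment step on synthetic data; it is an assumption standing in for the paper's exact procedure.

```python
# A minimal numpy sketch of one chain link: map language i+1's embedding
# space onto language i's (already aligned) space via orthogonal Procrustes.
import numpy as np

def procrustes_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimizing ||src @ W - tgt||_F."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

rng = np.random.default_rng(0)
dim, n_anchors = 300, 1000
prev_space = rng.standard_normal((n_anchors, dim))  # language i anchors, in the shared space
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
next_space = prev_space @ rotation                  # language i+1 anchors, in their own space

w = procrustes_map(next_space, prev_space)          # learn the link i+1 -> i
assert np.allclose(next_space @ w, prev_space, atol=1e-6)
```

Chaining such maps from the resource-rich source toward the target keeps every intermediate language anchored to one shared space.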
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages [0.6526824510982802]
This work delves into the domain of hate speech detection, placing specific emphasis on three low-resource Indian languages: Bengali, Assamese, and Gujarati.
The challenge is framed as a text classification task, aimed at discerning whether a tweet contains offensive or non-offensive content.
We fine-tuned pre-trained BERT and SBERT models to evaluate their effectiveness in identifying hate speech.
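A minimal sketch of this fine-tuning setup, assuming a multilingual BERT encoder with a binary classification head; the model choice, hyperparameters, and toy tweets are illustrative, not the paper's exact configuration.

```python
# Fine-tune a pre-trained encoder for binary offensive/non-offensive
# classification of tweets (a hedged sketch, not the authors' code).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"  # assumption: any BERT-style encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# Hypothetical labeled tweets: 1 = offensive, 0 = non-offensive.
data = Dataset.from_dict({"text": ["<tweet 1>", "<tweet 2>"],
                          "label": [1, 0]})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="offense-clf", num_train_epochs=3),
    train_dataset=data,
).train()
```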
arXiv Detail & Related papers (2023-10-03T17:53:09Z)
- Model Adaptation for ASR in low-resource Indian Languages [28.02064068964355]
Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly enabled by self-supervised learning (SSL) based acoustic models like wav2vec2 and large-scale multi-lingual training like Whisper.
A huge challenge still exists for low-resource languages where the availability of both audio and text is limited.
This is where a lot of adaptation and fine-tuning techniques can be applied to overcome the low-resource nature of the data by utilising well-resourced similar languages.
It could be the case that an abundance of acoustic data in a language reduces the need for large text-only corpora.
arXiv Detail & Related papers (2023-07-16T05:25:51Z)
- Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples [89.16814518860357]
The objective of this work is to explore the learning of visually grounded speech (VGS) models from a multilingual perspective.
Our key contribution in this work is to leverage the power of a high-resource language in a bilingual visually grounded speech model to improve the performance of a low-resource language.
arXiv Detail & Related papers (2023-03-30T16:34:10Z)
- Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages [34.79533646549939]
We explore the impact of leveraging the relatedness of languages that belong to the same family in NLP models using multilingual fine-tuning.
Low-resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning.
arXiv Detail & Related papers (2021-09-22T06:37:39Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
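A hedged sketch of this kind of analysis: embed a sample sentence per language with a multilingual encoder, mean-pool the hidden states into one vector per language, and cluster those vectors. The model, language set, sentence placeholders, and number of clusters are all assumptions; the paper's actual pipeline averages over far more data.

```python
# Cluster languages by their pre-trained representations (illustrative only).
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# Hypothetical: one parallel sentence per language (use many in practice).
sentences = {"bn": "<sentence>", "hi": "<sentence>", "ta": "<sentence>",
             "fi": "<sentence>", "et": "<sentence>"}

with torch.no_grad():
    reps = []
    for text in sentences.values():
        out = model(**tokenizer(text, return_tensors="pt"))
        # Mean-pool token states into a single language vector.
        reps.append(out.last_hidden_state.mean(dim=1).squeeze(0))
    reps = torch.stack(reps).numpy()

# Each resulting cluster is a candidate "representation sprachbund".
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reps)
print(dict(zip(sentences, labels)))
```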
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media [1.8899300124593648]
The prevalence of Hindi-English code-mixed data (Hinglish) is on the rise among much of the urban population around the world.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient hate speech detection in unstructured, code-mixed Hinglish text.
arXiv Detail & Related papers (2021-05-11T10:02:28Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
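Since the XLSR checkpoint is publicly released, extracting its cross-lingual speech representations can be sketched as below; the random waveform is a placeholder, and downstream ASR fine-tuning would add a CTC head (e.g., Wav2Vec2ForCTC) on top.

```python
# Extract cross-lingual speech representations from the released XLSR model
# (a minimal sketch with a placeholder waveform).
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "facebook/wav2vec2-large-xlsr-53"  # the multilingual XLSR checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name)

waveform = torch.randn(16_000)  # placeholder: one second of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16_000,
                   return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, frames, 1024)
print(hidden.shape)
```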
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.