cantnlp@LT-EDI-2024: Automatic Detection of Anti-LGBTQ+ Hate Speech in
Under-resourced Languages
- URL: http://arxiv.org/abs/2401.15777v1
- Date: Sun, 28 Jan 2024 21:58:04 GMT
- Title: cantnlp@LT-EDI-2024: Automatic Detection of Anti-LGBTQ+ Hate Speech in
Under-resourced Languages
- Authors: Sidney G.-J. Wong and Matthew Durward
- Abstract summary: This paper describes our homophobia/transphobia in social media comments detection system developed as part of the shared task at LT-EDI-2024.
We took a transformer-based approach to develop our multiclass classification model for ten language conditions.
We introduced synthetic and organic instances of script-switched language data during domain adaptation to mirror the linguistic realities of social media language.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes our homophobia/transphobia in social media comments
detection system developed as part of the shared task at LT-EDI-2024. We took a
transformer-based approach to develop our multiclass classification model for
ten language conditions (English, Spanish, Gujarati, Hindi, Kannada, Malayalam,
Marathi, Tamil, Tulu, and Telugu). We introduced synthetic and organic
instances of script-switched language data during domain adaptation to mirror
the linguistic realities of social media language as seen in the labelled
training data. Our system ranked second for Gujarati and Telugu with varying
levels of performance for other language conditions. The results suggest
incorporating elements of paralinguistic behaviour such as script-switching may
improve the performance of language detection systems especially in the cases
of under-resourced languages conditions.
Related papers
- Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages [0.0]
In multilingual societies like India, text often exhibits code-mixing, blending local languages with English at different linguistic levels.
This paper introduces a prompt based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages.
In this work, we leveraged GPT-3.5 Turbo to understand whether the large language models is able to correctly classify words into correct categories.
arXiv Detail & Related papers (2024-11-06T16:20:37Z) - Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z) - An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z) - cantnlp@LT-EDI-2023: Homophobia/Transphobia Detection in Social Media
Comments using Spatio-Temporally Retrained Language Models [0.9012198585960441]
This paper describes our multiclass classification system developed as part of the LTERAN@LP-2023 shared task.
We used a BERT-based language model to detect homophobic and transphobic content in social media comments across five language conditions.
We developed the best performing seven-label classification system for Malayalam based on weighted macro averaged F1 score.
arXiv Detail & Related papers (2023-08-20T21:30:34Z) - KIT's Multilingual Speech Translation System for IWSLT 2023 [58.5152569458259]
We describe our speech translation system for the multilingual track of IWSLT 2023.
The task requires translation into 10 languages of varying amounts of resources.
Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation.
arXiv Detail & Related papers (2023-06-08T16:13:20Z) - Data and knowledge-driven approaches for multilingual training to
improve the performance of speech recognition systems of Indian languages [0.0]
We propose data and knowledge-driven approaches for multilingual training of the automated speech recognition system for a target language.
In phone/senone mapping, deep neural network (DNN) learns to map senones or phones from one language to the others.
In the other approach, we model the acoustic information for all the languages simultaneously.
arXiv Detail & Related papers (2022-01-24T07:17:17Z) - Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18% points, in terms of F-score, for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z) - Cross-lingual Offensive Language Identification for Low Resource
Languages: The Case of Marathi [2.4737119633827174]
MOLD is the first dataset of its kind compiled for Marathi, opening a new domain for research in low-resource Indo-Aryan languages.
We present results from several machine learning experiments on this dataset, including zero-short and other transfer learning experiments on state-of-the-art cross-lingual transformers.
arXiv Detail & Related papers (2021-09-08T11:29:44Z) - Role of Artificial Intelligence in Detection of Hateful Speech for
Hinglish Data on Social Media [1.8899300124593648]
Prevalence of Hindi-English code-mixed data (Hinglish) is on the rise with most of the urban population all over the world.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient detection of unstructured code-mix Hinglish language.
arXiv Detail & Related papers (2021-05-11T10:02:28Z) - Multilingual and code-switching ASR challenges for low resource Indian
languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.