Toxicity Classification in Ukrainian
- URL: http://arxiv.org/abs/2404.17841v1
- Date: Sat, 27 Apr 2024 09:20:13 GMT
- Title: Toxicity Classification in Ukrainian
- Authors: Daryna Dementieva, Valeriia Khylenko, Nikolay Babakov, Georg Groh
- Abstract summary: Labeled binary toxicity classification corpora are not available for all languages, which is understandable given the resource-intensive nature of the annotation process.
In this study, we aim to fill this gap by investigating cross-lingual knowledge transfer techniques and creating labeled corpora by: (i) translating from an English corpus, (ii) filtering toxic samples using keywords, and (iii) annotating with crowdsourcing.
- Score: 11.847477933042777
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Toxicity detection remains a relevant task, especially in the context of safe and fair LM development. Nevertheless, labeled binary toxicity classification corpora are not available for all languages, which is understandable given the resource-intensive nature of the annotation process. Ukrainian, in particular, is among the languages lacking such resources. To our knowledge, no toxicity classification corpus exists for Ukrainian. In this study, we aim to fill this gap by investigating cross-lingual knowledge transfer techniques and creating labeled corpora by: (i) translating from an English corpus, (ii) filtering toxic samples using keywords, and (iii) annotating with crowdsourcing. We compare LLM prompting and other cross-lingual transfer approaches, with and without fine-tuning, offering insights into the most robust and efficient baselines.
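As a rough illustration of step (ii) in the abstract, keyword-based filtering can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the lexicon `TOXIC_KEYWORDS`, the helper names, and the sample sentences are all hypothetical placeholders, and the paper's actual keyword list and matching rules are not reproduced here.

```python
import re

# Hypothetical placeholder lexicon (mild insults only); NOT the paper's keyword list.
TOXIC_KEYWORDS = {"дурень", "ідіот"}

# \w matches Unicode letters in Python 3, so Cyrillic words are tokenized correctly.
TOKEN_RE = re.compile(r"\w+")


def contains_toxic_keyword(text, keywords=TOXIC_KEYWORDS):
    """Return True if any lowercased token of `text` is in the keyword set."""
    tokens = (token.lower() for token in TOKEN_RE.findall(text))
    return any(token in keywords for token in tokens)


def split_by_keywords(texts, keywords=TOXIC_KEYWORDS):
    """Split texts into (flagged, rest) by exact keyword match on tokens."""
    flagged, rest = [], []
    for text in texts:
        if contains_toxic_keyword(text, keywords):
            flagged.append(text)
        else:
            rest.append(text)
    return flagged, rest


# Hypothetical sample sentences for illustration only.
samples = ["Ти дурень!", "Гарного дня всім.", "Який ІДІОТ це написав?"]
flagged, rest = split_by_keywords(samples)
```

Exact token matching like this misses inflected forms, which matters for a morphologically rich language such as Ukrainian, so a real filter would likely need lemmatization or stem matching. Samples flagged this way are candidates only and would still need human verification.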
Related papers
- Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties [23.777874316083984]
There has been little systematic study on how dialectal differences affect toxicity detection by modern LLMs.
We create a multi-dialect dataset through synthetic transformations and human-assisted translations, covering 10 language clusters and 60 varieties.
We then evaluated three LLMs on their ability to assess toxicity across multilingual, dialectal, and LLM-human consistency.
arXiv Detail & Related papers (2024-11-17T03:53:24Z)
- PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models [27.996123856250065]
Existing toxicity benchmarks are overwhelmingly focused on English.
We introduce PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages.
arXiv Detail & Related papers (2024-05-15T14:22:33Z)
- Ukrainian Texts Classification: Exploration of Cross-lingual Knowledge Transfer Approaches [11.508759658889382]
There is a tremendous lack of Ukrainian corpora for typical text classification tasks.
We explore cross-lingual knowledge transfer methods avoiding manual data curation.
We test the approaches on three text classification tasks.
arXiv Detail & Related papers (2024-04-02T15:37:09Z)
- From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models [10.807067327137855]
As language models embrace multilingual capabilities, it's crucial our safety measures keep pace.
In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhance our mitigation techniques.
This allows us to examine the effects of translation quality and the cross-lingual transfer on toxicity mitigation.
arXiv Detail & Related papers (2024-03-06T17:51:43Z)
- Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification [77.45995868988301]
Text detoxification is the task of transferring the style of text from toxic to neutral.
We present a large-scale study of strategies for cross-lingual text detoxification.
arXiv Detail & Related papers (2023-11-23T11:40:28Z)
- Toxicity Detection with Generative Prompt-based Inference [3.9741109244650823]
It is a long-known risk that language models (LMs), once trained on a corpus containing undesirable content, can manifest biases and toxicity.
In this work, we explore the generative variant of zero-shot prompt-based toxicity detection with comprehensive trials on prompt engineering.
arXiv Detail & Related papers (2022-05-24T22:44:43Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.