Related papers: SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

URL: http://arxiv.org/abs/2502.06394v1
Date: Mon, 10 Feb 2025 12:30:25 GMT
Title: SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
Authors: Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko,
Abstract summary: Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets.<n>We introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset.
Score: 61.82799141938912
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets have superior performance to those trained on the human-annotated MultiParaDetox dataset even in data limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in few-shot setting. We release our dataset and code to help further research in multilingual text detoxification.

Related papers

LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification [44.86106619757571]
High-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation.<n>We propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification.<n>We release ParaDeHate as a benchmark of over 8K hate/non-hate text pairs and evaluate a wide range of baseline methods.<n> Experimental results show that models such as BART, fine-tuned on ParaDeHate, achieve better performance in style accuracy, content preservation, and fluency.
arXiv Detail & Related papers (2025-06-02T09:45:05Z)
Scaling Low-Resource MT via Synthetic Data Generation with LLMs [13.10398947215569]
This study focuses on seven diverse target languages.<n>We construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs.<n>We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, and (iii) testing its utility beyond English-centric MT.
arXiv Detail & Related papers (2025-05-20T14:31:54Z)
Few-shot LLM Synthetic Data with Distribution Matching [37.55363714371521]
Large language models (LLMs) produce high-quality synthetic data to enhance the performance of smaller models. LLMs-generated synthetic data often differs from the real data in key language attributes. We introduce SynAlign: a synthetic data generation and filtering framework based on key attribute distribution matching.
arXiv Detail & Related papers (2025-02-09T16:43:32Z)
SmurfCat at PAN 2024 TextDetox: Alignment of Multilingual Transformers for Text Detoxification [41.94295877935867]
This paper presents a solution for the Multilingual Text Detoxification task in the PAN-2024 competition of the SmurfCat team. Using data augmentation through machine translation and a special filtering procedure, we collected an additional multilingual parallel dataset for text detoxification. We fine-tuned several multilingual sequence-to-sequence models, such as mT0 and Aya, on a text detoxification task.
arXiv Detail & Related papers (2024-07-07T17:19:34Z)
CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models [36.82189550072201]
Existing text-to-table datasets are typically oriented English. Large language models (LLMs) have shown great success as general task solvers in multi-lingual settings. We propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task.
arXiv Detail & Related papers (2024-05-20T16:58:02Z)
MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages [71.50809576484288]
Text detoxification is a task where a text is paraphrased from a toxic surface form, e.g. featuring rude words, to the neutral register. Recent approaches for parallel text detoxification corpora collection -- ParaDetox and APPADIA -- were explored only in monolingual setup. In this work, we aim to extend ParaDetox pipeline to multiple languages presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language.
arXiv Detail & Related papers (2024-04-02T15:32:32Z)
DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation [20.703102374139537]
We propose a novel text dataset distillation approach called Distilling dataset into Language Model (DiLM) DiLM trains a language model to generate informative synthetic training samples as text data, instead of directly optimizing synthetic samples. Our code will be available at https://github.com/arumaekawa/DiLM.
arXiv Detail & Related papers (2024-03-30T06:40:54Z)
Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval [56.65147231836708]
We develop SWIM-IR, a synthetic retrieval training dataset containing 33 languages for fine-tuning multilingual dense retrievers. SAP assists the large language model (LLM) in generating informative queries in the target language. Our models, called SWIM-X, are competitive with human-supervised dense retrieval models.
arXiv Detail & Related papers (2023-11-10T00:17:10Z)
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed. We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
LLM-powered Data Augmentation for Enhanced Cross-lingual Performance [24.20730298894794]
This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in commonsense reasoning datasets. To achieve this, we utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze. We evaluate the effectiveness of fine-tuning smaller multilingual models, mBERT and XLMR, using the synthesised data.
arXiv Detail & Related papers (2023-05-23T17:33:27Z)
mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries. We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.