SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
- URL: http://arxiv.org/abs/2502.06394v1
- Date: Mon, 10 Feb 2025 12:30:25 GMT
- Title: SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
- Authors: Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko
- Abstract summary: Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets.
We introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset.
- Score: 61.82799141938912
- License:
- Abstract: Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets have superior performance to those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to help further research in multilingual text detoxification.
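The few-shot rewriting step mentioned in the abstract can be illustrated with a minimal sketch of prompt construction. This is not the authors' prompt or code: the `Pair` type, the `build_few_shot_prompt` function, the instruction wording, and the example sentences are all hypothetical placeholders, and the call to an actual LLM is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One parallel example: a toxic sentence and its neutral rewrite."""
    toxic: str
    neutral: str


def build_few_shot_prompt(language: str, shots: list[Pair], toxic_text: str) -> str:
    """Assemble a few-shot detoxification prompt (hypothetical wording).

    The prompt lists the demonstration pairs and ends with the new toxic
    input, leaving the final 'Neutral:' slot open for the model to fill.
    """
    header = (
        f"Rewrite the following {language} text so that it is polite "
        "and non-toxic while keeping its meaning."
    )
    blocks = [f"Toxic: {p.toxic}\nNeutral: {p.neutral}" for p in shots]
    blocks.append(f"Toxic: {toxic_text}\nNeutral:")
    return header + "\n\n" + "\n\n".join(blocks)


# Placeholder demonstrations, not taken from SynthDetoxM or MultiParaDetox.
shots = [
    Pair("This idea is damn stupid.", "This idea is not good."),
    Pair("Shut up and listen.", "Please listen."),
]
prompt = build_few_shot_prompt("English", shots, "What a dumb question.")
print(prompt)
```

The resulting string would be sent to whichever LLM is used for annotation; filtering or ranking the generated rewrites is a separate step not shown here.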
Related papers
- Few-shot LLM Synthetic Data with Distribution Matching [37.55363714371521]
Large language models (LLMs) produce high-quality synthetic data to enhance the performance of smaller models.
LLM-generated synthetic data often differs from real data in key language attributes.
We introduce SynAlign: a synthetic data generation and filtering framework based on key attribute distribution matching.
arXiv Detail & Related papers (2025-02-09T16:43:32Z)
- SmurfCat at PAN 2024 TextDetox: Alignment of Multilingual Transformers for Text Detoxification [41.94295877935867]
This paper presents the SmurfCat team's solution for the Multilingual Text Detoxification task in the PAN-2024 competition.
Using data augmentation through machine translation and a special filtering procedure, we collected an additional multilingual parallel dataset for text detoxification.
We fine-tuned several multilingual sequence-to-sequence models, such as mT0 and Aya, on a text detoxification task.
arXiv Detail & Related papers (2024-07-07T17:19:34Z)
- CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models [36.82189550072201]
Existing text-to-table datasets are typically English-oriented.
Large language models (LLMs) have shown great success as general task solvers in multi-lingual settings.
We propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task.
arXiv Detail & Related papers (2024-05-20T16:58:02Z)
- MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages [71.50809576484288]
Text detoxification is a task where a text is paraphrased from a toxic surface form, e.g. featuring rude words, to a neutral register.
Recent approaches for parallel text detoxification corpora collection -- ParaDetox and APPADIA -- were explored only in a monolingual setup.
In this work, we aim to extend the ParaDetox pipeline to multiple languages, presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language.
arXiv Detail & Related papers (2024-04-02T15:32:32Z)
- DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation [20.703102374139537]
We propose a novel text dataset distillation approach called Distilling dataset into Language Model (DiLM).
DiLM trains a language model to generate informative synthetic training samples as text data, instead of directly optimizing synthetic samples.
Our code will be available at https://github.com/arumaekawa/DiLM.
arXiv Detail & Related papers (2024-03-30T06:40:54Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval [56.65147231836708]
We develop SWIM-IR, a synthetic retrieval training dataset containing 33 languages for fine-tuning multilingual dense retrievers.
SAP (summarize-then-ask prompting) assists the large language model (LLM) in generating informative queries in the target language.
Our models, called SWIM-X, are competitive with human-supervised dense retrieval models.
arXiv Detail & Related papers (2023-11-10T00:17:10Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)