TextMachina: Seamless Generation of Machine-Generated Text Datasets
- URL: http://arxiv.org/abs/2401.03946v2
- Date: Fri, 12 Apr 2024 09:52:05 GMT
- Title: TextMachina: Seamless Generation of Machine-Generated Text Datasets
- Authors: Areg Mikael Sarvazyan, José Ángel González, Marc Franco-Salvador,
- Abstract summary: TextMachina is a Python framework designed to aid in the creation of high-quality, unbiased datasets.
It provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets.
The quality of the datasets generated by TextMachina has been assessed in previous works.
- Score: 2.4578723416255754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have led to high-quality Machine-Generated Text (MGT), giving rise to countless new use cases and applications. However, easy access to LLMs is posing new challenges due to misuse. To address malicious usage, researchers have released datasets to effectively train models on MGT-related tasks. Similar strategies are used to compile these datasets, but no tool currently unifies them. In this scenario, we introduce TextMachina, a modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as detection, attribution, mixcase, or boundary detection. It provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets, such as LLM integrations, prompt templating, and bias mitigation. The quality of the datasets generated by TextMachina has been assessed in previous works, including shared tasks where more than one hundred teams trained robust MGT detectors.
Related papers
- On the Generalization Ability of Machine-Generated Text Detectors [23.434925348283617]
Large language models (LLMs) has raised concerns about machine-generated text (MGT)
This work investigates the generalization capabilities of MGT detectors in three aspects.
arXiv Detail & Related papers (2024-12-23T03:30:34Z) - GigaCheck: Detecting LLM-generated Content [72.27323884094953]
In this work, we investigate the task of generated text detection by proposing the GigaCheck.
Our research explores two approaches: (i) distinguishing human-written texts from LLM-generated ones, and (ii) detecting LLM-generated intervals in Human-Machine collaborative texts.
Specifically, we use a fine-tuned general-purpose LLM in conjunction with a DETR-like detection model, adapted from computer vision, to localize AI-generated intervals within text.
arXiv Detail & Related papers (2024-10-31T08:30:55Z) - A Lightweight Multi Aspect Controlled Text Generation Solution For Large Language Models [12.572046828830699]
Large language models (LLMs) show remarkable abilities with instruction tuning.
They fail to achieve ideal tasks when lacking high-quality instruction tuning data on target tasks.
arXiv Detail & Related papers (2024-10-18T03:32:00Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Synthetic Multimodal Question Generation [60.33494376081317]
Multimodal Retrieval Augmented Generation (MMRAG) is a powerful approach to question-answering over multimodal documents.
We propose SMMQG, a synthetic data generation framework that generates question and answer pairs directly from multimodal documents.
We use SMMQG to generate an MMRAG dataset of 1024 questions over Wikipedia documents and evaluate state-of-the-art models using it.
arXiv Detail & Related papers (2024-07-02T12:57:42Z) - DCA-Bench: A Benchmark for Dataset Curation Agents [9.60250892491588]
We propose a dataset curation agent benchmark, DCA-Bench, to measure large language models' capability of detecting hidden dataset quality issues.
Specifically, we collect diverse real-world dataset quality issues from eight open dataset platforms as a testbed.
The proposed benchmark can also serve as a testbed for measuring the capability of LLMs in problem discovery rather than just problem-solving.
arXiv Detail & Related papers (2024-06-11T14:02:23Z) - Exploring the Capabilities of Large Multimodal Models on Dense Text [58.82262549456294]
We propose the DT-VQA dataset, with 170k question-answer pairs.
In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs.
We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
arXiv Detail & Related papers (2024-05-09T07:47:25Z) - 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z) - Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the textttLlama2-7b-chat model on nine different languages from the MUST-C dataset.
The results show that LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.