GPTs Are Multilingual Annotators for Sequence Generation Tasks
- URL: http://arxiv.org/abs/2402.05512v1
- Date: Thu, 8 Feb 2024 09:44:02 GMT
- Title: GPTs Are Multilingual Annotators for Sequence Generation Tasks
- Authors: Juhwan Choi, Eunju Lee, Kyohoon Jin, YoungBin Kim
- Abstract summary: This study proposes an autonomous annotation method by utilizing large language models.
We demonstrate that the proposed method is not only cost-efficient but also applicable to low-resource language annotation.
- Score: 11.59128394819439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data annotation is an essential step for constructing new datasets. However,
the conventional approach of data annotation through crowdsourcing is both
time-consuming and expensive. In addition, the complexity of this process
increases when dealing with low-resource languages owing to the difference in
the language pool of crowdworkers. To address these issues, this study proposes
an autonomous annotation method by utilizing large language models, which have
been recently demonstrated to exhibit remarkable performance. Through our
experiments, we demonstrate that the proposed method is not only
cost-efficient but also applicable to low-resource language annotation.
Additionally, we constructed an image captioning dataset using our approach
and are committed to releasing this dataset for future study. We have also
opened our source code for further study and reproducibility.
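The pipeline the abstract describes (an LLM standing in for crowdworkers on a sequence generation task such as image captioning) can be sketched roughly as follows. This is a simplified illustration, not the authors' actual implementation: the `llm` callable, the prompt wording, and the record format are all hypothetical placeholders for whatever chat-completion API and schema a real pipeline would use.

```python
# Minimal sketch of LLM-based autonomous annotation for a sequence
# generation task in a target language. `llm` is a stand-in for any
# chat-completion client; swap in a real API call in practice.

def build_prompt(source_text: str, language: str) -> str:
    """Compose an annotation instruction for the LLM annotator."""
    return (
        f"You are a data annotator. Write a fluent {language} caption "
        f"for the following description:\n{source_text}"
    )

def annotate_batch(examples, language, llm):
    """Annotate each example by querying the LLM once per item."""
    dataset = []
    for ex in examples:
        prompt = build_prompt(ex, language)
        caption = llm(prompt)  # one model call per example
        dataset.append({"source": ex, "annotation": caption})
    return dataset

# Usage with a mock LLM (replace with a real client to run at scale):
mock_llm = lambda prompt: "an example caption"
data = annotate_batch(["a dog on a beach"], "Korean", mock_llm)
```

The per-item loop keeps the sketch readable; a production annotator would batch requests and handle retries, and the cost advantage the paper reports comes from these API calls replacing per-example crowdworker payments.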
Related papers
- Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language [0.0]
This study introduces a prescriptive annotation benchmark grounded in humanities research to ensure consistent, unbiased labeling of offensive language.
We contribute two newly annotated datasets that achieve higher inter-annotator agreement between human and language model (LLM) annotations.
arXiv Detail & Related papers (2024-10-17T08:10:24Z) - MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions [54.08017526771947]
Multilingual Reverse Instructions (MURI) generates high-quality instruction tuning datasets for low-resource languages.
MURI produces instruction-output pairs from existing human-written texts in low-resource languages.
Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages.
arXiv Detail & Related papers (2024-09-19T17:59:20Z) - Less for More: Enhancing Preference Learning in Generative Language Models with Automated Self-Curation of Training Corpora [4.008122785948581]
Ambiguity in language presents challenges for developing more capable language models.
We introduce a self-curation method that preprocesses annotated datasets by leveraging proxy models trained directly on these datasets.
Our method enhances preference learning by automatically detecting and removing ambiguous annotations within the dataset.
arXiv Detail & Related papers (2024-08-23T02:27:14Z) - Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z) - Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - Selective Annotation Makes Language Models Better Few-Shot Learners [97.07544941620367]
Large language models can perform in-context learning, where they learn a new task from a few task demonstrations.
This work examines the implications of in-context learning for the creation of datasets for new natural language tasks.
We propose an unsupervised, graph-based selective annotation method, vote-k, to select diverse, representative examples to annotate.
arXiv Detail & Related papers (2022-09-05T14:01:15Z) - Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z) - Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of building NER models for low-resource languages.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z) - Low resource language dataset creation, curation and classification: Setswana and Sepedi -- Extended Abstract [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We propose baselines for classification, and investigate a data augmentation approach better suited to low-resource languages.
arXiv Detail & Related papers (2020-03-30T18:03:15Z) - Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We also create a news topic classification task.
We investigate a data augmentation approach better suited to low-resource languages.
arXiv Detail & Related papers (2020-02-18T13:58:06Z)
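Several of the papers above revolve around choosing which examples to annotate; the vote-k entry in particular selects a diverse, representative subset via a similarity graph. A minimal farthest-point sketch of that idea (a simplified stand-in, not the actual vote-k algorithm, and the embeddings here are assumed inputs) looks like this:

```python
import numpy as np

def select_diverse(embeddings: np.ndarray, k: int) -> list:
    """Greedily pick k indices that are far (in cosine similarity)
    from everything already selected -- a simplified stand-in for
    graph-based selective annotation methods such as vote-k."""
    # Normalize rows so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [0]  # start from the first example for determinism
    while len(selected) < k:
        sims = normed @ normed[selected].T        # (n, |selected|) similarities
        max_sim = sims.max(axis=1)                # closeness to the chosen set
        max_sim[selected] = np.inf                # never re-pick a selected item
        selected.append(int(np.argmin(max_sim)))  # farthest-point pick
    return selected

# Four toy embeddings: items 0 and 1 are near-duplicates, so a diverse
# budget of 3 should skip one of them.
points = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]])
picks = select_diverse(points, 3)
```

The point of such methods, in an annotation-budget setting, is that the skipped near-duplicate would have added labeling cost without adding coverage.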
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.