A large-scale, unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction
- URL: http://arxiv.org/abs/2510.12306v1
- Date: Tue, 14 Oct 2025 09:06:14 GMT
- Title: A large-scale, unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction
- Authors: Cameron Morin, Matti Marttinen Larsson
- Abstract summary: We present a scalable, unsupervised pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. Using GPT-5 through the OpenAI API, we annotate 143,933 sentences from the Corpus of Historical American English (COHA) in under 60 hours, achieving 98%+ accuracy on two sophisticated annotation procedures.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable, unsupervised pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline's accessibility and effectiveness through a diachronic case study of variation in the English consider construction. Using GPT-5 through the OpenAI API, we annotate 143,933 sentences from the Corpus of Historical American English (COHA) in under 60 hours, achieving 98%+ accuracy on two sophisticated annotation procedures. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, opening new possibilities for corpus-based research, though implementation requires attention to costs, licensing, and other ethical considerations.
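The four-phase workflow described in the abstract can be sketched in miniature. This is an illustrative reconstruction, not the authors' code: the function names, the label set (ZERO / AS / TO_BE for the complement of *consider*), and the heuristic stub standing in for the GPT-5 call are all assumptions for the sake of a runnable example.

```python
def build_prompt(sentence: str) -> str:
    """Phase 1 (prompt engineering): wrap the sentence in the
    annotation instructions. Simplified illustrative prompt."""
    return (
        "Label the complement type of 'consider' in the sentence "
        "as one of: ZERO, AS, TO_BE.\nSentence: " + sentence
    )

def annotate(sentence: str) -> str:
    """Phase 3 (automated batch processing) stand-in: a real run
    would send build_prompt(sentence) to the OpenAI API (the paper
    uses GPT-5) and parse the reply. Here a trivial heuristic
    substitutes so the pipeline structure is runnable offline."""
    return "AS" if " as " in sentence else "ZERO"

def accuracy(gold: list[str], pred: list[str]) -> float:
    """Phases 2 and 4 (pre-hoc evaluation and post-hoc validation):
    compare model labels against a hand-coded sample before and
    after the batch run."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

sample = [
    ("They consider him as a friend.", "AS"),
    ("We consider the matter closed.", "ZERO"),
]
preds = [annotate(s) for s, _ in sample]
print(accuracy([g for _, g in sample], preds))  # → 1.0 on this toy sample
```

In the paper's actual setting, the annotate step would be submitted as an API batch job over the 143,933 COHA sentences, with the accuracy check run on held-out human-coded subsets before and after.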
Related papers
- LATA: A Tool for LLM-Assisted Translation Annotation [0.0]
This paper introduces a novel, LLM-assisted interactive tool to reduce the gap between automation and the rigorous precision required for expert human judgment. Unlike traditional statistical approaches, our system employs a template-based Prompt Manager that leverages large language models (LLMs) for sentence segmentation and alignment.
arXiv Detail & Related papers (2026-02-11T02:49:01Z) - When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages [9.138590152838754]
Segment-level quality estimation (QE) is a challenging cross-lingual language understanding task. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios. Our results indicate that prompt-based approaches are outperformed by the encoder-based fine-tuned QE models.
arXiv Detail & Related papers (2025-01-08T12:54:05Z) - Large corpora and large language models: a replicable method for automating grammatical annotation [0.0]
We introduce a methodological pipeline applied to the case study of formal variation in the English evaluative verb construction 'consider X (as) (to be) Y'. We reach a model accuracy of over 90% on our held-out test samples with only a small amount of training data. We discuss the generalisability of our results for a wider range of case studies of grammatical constructions and grammatical variation and change.
arXiv Detail & Related papers (2024-11-18T03:29:48Z) - TasTe: Teaching Large Language Models to Translate through Self-Reflection [82.83958470745381]
Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks.
We propose the TasTe framework, which stands for translating through self-reflection.
The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods.
arXiv Detail & Related papers (2024-06-12T17:21:21Z) - Human-in-the-loop Machine Translation with Large Language Model [44.86068991765771]
Large language model (LLM) has garnered significant attention due to its in-context learning mechanisms and emergent capabilities.
We propose a human-in-the-loop pipeline that guides LLMs to produce customized outputs with revision instructions.
We evaluate the proposed pipeline using GPT-3.5-turbo API on five domain-specific benchmarks for German-English translation.
arXiv Detail & Related papers (2023-10-13T07:30:27Z) - Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences.
arXiv Detail & Related papers (2023-08-23T12:36:57Z) - The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z) - Assessing the potential of LLM-assisted annotation for corpus-based pragmatics and discourse analysis: The case of apology [9.941695905504282]
This study explores the possibility of using large language models (LLMs) to automate pragma-discursive corpus annotation. We find that GPT-4 outperformed GPT-3.5, with accuracy approaching that of a human coder.
arXiv Detail & Related papers (2023-05-15T04:10:13Z) - Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks.
We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking.
We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
arXiv Detail & Related papers (2022-11-15T01:13:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.