Utilising a Large Language Model to Annotate Subject Metadata: A Case
Study in an Australian National Research Data Catalogue
- URL: http://arxiv.org/abs/2310.11318v1
- Date: Tue, 17 Oct 2023 14:52:33 GMT
- Title: Utilising a Large Language Model to Annotate Subject Metadata: A Case
Study in an Australian National Research Data Catalogue
- Authors: Shiwei Zhang, Mingfang Wu, Xiuzhen Zhang
- Abstract summary: In support of open and reproducible research, a rapidly increasing number of datasets have been made available for research.
As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them.
This paper proposes to leverage large language models (LLMs) for cost-effective annotation of subject metadata through LLM-based in-context learning.
- Score: 18.325675189960833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In support of open and reproducible research, a rapidly increasing
number of datasets have been made available for research. As the availability
of datasets increases, it becomes more important to have quality metadata for
discovering and reusing them. Yet, it is a common issue that datasets often
lack quality metadata due to limited resources for data curation. Meanwhile,
technologies such as artificial intelligence and large language models (LLMs)
are progressing rapidly. Recently, systems based on these technologies, such as
ChatGPT, have demonstrated promising capabilities for certain data curation
tasks. This paper proposes to leverage LLMs for cost-effective annotation of
subject metadata through LLM-based in-context learning. Our method employs
GPT-3.5 with prompts designed for annotating subject metadata, demonstrating
promising performance in automatic metadata annotation. However, models based
on in-context learning cannot acquire discipline-specific rules, resulting in
lower performance in several categories. This limitation arises from the
limited contextual information available for subject inference. To the best of
our knowledge, this is the first in-context learning method that harnesses
large language models for automated subject metadata annotation.
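The abstract names the technique (in-context learning with GPT-3.5) but not the authors' prompts or the subject vocabulary of the catalogue. The following is a minimal sketch of what such an annotation call can look like, assuming the OpenAI chat API; the subject list, the few-shot demonstration, and the prompt wording are illustrative placeholders, not the paper's actual setup.

```python
# Minimal sketch of LLM-based in-context learning for subject metadata
# annotation. Subject labels, the demonstration record, and the prompt
# wording are illustrative assumptions, not the paper's actual prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical subject vocabulary; the paper annotates against a research
# classification scheme whose labels are not given in this summary.
SUBJECTS = [
    "Environmental Sciences",
    "Information and Computing Sciences",
    "Biological Sciences",
]

# (record, gold subject) pairs supplied as in-context demonstrations.
FEW_SHOT = [
    (
        "Title: Reef temperature logger data, 2010-2020. "
        "Description: Hourly sea-surface temperatures from 40 reef sites.",
        "Environmental Sciences",
    ),
]

def annotate_subject(record: str) -> str:
    """Ask the model to pick one subject label for a dataset record."""
    demos = "\n\n".join(
        f"Record: {rec}\nSubject: {label}" for rec, label in FEW_SHOT
    )
    prompt = (
        "Assign exactly one subject from this list: "
        + "; ".join(SUBJECTS)
        + "\n\n" + demos
        + f"\n\nRecord: {record}\nSubject:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stands in for the GPT-3.5 the paper reports
        temperature=0,          # deterministic output for annotation
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```

Pinning temperature to 0 keeps annotations reproducible, and the few-shot demonstrations are what distinguish in-context learning from the zero-shot setting. The abstract's observed weakness in some categories corresponds to labels whose discipline-specific conventions cannot be conveyed through a handful of demonstrations.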
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and their ability to perform various language tasks with minimal computational resources.
We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z) - Unsupervised Data Validation Methods for Efficient Model Training [0.0]
State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT) and vision-language models (VLM) rely heavily on large datasets.
This research explores key areas such as defining "quality data," developing methods for generating appropriate data, and improving accessibility to model training.
arXiv Detail & Related papers (2024-10-10T13:00:53Z) - DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS).
The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z) - The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z) - Zero-Shot Topic Classification of Column Headers: Leveraging LLMs for Metadata Enrichment [0.0]
We propose a method to support metadata enrichment using topic annotations generated by three Large Language Models (LLMs): ChatGPT-3.5, Google Bard, and Google Gemini.
We evaluate the impact of contextual information (i.e., dataset description) on the classification outcomes; a sketch of this with/without-context comparison follows the list below.
arXiv Detail & Related papers (2024-03-01T10:01:36Z) - Large Language Models for Data Annotation: A Survey [49.8318827245266]
The emergence of advanced Large Language Models (LLMs) presents an unprecedented opportunity to automate the complicated process of data annotation.
This survey includes an in-depth taxonomy of data types that LLMs can annotate, a review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation.
arXiv Detail & Related papers (2024-02-21T00:44:04Z) - Query of CC: Unearthing Large Scale Domain-Specific Knowledge from
Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but also unearths data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step, explain-then-annotate approach; a sketch of this pattern also follows the list below.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - Metadata Shaping: Natural Language Annotations for the Tail [4.665656172490747]
Language models (LMs) have made remarkable progress, but still struggle to generalize beyond the training data to rare linguistic patterns.
We propose metadata shaping, a method in which readily available metadata, such as entity descriptions and categorical tags, are appended to examples based on information theoretic metrics.
With no changes to the LM whatsoever, metadata shaping exceeds the BERT baseline by up to 5.3 F1 points, and achieves or competes with state-of-the-art results.
arXiv Detail & Related papers (2021-10-16T01:00:47Z)