Text classification of column headers with a controlled vocabulary:
leveraging LLMs for metadata enrichment
- URL: http://arxiv.org/abs/2403.00884v2
- Date: Tue, 5 Mar 2024 13:42:06 GMT
- Title: Text classification of column headers with a controlled vocabulary:
leveraging LLMs for metadata enrichment
- Authors: Margherita Martorana, Tobias Kuhn, Lise Stork, Jacco van Ossenbruggen
- Abstract summary: We propose a method to support metadata enrichment with topic annotations of column headers using three Large Language Models (LLMs): ChatGPT-3.5, GoogleBard and GoogleGemini.
We evaluate our approach by assessing the internal consistency of the LLMs, the inter-machine alignment, and the human-machine agreement for the topic classification task.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional dataset retrieval systems index on metadata information rather
than on the data values. They thus rely primarily on manual annotations and
high-quality metadata, processes known to be labour-intensive and challenging
to automate. We propose a method to support metadata enrichment with topic
annotations of column headers using three Large Language Models (LLMs):
ChatGPT-3.5, GoogleBard and GoogleGemini. We investigate the LLMs' ability to
classify column headers based on domain-specific topics from a controlled
vocabulary. We evaluate our approach by assessing the internal consistency of
the LLMs, the inter-machine alignment, and the human-machine agreement for the
topic classification task. Additionally, we investigate the impact of
contextual information (i.e. dataset description) on the classification
outcomes. Our results suggest that ChatGPT and GoogleGemini outperform
GoogleBard in both internal consistency and LLM-human alignment.
Interestingly, we found that context had no impact on the LLMs' performance.
This work proposes a novel approach that leverages LLMs for text classification
using a controlled topic vocabulary, which has the potential to facilitate
automated metadata enrichment, thereby enhancing dataset retrieval and the
Findability, Accessibility, Interoperability and Reusability (FAIR) of research
data on the Web.
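The abstract describes the task only at a high level, so the sketch below is a hedged illustration of how such a setup might look: an LLM is prompted to assign a column header exactly one topic from a controlled vocabulary, optionally with the dataset description as context, and agreement between two sets of labels (human or machine) is scored with Cohen's kappa. The example vocabulary, prompt wording, the `ask_llm` stub and the choice of kappa are assumptions for illustration, not the authors' exact setup.
```python
# Hedged sketch: controlled-vocabulary topic classification of column headers with an LLM.
# The vocabulary, prompt wording, ask_llm() stub and use of Cohen's kappa are illustrative
# assumptions, not the paper's exact configuration.
from collections import Counter

CONTROLLED_VOCABULARY = ["demographics", "geography", "health", "economy", "education"]

def build_prompt(header: str, dataset_description: str | None = None) -> str:
    """Ask the model to pick exactly one topic from the controlled vocabulary."""
    context = f"Dataset description: {dataset_description}\n" if dataset_description else ""
    return (
        f"{context}"
        f"Column header: {header}\n"
        f"Classify the column header into exactly one of these topics: "
        f"{', '.join(CONTROLLED_VOCABULARY)}.\n"
        f"Answer with the topic name only."
    )

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to ChatGPT-3.5, GoogleBard or GoogleGemini."""
    raise NotImplementedError("wire up the LLM API of your choice here")

def classify_header(header: str, description: str | None = None, n_runs: int = 3) -> tuple[str, float]:
    """Query the model several times; the majority vote approximates internal consistency."""
    answers = [ask_llm(build_prompt(header, description)).strip().lower() for _ in range(n_runs)]
    label, count = Counter(answers).most_common(1)[0]
    return label, count / n_runs  # chosen topic and the fraction of runs that agreed

def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators (human or machine)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0
```
In this sketch, internal consistency would be read off the repeated runs per header, inter-machine alignment by comparing the label lists produced by two LLMs, and human-machine agreement by comparing an LLM's labels against manual annotations.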
Related papers
- MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic.
Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z)
- LLM-based Semantic Augmentation for Harmful Content Detection [5.954202581988127]
This paper introduces an approach that prompts large language models to clean noisy text and provide context-rich explanations.
We evaluate on the SemEval 2024 multi-label Persuasive Meme dataset and validate on the Google Jigsaw toxic comments and Facebook hateful memes datasets.
Our results reveal that zero-shot LLM classification underperforms on these high-context tasks compared to supervised models.
arXiv Detail & Related papers (2025-04-22T02:59:03Z)
- Leveraging Retrieval Augmented Generative LLMs For Automated Metadata Description Generation to Enhance Data Catalogs [1.1957520154275776]
Data catalogs serve as repositories for organizing and accessing diverse collection of data assets.
Many data catalogs within organizations suffer from limited searchability due to inadequate metadata like asset descriptions.
This paper explores the challenges associated with metadata creation and proposes a unique prompt enrichment idea of leveraging existing metadata content.
arXiv Detail & Related papers (2025-03-12T02:33:33Z)
- Evaluating LLM Prompts for Data Augmentation in Multi-label Classification of Ecological Texts [1.565361244756411]
Large language models (LLMs) play a crucial role in natural language processing (NLP) tasks.
This study applied prompt-based data augmentation to detect mentions of green practices in Russian social media.
arXiv Detail & Related papers (2024-11-22T12:37:41Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Text Clustering as Classification with LLMs [6.030435811868953]
This study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs).
Instead of fine-tuning embedders, we propose to transform the text clustering into a classification task via LLM.
Our framework has been experimentally proven to achieve comparable or superior performance to state-of-the-art clustering methods.
arXiv Detail & Related papers (2024-09-30T16:57:34Z)
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue [18.325675189960833]
In support of open and reproducible research, there has been a rapidly increasing number of datasets made available for research.
As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them.
This paper proposes to leverage large language models (LLMs) for cost-effective annotation of subject metadata through the LLM-based in-context learning.
arXiv Detail & Related papers (2023-10-17T14:52:33Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
- Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method [35.181659789684545]
Automatic summarization generates concise summaries that contain key ideas of source documents.
References from CNN/DailyMail and BBC XSum are noisy, mainly in terms of factual hallucination and information redundancy.
We propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step.
Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L.
arXiv Detail & Related papers (2023-05-22T18:54:35Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.