Evaluation of the Automated Labeling Method for Taxonomic Nomenclature Through Prompt-Optimized Large Language Model
- URL: http://arxiv.org/abs/2503.10662v1
- Date: Sat, 08 Mar 2025 23:11:43 GMT
- Title: Evaluation of the Automated Labeling Method for Taxonomic Nomenclature Through Prompt-Optimized Large Language Model
- Authors: Keito Inoshita, Kota Nojiri, Haruto Sugeno, Takumi Taga,
- Abstract summary: This study evaluates the feasibility of automatic species name labeling using large language models (LLMs). The results indicate that LLM-based classification achieved high accuracy in the Morphology, Geography, and People categories. Future research will focus on improving accuracy through optimized few-shot learning and retrieval-augmented generation techniques.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scientific names of organisms consist of a genus name and a species epithet, with the latter often reflecting aspects such as morphology, ecology, distribution, and cultural background. Traditionally, researchers have manually labeled species names by carefully examining taxonomic descriptions, a process that demands substantial time and effort when dealing with large datasets. This study evaluates the feasibility of automatic species name labeling using large language models (LLMs) by leveraging their text classification and semantic extraction capabilities. Using the spider name dataset compiled by Mammola et al., we compared LLM-based labeling results, enhanced through prompt engineering, with human annotations. The results indicate that LLM-based classification achieved high accuracy in the Morphology, Geography, and People categories. However, classification accuracy was lower in Ecology & Behavior and Modern & Past Culture, revealing challenges in interpreting animal behavior and cultural contexts. Future research will focus on improving accuracy through optimized few-shot learning and retrieval-augmented generation techniques, while also expanding the applicability of LLM-based labeling to diverse biological taxa.
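As a rough illustration of the labeling setup the abstract describes, the sketch below sends a species epithet and its taxonomic description to an LLM and asks for one of the five categories used in the study. The prompt wording, the `gpt-4o-mini` model name, and the `classify_epithet` helper are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of prompt-based species-epithet labeling.
# Model choice, prompt text, and helper names are assumptions, not the paper's setup.
from openai import OpenAI

CATEGORIES = [
    "Morphology",
    "Ecology & Behavior",
    "Geography",
    "People",
    "Modern & Past Culture",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_epithet(genus: str, epithet: str, description: str) -> str:
    """Ask the LLM to assign one etymology category to a species epithet."""
    prompt = (
        "You label the etymology of spider species epithets.\n"
        f"Genus: {genus}\nEpithet: {epithet}\n"
        f"Taxonomic description: {description}\n"
        f"Answer with exactly one of: {', '.join(CATEGORIES)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels for evaluation
    )
    answer = response.choices[0].message.content.strip()
    # Fall back to a sentinel if the model answers outside the label set.
    return answer if answer in CATEGORIES else "Unknown"


# Example: label one epithet, then compare against the human annotation.
print(classify_epithet("Araneus", "diadematus", "Bears a cross-shaped dorsal marking."))
```

In an evaluation like the one described, the predicted category for each epithet would then be compared against the human annotation to compute per-category accuracy.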
Related papers
- A novel approach to navigate the taxonomic hierarchy to address the Open-World Scenarios in Medicinal Plant Classification [0.0]
It is observed that existing methods for medicinal plant classification often fail to perform hierarchical classification and to accurately identify unknown species. We propose a novel method that integrates DenseNet121, Multi-Scale Self-Attention (MSSA), and cascaded classifiers for hierarchical classification. Our proposed model is almost four times smaller than existing state-of-the-art methods, making it easy to deploy in real-world applications.
arXiv Detail & Related papers (2025-02-24T16:20:25Z) - Can Large Language Models Serve as Effective Classifiers for Hierarchical Multi-Label Classification of Scientific Documents at Industrial Scale? [1.0562108865927007]
Large Language Models (LLMs) have demonstrated great potential in complex tasks such as multi-label classification. We present methods that combine the strengths of LLMs with dense retrieval techniques to overcome these challenges. We evaluate the effectiveness of our methods on SSRN, a large repository of preprints spanning multiple disciplines.
arXiv Detail & Related papers (2024-12-06T15:51:22Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - WhaleNet: a Novel Deep Learning Architecture for Marine Mammals Vocalizations on Watkins Marine Mammal Sound Database [49.1574468325115]
We introduce WhaleNet (Wavelet Highly Adaptive Learning Ensemble Network), a sophisticated deep ensemble architecture for the classification of marine mammal vocalizations.
We achieve an improvement in classification accuracy of 8-10% over existing architectures, corresponding to a classification accuracy of 97.61%.
arXiv Detail & Related papers (2024-02-20T11:36:23Z) - Understanding Survey Paper Taxonomy about Large Language Models via
Graph Representation Learning [2.88268082568407]
We develop a method to automatically assign survey papers to a taxonomy.
Our work indicates that leveraging graph structure information on co-category graphs can significantly outperform the language models.
arXiv Detail & Related papers (2024-02-16T02:21:59Z) - TEPI: Taxonomy-aware Embedding and Pseudo-Imaging for Scarcely-labeled
Zero-shot Genome Classification [0.0]
A species' genetic code or genome encodes valuable evolutionary, biological, and phylogenetic information.
Traditional bioinformatics tools have made notable progress but lack scalability and are computationally expensive.
We propose addressing this problem through zero-shot learning using TEPI, taxonomy-aware Embedding and Pseudo-Imaging.
arXiv Detail & Related papers (2024-01-24T04:16:28Z) - A Saliency-based Clustering Framework for Identifying Aberrant
Predictions [49.1574468325115]
We introduce the concept of aberrant predictions, emphasizing that the nature of classification errors is as critical as their frequency.
We propose a novel, efficient training methodology aimed at both reducing the misclassification rate and discerning aberrant predictions.
We apply this methodology to the less-explored domain of veterinary radiology, where the stakes are high but have not been as extensively studied compared to human medicine.
arXiv Detail & Related papers (2023-11-11T01:53:59Z) - Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data.
Meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
arXiv Detail & Related papers (2020-10-07T22:29:05Z) - Knowledge Elicitation using Deep Metric Learning and Psychometric
Testing [15.989397781243225]
We provide a method for efficient hierarchical knowledge elicitation from experts working with high-dimensional data such as images or videos.
The developed models embed the high-dimensional data in a metric space where distances are semantically meaningful, and the data can be organized in a hierarchical structure.
arXiv Detail & Related papers (2020-04-14T08:33:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.