Distilling Named Entity Recognition Models for Endangered Species from Large Language Models
- URL: http://arxiv.org/abs/2403.15430v1
- Date: Wed, 13 Mar 2024 15:38:55 GMT
- Title: Distilling Named Entity Recognition Models for Endangered Species from Large Language Models
- Authors: Jesse Atuhurra, Seiveright Cargill Dujohn, Hidetaka Kamigaito, Hiroyuki Shindo, Taro Watanabe,
- Abstract summary: We create datasets for named entity recognition and relation extraction via a two-stage process.
The constructed dataset was then used to fine-tune both general BERT and domain-specific BERT variants.
Experiments show that our knowledge transfer approach is effective at creating a NER model suitable for detecting endangered species from texts.
- Score: 27.266030092418642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language processing (NLP) practitioners are leveraging large language models (LLM) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without having domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and through in-context learning, we distilled knowledge from GPT-4. In effect, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 of four classes of endangered species, 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. Eventually, our novel dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. The constructed dataset was then used to fine-tune both general BERT and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT, because GPT-4 is resource intensive. Experiments show that our knowledge transfer approach is effective at creating a NER model suitable for detecting endangered species from texts.
Related papers
- Evaluating Named Entity Recognition Models for Russian Cultural News Texts: From BERT to LLM [0.0]
The study utilizes the unique SPbLitGuide dataset, a collection of event announcements from Saint Petersburg spanning 1999 to 2019.<n>A comparative evaluation of diverse NER models is presented, encompassing established transformer-based architectures.<n>The research contributes to a deeper understanding of current NER model capabilities and limitations when applied to morphologically rich languages like Russian.
arXiv Detail & Related papers (2025-06-03T08:11:16Z) - Exploring a Large Language Model for Transforming Taxonomic Data into OWL: Lessons Learned and Implications for Ontology Development [63.74965026095835]
This paper investigates the use of ChatGPT-4 to automate the development of the :Organism module in the Agricultural Product Types Ontology (APTO) for species classification.
Our methodology involved leveraging ChatGPT-4 to extract data from the GBIF Backbone API and generate files for further integration in APTO.
Two alternative approaches were explored: (1) issuing a series of prompts for ChatGPT-4 to execute tasks via the BrowserOP plugin and (2) directing ChatGPT-4 to design a Python algorithm to perform tasks.
arXiv Detail & Related papers (2025-04-25T19:05:52Z) - GLiNER-BioMed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition [0.06554326244334868]
We introduce GLiNER-BioMed, a domain-adapted suite of Generalist and Lightweight Model for NER (GLiNER) models specifically tailored for biomedicine.<n>In contrast to conventional approaches, GLiNER uses natural language labels to infer arbitrary entity types, enabling zero-shot recognition.<n> Experiments on several biomedical datasets demonstrate that GLiNER-BioMed outperforms the state-of-the-art in both zero-shot scenarios.
arXiv Detail & Related papers (2025-04-01T11:40:50Z) - Fake It Till Make It: Federated Learning with Consensus-Oriented
Generation [52.82176415223988]
We propose federated learning with consensus-oriented generation (FedCOG)
FedCOG consists of two key components at the client side: complementary data generation and knowledge-distillation-based model training.
Experiments on classical and real-world FL datasets show that FedCOG consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-12-10T18:49:59Z) - GPT Struct Me: Probing GPT Models on Narrative Entity Extraction [2.049592435988883]
We evaluate the capabilities of two state-of-the-art language models -- GPT-3 and GPT-3.5 -- in the extraction of narrative entities.
This study is conducted on the Text2Story Lusa dataset, a collection of 119 Portuguese news articles.
arXiv Detail & Related papers (2023-11-24T16:19:04Z) - Relation Extraction in underexplored biomedical domains: A
diversity-optimised sampling and synthetic data generation approach [0.0]
sparsity of labelled data is an obstacle to the development of Relation Extraction models.
We create the first curated evaluation dataset and extracted literature items from the LOTUS database to build training sets.
We evaluate the performance of standard fine-tuning as a generative task and few-shot learning with open Large Language Models.
arXiv Detail & Related papers (2023-11-10T19:36:00Z) - TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z) - Large-Scale Text Analysis Using Generative Language Models: A Case Study
in Discovering Public Value Expressions in AI Patents [2.246222223318928]
This paper employs a novel approach using a generative language model (GPT-4) to produce labels and rationales for large-scale text analysis.
We collect a database comprising 154,934 patent documents using an advanced Boolean query submitted to InnovationQ+.
We design a framework for identifying and labeling public value expressions in these AI patent sentences.
arXiv Detail & Related papers (2023-05-17T17:18:26Z) - Memorization of Named Entities in Fine-tuned BERT Models [3.0177210416625115]
We investigate the extent of named entity memorization in fine-tuned BERT models.
We show that a fine-tuned BERT does not generate more named entities specific to the fine-tuning dataset than a BERT model that is pre-trained only.
arXiv Detail & Related papers (2022-12-07T16:20:50Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person
Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z) - Paradigm selection for Data Fusion of SAR and Multispectral Sentinel
data applied to Land-Cover Classification [63.072664304695465]
In this letter, four data fusion paradigms, based on Convolutional Neural Networks (CNNs) are analyzed and implemented.
The goals are to provide a systematic procedure for choosing the best data fusion framework, resulting in the best classification results.
The procedure has been validated for land-cover classification but it can be transferred to other cases.
arXiv Detail & Related papers (2021-06-18T11:36:54Z) - Fine-tuning BERT-based models for Plant Health Bulletin Classification [0.0]
French Plants Health Bulletins (BSV) give information about the development stages of phytosanitary risks in agricultural production.
They are written in natural language, thus, machines and human cannot exploit them as efficiently as it could be.
Recent advancements Bidirectional Representations from Transformers (BERT) inspire us to rethink of knowledge representation and natural language understanding in plant health management domain.
arXiv Detail & Related papers (2021-01-29T08:14:35Z) - Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation [84.64004917951547]
Fine-tuning pre-trained language models like BERT has become an effective way in NLP.
In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation.
arXiv Detail & Related papers (2020-02-24T16:17:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.