Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset
- URL: http://arxiv.org/abs/2310.10118v3
- Date: Mon, 8 Apr 2024 13:53:48 GMT
- Title: Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset
- Authors: Arthur Amalvy, Vincent Labatut, Richard Dufour
- Abstract summary: We propose to generate a synthetic context retrieval training dataset using Alpaca.
Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER.
We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
- Score: 6.633914491587503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While recent pre-trained transformer-based models can perform named entity recognition (NER) with great accuracy, their limited range remains an issue when applied to long documents such as whole novels. To alleviate this issue, a solution is to retrieve relevant context at the document level. Unfortunately, the lack of supervision for such a task means one has to settle for unsupervised approaches. Instead, we propose to generate a synthetic context retrieval training dataset using Alpaca, an instruction-tuned large language model (LLM). Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER. We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
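As a rough sketch of the neural context retriever described in the abstract, the code below scores candidate context sentences for a query sentence with a BERT cross-encoder. The base checkpoint, the single-score regression head, and the example sentences are illustrative assumptions, not the authors' released code; in the paper, such a model would be trained on the synthetic relevance examples generated with Alpaca.

```python
# Minimal cross-encoder sketch: score (sentence, candidate context) pairs
# with a BERT model that outputs a single relevance score per pair.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=1  # one ranking score per pair (untrained here)
)

def score_contexts(sentence: str, candidates: list[str]) -> list[float]:
    """Return one relevance score per candidate context sentence."""
    batch = tokenizer(
        [sentence] * len(candidates),  # pair the query with every candidate
        candidates,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        return model(**batch).logits.squeeze(-1).tolist()

# Retrieve the best context sentence for an ambiguous mention.
sentence = "She waved at Elizabeth from across the yard."
candidates = [
    "Elizabeth Bennet was the second of five sisters.",
    "The weather had been dreadful all week.",
]
scores = score_contexts(sentence, candidates)
best = max(zip(scores, candidates))[1]  # highest-scoring candidate
```

At inference time, the top-scoring sentences would be prepended to the NER model's input to extend its effective range beyond its context window.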
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Synergizing Unsupervised and Supervised Learning: A Hybrid Approach for Accurate Natural Language Task Modeling [0.0]
This paper presents a novel hybrid approach that synergizes unsupervised and supervised learning to improve the accuracy of NLP task modeling.
Our methodology integrates an unsupervised module that learns representations from unlabeled corpora and a supervised module that leverages these representations to enhance task-specific models.
By synergizing techniques, our hybrid approach achieves SOTA results on benchmark datasets, paving the way for more data-efficient and robust NLP systems.
arXiv Detail & Related papers (2024-06-03T08:31:35Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition [94.90258603217008]
The MultiCoNER II shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios.
Previous top systems in MultiCoNER I incorporate either knowledge bases or gazetteers.
We propose a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER.
arXiv Detail & Related papers (2023-05-05T16:59:26Z)
- Dynamic Named Entity Recognition [5.9401550252715865]
We introduce a new task: Dynamic Named Entity Recognition (DNER).
DNER provides a framework to better evaluate the ability of algorithms to extract entities by exploiting the context.
We evaluate baseline models and present experiments reflecting issues and research axes related to this novel task.
arXiv Detail & Related papers (2023-02-16T15:50:02Z)
- T-NER: An All-Round Python Library for Transformer-based Named Entity Recognition [9.928025283928282]
T-NER is a Python library for fine-tuning language models on NER.
We show the potential of the library by compiling nine public NER datasets into a unified format.
To facilitate future research, we also release all our LM checkpoints via the Hugging Face model hub.
arXiv Detail & Related papers (2022-09-09T15:00:38Z)
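Since the T-NER abstract says the checkpoints are released on the Hugging Face model hub, one plausible way to try them is the standard transformers pipeline; the checkpoint identifier below is an assumed example and may not match an actual released name.

```python
# Load a (hypothetical) released NER checkpoint from the Hugging Face hub.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="tner/roberta-large-ontonotes5",  # assumed checkpoint name
    aggregation_strategy="simple",  # merge sub-word pieces into entity spans
)
print(ner("Ada Lovelace was born in London."))
```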
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
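The curriculum-based self-training loop the abstract describes can be sketched as follows: pseudo-label the unlabeled pool, order examples from easy to hard, and promote only the easiest slice each round. The train, pseudo_label, and difficulty callables are hypothetical stand-ins for task-specific components, not the paper's implementation.

```python
# Self-training with a curriculum: the trusted set grows from easy to hard.
def curriculum_self_training(model, labeled, unlabeled,
                             train, pseudo_label, difficulty, rounds=3):
    for r in range(rounds):
        model = train(model, labeled)  # fit on the currently trusted set
        # Pseudo-label the pool and attach a difficulty score per example.
        scored = sorted(
            ((difficulty(model, x), x, pseudo_label(model, x)) for x in unlabeled),
            key=lambda t: t[0],  # easiest first
        )
        # Curriculum step: only the easiest slice is promoted this round.
        cutoff = len(scored) * (r + 1) // rounds
        labeled = labeled + [(x, y) for _, x, y in scored[:cutoff]]
    return model
```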
- On the Use of External Data for Spoken Named Entity Recognition [40.93448412171246]
Recent advances in self-supervised speech representations have made it feasible to consider learning models with limited labeled data.
We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches.
arXiv Detail & Related papers (2021-12-14T18:49:26Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
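The Dynamic Blocking idea named in the abstract can be sketched as a decoding-time logits penalty: whenever the decoder has just emitted a token that occurs in the source, the token immediately following it in the source is blocked at the next step, discouraging verbatim copying. This is an illustrative reconstruction from the abstract, not the authors' code.

```python
# Compute additive logit penalties implementing a simple form of
# Dynamic Blocking for the next decoding step.
import math

def dynamic_blocking_penalties(source_ids, generated_ids, vocab_size):
    penalties = [0.0] * vocab_size
    if not generated_ids:
        return penalties
    last = generated_ids[-1]
    for i, tok in enumerate(source_ids[:-1]):
        if tok == last:  # we just copied a source token...
            penalties[source_ids[i + 1]] = -math.inf  # ...block its successor
    return penalties

# Usage: logits[t] += penalties[t] for every vocabulary index t before
# sampling the next token.
```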
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts.
Our solution also incorporates metadata explicitly rather than merely appending it to the text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)