Related papers: Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition

URL: http://arxiv.org/abs/2406.11192v1
Date: Mon, 17 Jun 2024 03:57:35 GMT
Title: Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition
Authors: Yuming Yang, Wantong Zhao, Caishuang Huang, Junjie Ye, Xiao Wang, Huiyuan Zheng, Yang Nan, Yuran Wang, Xueying Xu, Kaixin Huang, Yunke Zhang, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract summary: We present B2NERD, a cohesive and efficient dataset for Open NER. We detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages.
Score: 40.23783832224238
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets faces issues due to inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain generalization. To address this, we present B2NERD, a cohesive and efficient dataset for Open NER, normalized from 54 existing English or Chinese datasets using a two-step approach. First, we detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using a data pruning strategy that selects fewer samples with greater category and semantic diversity. Comprehensive evaluation shows that B2NERD significantly improves LLMs' generalization on Open NER. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages.
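The abstract's second step, pruning redundant data by keeping fewer samples with greater category and semantic diversity, can be illustrated with a minimal sketch. This is not the authors' method: the function names (`jaccard`, `prune_samples`), the per-label budget, and the use of token-set Jaccard overlap as a stand-in for semantic similarity are all assumptions made for illustration.

```python
from collections import defaultdict


def jaccard(a, b):
    """Token-set overlap; a crude, assumed proxy for semantic similarity."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prune_samples(samples, per_label_budget=2, max_similarity=0.5):
    """Greedily keep at most `per_label_budget` samples per label,
    skipping any sample too similar to one already kept for that label.

    `samples` is a list of (text, label) pairs. All parameters are
    hypothetical and not taken from the paper.
    """
    kept = []
    kept_tokens = defaultdict(list)  # label -> token sets of kept samples
    for text, label in samples:
        tokens = set(text.lower().split())
        if len(kept_tokens[label]) >= per_label_budget:
            continue  # category budget exhausted
        if any(jaccard(tokens, t) > max_similarity for t in kept_tokens[label]):
            continue  # near-duplicate of a kept sample
        kept_tokens[label].append(tokens)
        kept.append((text, label))
    return kept


samples = [
    ("Paris is the capital of France", "location"),
    ("Paris is the capital of France .", "location"),  # near-duplicate
    ("Berlin hosted the conference", "location"),
    ("Tokyo is large", "location"),  # over budget for this label
    ("Marie Curie won the Nobel Prize", "person"),
]
pruned = prune_samples(samples)
```

With these toy inputs the near-duplicate sentence and the over-budget sentence are dropped, leaving two "location" samples and one "person" sample. A realistic pipeline would likely replace the token overlap with embedding-based similarity.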

Related papers

OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages [9.114488614939619]
We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets. We standardize the original datasets into a uniform representation, map entity type names to be more consistent across corpora, and provide the collection in a structure that enables research in multilingual multi-ontology NER.
arXiv Detail & Related papers (2024-12-12T18:55:53Z)
Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning [32.62763647036567]
Few-shot named entity recognition can identify new types of named entities based on a few labeled examples. We propose Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning (MsFNER). MsFNER splits the general NER task into two stages: entity-span detection and entity classification.
arXiv Detail & Related papers (2024-04-10T12:31:09Z)
NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval [49.827932299460514]
We argue that capabilities provided by large language models are not the end of NER research, but rather an exciting beginning. We present three variants of the NER task, together with a dataset to support them. We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types.
arXiv Detail & Related papers (2023-10-22T12:23:00Z)
Taxonomy Expansion for Named Entity Recognition [65.49344005894996]
Training a Named Entity Recognition (NER) model often involves fixing a taxonomy of entity types. A simple approach is to re-annotate the entire dataset with both existing and additional entity types. We propose a novel approach called Partial Label Model (PLM) that uses only partially annotated datasets.
arXiv Detail & Related papers (2023-05-22T16:23:46Z)
GPT-NER: Named Entity Recognition via Large Language Models [58.609582116612934]
GPT-NER transforms the sequence labeling task to a generation task that can be easily adapted by Language Models. We find that GPT-NER exhibits a greater ability in the low-resource and few-shot setups, when the amount of training data is extremely scarce. This demonstrates the capabilities of GPT-NER in real-world NER applications where the number of labeled examples is limited.
arXiv Detail & Related papers (2023-04-20T16:17:26Z)
T-NER: An All-Round Python Library for Transformer-based Named Entity Recognition [9.928025283928282]
T-NER is a Python library for NER LM finetuning. We show the potential of the library by compiling nine public NER datasets into a unified format. To facilitate future research, we also release all our LM checkpoints via the Hugging Face model hub.
arXiv Detail & Related papers (2022-09-09T15:00:38Z)
Optimizing Bi-Encoder for Named Entity Recognition via Contrastive Learning [80.36076044023581]
We present an efficient bi-encoder framework for named entity recognition (NER). We frame NER as a metric learning problem that maximizes the similarity between the vector representations of an entity mention and its type. A major challenge to this bi-encoder formulation for NER lies in separating non-entity spans from entity mentions.
arXiv Detail & Related papers (2022-08-30T23:19:04Z)
MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition [15.805414696789796]
We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios.
arXiv Detail & Related papers (2022-08-30T20:45:54Z)
Simple Questions Generate Named Entity Recognition Datasets [18.743889213075274]
This work introduces an ask-to-generate approach, which automatically generates NER datasets by asking simple natural language questions. Our models largely outperform previous weakly supervised models on six NER benchmarks across four different domains. Formulating the needs of NER with natural language also allows us to build NER models for fine-grained entity types such as Award.
arXiv Detail & Related papers (2021-12-16T11:44:38Z)
Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance. Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models. In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
An Open-Source Dataset and A Multi-Task Model for Malay Named Entity Recognition [3.511753382329252]
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens). An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
arXiv Detail & Related papers (2021-09-03T03:29:25Z)
Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve model generalization in few-shot settings. We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data. We establish new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
Cascaded Models for Better Fine-Grained Named Entity Recognition [10.03287972980716]
We present a cascaded approach to labeling fine-grained NER, applying to a newly released fine-grained NER dataset. We show that performance can be improved by about 20 F1 absolute, as compared with the straightforward model built on the full fine-grained types.
arXiv Detail & Related papers (2020-09-15T18:41:29Z)
Bipartite Flat-Graph Network for Nested Named Entity Recognition [94.91507634620133]
We propose a novel bipartite flat-graph network (BiFlaG) for nested named entity recognition (NER).
arXiv Detail & Related papers (2020-05-01T15:14:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.