ESNERA: Empirical and semantic named entity alignment for named entity dataset merging
- URL: http://arxiv.org/abs/2508.06877v1
- Date: Sat, 09 Aug 2025 08:15:26 GMT
- Title: ESNERA: Empirical and semantic named entity alignment for named entity dataset merging
- Authors: Xiaobo Zhang, Congqing He, Ying He, Jian Peng, Dajie Fu, Tien-Ping Tan,
- Abstract summary: We propose an automatic label alignment method based on label similarity.<n> Experiments are conducted in two stages: first, merging three existing NER datasets into a unified corpus with minimal impact on NER performance; second, integrating this corpus with a small-scale, self-built dataset in the financial domain.
- Score: 12.244023090963704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Named Entity Recognition (NER) is a fundamental task in natural language processing. It remains a research hotspot due to its wide applicability across domains. Although recent advances in deep learning have significantly improved NER performance, they rely heavily on large, high-quality annotated datasets. However, building these datasets is expensive and time-consuming, posing a major bottleneck for further research. Current dataset merging approaches mainly focus on strategies like manual label mapping or constructing label graphs, which lack interpretability and scalability. To address this, we propose an automatic label alignment method based on label similarity. The method combines empirical and semantic similarities, using a greedy pairwise merging strategy to unify label spaces across different datasets. Experiments are conducted in two stages: first, merging three existing NER datasets into a unified corpus with minimal impact on NER performance; second, integrating this corpus with a small-scale, self-built dataset in the financial domain. The results show that our method enables effective dataset merging and enhances NER performance in the low-resource financial domain. This study presents an efficient, interpretable, and scalable solution for integrating multi-source NER corpora.
Related papers
- Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models [64.58262227709842]
ARISE (Attention-weighted Representation with Integrated Semantic Embeddings) is presented.<n>It builds semantic-aware representations that complement the metric space of categorical data for accurate clustering.<n>Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts.
arXiv Detail & Related papers (2026-01-03T11:37:46Z) - Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs [0.09471093245585005]
Knowledge graphs (KGs) are powerful tools for representing and reasoning over structured information.<n>Current approaches may fall short in scenarios where diverse and complex contexts need to be integrated.<n>We propose a novel KG integration method consisting of label matching and triple matching.
arXiv Detail & Related papers (2025-07-20T07:46:55Z) - SEG:Seeds-Enhanced Iterative Refinement Graph Neural Network for Entity Alignment [13.487673375206276]
This paper presents a soft label propagation framework that integrates multi-source data and iterative seed enhancement.
A bidirectional weighted joint loss function is implemented, which reduces the distance between positive samples and differentially processes negative samples.
Our method outperforms existing semi-supervised approaches, as evidenced by superior results on multiple datasets.
arXiv Detail & Related papers (2024-10-28T04:50:46Z) - DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition [53.019885776033824]
We propose DynamicNER, the first NER dataset designed for Large Language Models (LLMs)-based methods with dynamic categorization.<n>The dataset is also multilingual and multi-granular, covering 8 languages and 155 entity types, with corpora spanning a diverse range of domains.<n>Experiments show that DynamicNER serves as a robust and effective benchmark for LLM-based NER methods.
arXiv Detail & Related papers (2024-09-17T09:32:12Z) - Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition [40.23783832224238]
We present B2NERD, a compact dataset designed to guide LLMs' generalization in Open NER.<n>B2NERD is refined from 54 existing English and Chinese datasets using a two-step process.<n> Comprehensive evaluation shows that B2NERD significantly enhances LLMs' Open NER capabilities.
arXiv Detail & Related papers (2024-06-17T03:57:35Z) - Named Entity Recognition via Machine Reading Comprehension: A Multi-Task
Learning Approach [50.12455129619845]
Named Entity Recognition (NER) aims to extract and classify entity mentions in the text into pre-defined types.
We propose to incorporate the label dependencies among entity types into a multi-task learning framework for better MRC-based NER.
arXiv Detail & Related papers (2023-09-20T03:15:05Z) - Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z) - Dynamic Named Entity Recognition [5.9401550252715865]
We introduce a new task: Dynamic Named Entity Recognition (DNER)
DNER provides a framework to better evaluate the ability of algorithms to extract entities by exploiting the context.
We evaluate baseline models and present experiments reflecting issues and research axes related to this novel task.
arXiv Detail & Related papers (2023-02-16T15:50:02Z) - Optimizing Bi-Encoder for Named Entity Recognition via Contrastive
Learning [80.36076044023581]
We present an efficient bi-encoder framework for named entity recognition (NER)
We frame NER as a metric learning problem that maximizes the similarity between the vector representations of an entity mention and its type.
A major challenge to this bi-encoder formulation for NER lies in separating non-entity spans from entity mentions.
arXiv Detail & Related papers (2022-08-30T23:19:04Z) - Leveraging Ensembles and Self-Supervised Learning for Fully-Unsupervised
Person Re-Identification and Text Authorship Attribution [77.85461690214551]
Learning from fully-unlabeled data is challenging in Multimedia Forensics problems, such as Person Re-Identification and Text Authorship Attribution.
Recent self-supervised learning methods have shown to be effective when dealing with fully-unlabeled data in cases where the underlying classes have significant semantic differences.
We propose a strategy to tackle Person Re-Identification and Text Authorship Attribution by enabling learning from unlabeled data even when samples from different classes are not prominently diverse.
arXiv Detail & Related papers (2022-02-07T13:08:11Z) - Named Entity Recognition without Labelled Data: A Weak Supervision
Approach [23.05371427663683]
This paper presents a simple but powerful approach to learn NER models in the absence of labelled data through weak supervision.
The approach relies on a broad spectrum of labelling functions to automatically annotate texts from the target domain.
A sequence labelling model can finally be trained on the basis of this unified annotation.
arXiv Detail & Related papers (2020-04-30T12:29:55Z) - Zero-Resource Cross-Domain Named Entity Recognition [68.83177074227598]
Existing models for cross-domain named entity recognition rely on numerous unlabeled corpus or labeled NER training data in target domains.
We propose a cross-domain NER model that does not use any external resources.
arXiv Detail & Related papers (2020-02-14T09:04:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.