Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations
- URL: http://arxiv.org/abs/2210.07586v4
- Date: Thu, 1 Jun 2023 06:26:46 GMT
- Title: Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations
- Authors: Hyunjae Kim, Jaehyo Yoo, Seunghyun Yoon, Jaewoo Kang
- Abstract summary: Most weakly supervised named entity recognition models rely on domain-specific dictionaries provided by experts.
We present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries.
We demonstrate that HighGEN outperforms the previous best model by an average F1 score of 4.7 across five NER benchmark datasets.
- Score: 20.00016240535205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most weakly supervised named entity recognition (NER) models rely on
domain-specific dictionaries provided by experts. This approach is infeasible
in many domains where dictionaries do not exist. While a phrase retrieval model
was used to construct pseudo-dictionaries with entities retrieved from
Wikipedia automatically in a recent study, these dictionaries often have
limited coverage because the retriever is likely to retrieve popular entities
rather than rare ones. In this study, we present a novel framework, HighGEN,
that generates NER datasets with high-coverage pseudo-dictionaries.
Specifically, we create entity-rich dictionaries with a novel search method,
called phrase embedding search, which encourages the retriever to search a
space densely populated with various entities. In addition, we use a new
verification process based on the embedding distance between candidate entity
mentions and entity types to reduce the false-positive noise in weak labels
generated by high-coverage dictionaries. We demonstrate that HighGEN
outperforms the previous best model by an average F1 score of 4.7 across five
NER benchmark datasets.
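A minimal sketch of the two ideas above, under stated assumptions: an off-the-shelf sentence encoder ("all-MiniLM-L6-v2") stands in for the paper's phrase encoder, a toy phrase list stands in for a Wikipedia-scale phrase index, and the verification threshold is illustrative rather than the authors' setting.

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in encoder

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# 1) Phrase embedding search: query a space densely populated with candidate
#    phrases instead of retrieving full passages about popular entities.
phrases = ["aspirin", "ibuprofen", "New York", "lung cancer", "diabetes"]
phrase_vecs = normalize(encoder.encode(phrases)).astype("float32")
index = faiss.IndexFlatIP(phrase_vecs.shape[1])  # inner product = cosine after normalization
index.add(phrase_vecs)

type_query = normalize(encoder.encode(["drug"])).astype("float32")
_, ids = index.search(type_query, 3)
pseudo_dictionary = [phrases[i] for i in ids[0]]

# 2) Verification: keep a weakly labeled mention only if its embedding is
#    close to the entity type's embedding (threshold is an assumption).
def verify(mentions, entity_type, threshold=0.3):
    type_vec = normalize(encoder.encode([entity_type]))[0]
    mention_vecs = normalize(encoder.encode(mentions))
    return [m for m, v in zip(mentions, mention_vecs) if v @ type_vec >= threshold]

print(pseudo_dictionary)
print(verify(pseudo_dictionary, "drug"))
```

Normalizing the embeddings first makes the inner-product index equivalent to cosine search, which is also the distance the verification step uses here.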
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared with state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
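A hedged sketch of the retrieval-as-generation idea: a small seq2seq model generates a candidate relation path conditioned on the question. The "t5-base" checkpoint (roughly 220M parameters) and the prompt format are assumptions, not necessarily the paper's exact configuration.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-base")        # assumed ~220M-parameter model
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

question = "What movies did the director of Inception also direct?"
inputs = tok("retrieve subgraph: " + question, return_tensors="pt")
# Beam search yields several candidate paths; grounding them against the KG
# would filter out hallucinated relations before handing off to an LLM reader.
outputs = model.generate(**inputs, num_beams=4, num_return_sequences=4, max_new_tokens=48)
for o in outputs:
    print(tok.decode(o, skip_special_tokens=True))
```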
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
- Large-Scale Label Interpretation Learning for Few-Shot Named Entity Recognition [5.262708162539423]
Few-shot named entity recognition (NER) detects named entities within text using only a few examples.
One promising line of research is to leverage natural language descriptions of each entity type.
In this paper, we explore the impact of a strong semantic prior for interpreting verbalizations of new entity types.
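An illustrative sketch (not the paper's model) of using type verbalizations as a semantic prior: a mention is assigned the entity type whose natural-language description it sits closest to in embedding space. The encoder and the descriptions are assumptions.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in encoder

# Hypothetical verbalizations of entity types
type_descriptions = {
    "person": "the name of a human being",
    "location": "the name of a geographical place",
    "organization": "the name of a company, institution, or agency",
}
mention = "Aragorn"  # a mention extracted from text
scores = {
    t: float(util.cos_sim(encoder.encode(mention), encoder.encode(desc)))
    for t, desc in type_descriptions.items()
}
print(max(scores, key=scores.get))  # closest type description wins
```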
arXiv Detail & Related papers (2024-03-21T08:22:44Z)
- Revisiting Sparse Retrieval for Few-shot Entity Linking [33.15662306409253]
We propose an ELECTRA-based keyword extractor to denoise the mention context and construct a better query expression.
For training the extractor, we propose a distant supervision method to automatically generate training data based on overlapping tokens between mention contexts and entity descriptions.
Experimental results on the ZESHEL dataset demonstrate that the proposed method outperforms state-of-the-art models by a significant margin across all test domains.
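A sketch of the distant-supervision labeling idea, with naive whitespace tokenization: context tokens that also appear in the entity description are marked as keywords, yielding weak training labels for the extractor.

```python
# Context tokens overlapping the entity description become positive labels.
def weak_keyword_labels(mention_context: str, entity_description: str):
    desc_tokens = {t.lower().strip(".,") for t in entity_description.split()}
    labels = []
    for token in mention_context.split():
        is_keyword = token.lower().strip(".,") in desc_tokens
        labels.append((token, int(is_keyword)))
    return labels

context = "The striker joined Arsenal from the French club last summer."
description = "Arsenal is an English football club based in London."
print(weak_keyword_labels(context, description))
```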
arXiv Detail & Related papers (2023-10-19T03:51:10Z)
- Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset [6.633914491587503]
We propose to generate a synthetic context retrieval training dataset using Alpaca.
Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER.
We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
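A sketch of the retrieval step only, with an off-the-shelf cross-encoder checkpoint as an assumed stand-in for the BERT-based context retriever trained on the synthetic data.

```python
from sentence_transformers import CrossEncoder

# Assumed stand-in checkpoint; the paper trains its own retriever.
retriever = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

sentence = '"You shall not pass!" he cried, raising his staff.'
candidates = [
    "Gandalf stood alone on the bridge, facing the Balrog.",
    "The hobbits ate a quiet breakfast in the garden.",
]
scores = retriever.predict([(sentence, c) for c in candidates])
print(max(zip(scores, candidates)))  # highest-scoring context for the sentence
```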
arXiv Detail & Related papers (2023-10-16T06:53:12Z)
- PromptNER: A Prompting Method for Few-shot Named Entity Recognition via k Nearest Neighbor Search [56.81939214465558]
We propose PromptNER: a novel prompting method for few-shot NER via k nearest neighbor search.
We use prompts that contain entity category information to construct label prototypes, which enables our model to fine-tune with only the support set.
Our approach achieves excellent transfer learning ability, and extensive experiments on the Few-NERD and CrossNER datasets demonstrate that our model achieves superior performance over state-of-the-art methods.
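A minimal sketch of the prototype-based nearest-neighbor idea: one prototype vector per entity category is built from a category-bearing prompt, and a candidate span is assigned to its nearest prototype. The encoder and the prompt wording are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in encoder

# Hypothetical prompts carrying entity category information
prompts = {
    "PER": "This entity is a person.",
    "LOC": "This entity is a location.",
    "ORG": "This entity is an organization.",
}
labels = list(prompts)
prototypes = encoder.encode([prompts[l] for l in labels])

def nearest_label(span: str) -> str:
    v = encoder.encode(span)
    sims = prototypes @ v / (np.linalg.norm(prototypes, axis=1) * np.linalg.norm(v))
    return labels[int(np.argmax(sims))]

print(nearest_label("Mount Everest"))  # expected: LOC
```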
arXiv Detail & Related papers (2023-05-20T15:47:59Z)
- Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
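A toy PyTorch sketch of the Siamese setup: one shared character-level LSTM encoder embeds both name strings, and cosine similarity acts as the match score. Dimensions and the character vocabulary are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class NameEncoder(nn.Module):
    def __init__(self, vocab_size=128, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, char_ids):
        _, (h, _) = self.lstm(self.embed(char_ids))
        return h[-1]  # final hidden state as the name embedding

def encode_name(s: str) -> torch.Tensor:
    return torch.tensor([[min(ord(c), 127) for c in s]])  # crude ASCII vocab

encoder = NameEncoder()  # shared weights: the "Siamese" part
a = encoder(encode_name("Apple Inc."))
b = encoder(encode_name("Apple Incorporated"))
print(torch.cosine_similarity(a, b).item())  # untrained, so the score is arbitrary
```

In the supervised setting, pairs of strings referring to the same company would be pushed together and mismatched pairs apart, e.g. with a contrastive loss.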
arXiv Detail & Related papers (2023-03-07T15:07:57Z)
- The Fellowship of the Authors: Disambiguating Names from Social Network Context [2.3605348648054454]
Authority lists with extensive textual descriptions for each entity are often lacking, and named entity mentions are frequently ambiguous.
We combine BERT-based mention representations with a variety of graph induction strategies and experiment with supervised and unsupervised cluster inference methods.
We find that in-domain language model pretraining can significantly improve mention representations, especially for larger corpora.
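A sketch of the cluster-inference stage, with an assumed off-the-shelf encoder in place of the in-domain pretrained BERT: mention embeddings are grouped so that vectors landing in the same cluster are treated as the same entity.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in encoder

mentions = [
    "J. Smith published a paper on entity linking.",
    "John Smith's new NLP paper appeared at ACL.",
    "Smith scored twice in the final on Saturday.",
]
vecs = encoder.encode(mentions)
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit length
clusters = AgglomerativeClustering(n_clusters=2).fit_predict(vecs)
print(clusters)  # mentions sharing a cluster id are inferred to corefer
```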
arXiv Detail & Related papers (2022-08-31T21:51:55Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
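For intuition only, a sketch that interpolates a dense-vector score with a crude lexical-overlap score. Note that UnifieR itself learns both views jointly in one model rather than mixing two separate scorers; the interpolation weight here is an arbitrary assumption.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed dense scorer

def lexical_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)  # fraction of query terms present in the doc

def hybrid_scores(query: str, docs, alpha: float = 0.5):
    qv = encoder.encode(query)
    dvs = encoder.encode(docs)
    dense = dvs @ qv / (np.linalg.norm(dvs, axis=1) * np.linalg.norm(qv))
    return [alpha * dense[i] + (1 - alpha) * lexical_score(query, d)
            for i, d in enumerate(docs)]

docs = ["weak supervision for NER", "dense passage retrieval with BERT"]
print(hybrid_scores("dense retrieval", docs))
```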
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
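A sketch of the identifier scheme, with a plain dictionary standing in for the FM-index the paper uses: every n-gram of a passage points back to it, so any substring an LM generates can be looked up directly.

```python
from collections import defaultdict

def ngrams(tokens, max_n=3):
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

passages = {
    "doc1": "weak supervision builds NER datasets from dictionaries",
    "doc2": "dense retrieval finds passages for open-domain QA",
}
# Map every n-gram to the documents containing it (toy FM-index stand-in).
index = defaultdict(set)
for doc_id, text in passages.items():
    for g in ngrams(text.split()):
        index[g].add(doc_id)

# An autoregressive LM would *generate* such a string; lookup then retrieves.
print(index["dense retrieval"])  # -> {'doc2'}
```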
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- A Realistic Study of Auto-regressive Language Models for Named Entity Typing and Recognition [7.345578385749421]
We study pre-trained language models for named entity recognition in a meta-learning setup.
First, we test named entity typing (NET) in a zero-shot transfer scenario. Then, we perform NER by giving a few examples at inference.
We propose a method to select seen and rare / unseen names when having access only to the pre-trained model and report results on these groups.
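An illustrative zero-shot typing sketch in the same spirit: each candidate type is scored by the loss an autoregressive LM assigns to a verbalized statement. GPT-2 and the template are assumptions, not the paper's protocol.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")  # assumed stand-in LM
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_loss(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()  # mean token NLL

mention = "Geoffrey Hinton"
types = ["person", "location", "organization"]
# Lower loss means the LM finds the verbalized statement more plausible.
print(min(types, key=lambda t: sentence_loss(f"{mention} is a {t}.")))
```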
arXiv Detail & Related papers (2021-08-26T15:29:00Z)
- Zero-Resource Cross-Domain Named Entity Recognition [68.83177074227598]
Existing models for cross-domain named entity recognition rely on large unlabeled corpora or labeled NER training data in target domains.
We propose a cross-domain NER model that does not use any external resources.
arXiv Detail & Related papers (2020-02-14T09:04:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.