Effect of forename string on author name disambiguation
- URL: http://arxiv.org/abs/2102.03250v1
- Date: Fri, 5 Feb 2021 15:54:11 GMT
- Title: Effect of forename string on author name disambiguation
- Authors: Jinseok Kim and Jenna Kim
- Abstract summary: Author forenames are used to decide which name instances are disambiguated together and how likely they are to refer to the same author.
This study assesses the contributions of forenames in author name disambiguation using multiple labeled datasets.
- Score: 8.160343645537106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In author name disambiguation, author forenames are used to decide which name
instances are disambiguated together and how likely they are to refer to the
same author. Despite this crucial role, the effect of forenames on the
performance of heuristic (string-matching) and algorithmic disambiguation is
not well understood. This study assesses the contributions of forenames in
author name disambiguation using multiple labeled datasets under varying ratios
and lengths of full forenames, reflecting real-world scenarios in which an
author is represented by forename variants (synonyms) and some authors share
the same forename (homonyms). Results show that increasing the ratio of full
forenames substantially improves the performance of both heuristic and
machine-learning-based disambiguation. Performance gains from algorithmic
disambiguation are pronounced when many forenames are initialized or homonyms
are prevalent; as the ratio of full forenames increases, however, those gains
become marginal compared to the performance of string matching. Using only a
small portion of each forename string does not reduce the performance of either
heuristic or algorithmic disambiguation much compared to using full-length
strings. These findings suggest practical measures, such as restoring
initialized forenames to a full-string format via record linkage, for improved
disambiguation performance.
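To illustrate the heuristic (string-matching) approach and the homonym problem discussed above, here is a minimal sketch. The records and the exact-match rule are illustrative assumptions, not the paper's evaluated procedure:

```python
# Minimal sketch of heuristic (string-matching) author name disambiguation.
# The sample records and the exact-match rule are illustrative assumptions,
# not the exact procedure evaluated in the paper.
from collections import defaultdict

def disambiguate(records):
    """Group record IDs whose (surname, forename) strings match exactly."""
    clusters = defaultdict(list)
    for rec_id, surname, forename in records:
        clusters[(surname.lower(), forename.lower())].append(rec_id)
    return dict(clusters)

# With full forenames, two distinct authors named "J. Kim" stay separate.
full = [(1, "Kim", "Jinseok"), (2, "Kim", "Jenna"), (3, "Kim", "Jinseok")]
print(disambiguate(full))
# {('kim', 'jinseok'): [1, 3], ('kim', 'jenna'): [2]}

# With initialized forenames, the same records collapse into one
# homonym cluster, conflating the two authors.
initialized = [(1, "Kim", "J."), (2, "Kim", "J."), (3, "Kim", "J.")]
print(disambiguate(initialized))
# {('kim', 'j.'): [1, 2, 3]}
```

This is why the abstract's suggestion of restoring initialized forenames to full strings (e.g. via record linkage) helps: exact string matching can only separate homonyms when the forename strings carry enough characters to differ.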
Related papers
- Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z)
- Author Name Disambiguation via Heterogeneous Network Embedding from Structural and Semantic Perspectives [13.266320447769564]
Name ambiguity is common in academic digital libraries, such as multiple authors having the same name.
The proposed method is mainly based on representation learning for heterogeneous networks and clustering.
The semantic representation is generated using NLP tools.
arXiv Detail & Related papers (2022-12-24T11:22:34Z)
- Influence Functions for Sequence Tagging Models [49.81774968547377]
We extend influence functions to trace predictions back to the training points that informed them.
We show the practical utility of segment influence by using the method to identify systematic annotation errors.
arXiv Detail & Related papers (2022-10-25T17:13:11Z)
- Text Summarization with Oracle Expectation [88.39032981994535]
Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document.
Most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy.
We propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels.
arXiv Detail & Related papers (2022-09-26T14:10:08Z)
- The Fellowship of the Authors: Disambiguating Names from Social Network Context [2.3605348648054454]
Authority lists with extensive textual descriptions for each entity are often lacking, leaving named entities ambiguous.
We combine BERT-based mention representations with a variety of graph induction strategies and experiment with supervised and unsupervised cluster inference methods.
We find that in-domain language model pretraining can significantly improve mention representations, especially for larger corpora.
arXiv Detail & Related papers (2022-08-31T21:51:55Z)
- Bib2Auth: Deep Learning Approach for Author Disambiguation using Bibliographic Data [4.817368273632451]
We propose a novel approach to link author names to their real-world entities by relying on their co-authorship pattern and area of research.
Our supervised deep learning model identifies an author by capturing their relationships with co-authors and their area of research.
Bib2Auth has shown good performance on a relatively large dataset.
arXiv Detail & Related papers (2021-07-09T12:25:11Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific words called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- Pairwise Learning for Name Disambiguation in Large-Scale Heterogeneous Academic Networks [81.00481125272098]
We introduce Multi-view Attention-based Pairwise Recurrent Neural Network (MA-PairRNN) to solve the name disambiguation problem.
MA-PairRNN combines heterogeneous graph embedding learning and pairwise similarity learning into a framework.
Results on two real-world datasets demonstrate that our framework has a significant and consistent improvement of performance on the name disambiguation task.
arXiv Detail & Related papers (2020-08-30T06:08:20Z)
- Interpretability Analysis for Named Entity Recognition to Understand System Predictions and How They Can Improve [49.878051587667244]
We examine the performance of several variants of LSTM-CRF architectures for named entity recognition.
We find that context representations do contribute to system performance, but that the main factor driving high performance is learning the name tokens themselves.
We enlist human annotators to evaluate the feasibility of inferring entity types from the context alone and find that, for the majority of errors made by the context-only system, humans are also unable to infer the entity type, though there is some room for improvement.
arXiv Detail & Related papers (2020-04-09T14:37:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.