Detecting Entities in the Astrophysics Literature: A Comparison of
Word-based and Span-based Entity Recognition Methods
- URL: http://arxiv.org/abs/2211.13819v1
- Date: Thu, 24 Nov 2022 23:07:48 GMT
- Title: Detecting Entities in the Astrophysics Literature: A Comparison of
Word-based and Span-based Entity Recognition Methods
- Authors: Xiang Dai and Sarvnaz Karimi
- Abstract summary: We describe our entity recognition methods developed as part of the DEAL (Detecting Entities in the Astrophysics Literature) shared task.
The aim of the task is to build a system that can identify Named Entities in a dataset composed of scholarly articles from astrophysics literature.
- Score: 20.506920012146235
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Information Extraction from scientific literature can be challenging due to
the highly specialised nature of such text. We describe our entity recognition
methods developed as part of the DEAL (Detecting Entities in the Astrophysics
Literature) shared task. The aim of the task is to build a system that can
identify Named Entities in a dataset composed of scholarly articles from
astrophysics literature. We planned our participation such that it enables us
to conduct an empirical comparison between word-based tagging and span-based
classification methods. When evaluated on two hidden test sets provided by the
organizer, our best-performing submission achieved $F_1$ scores of 0.8307
(validation phase) and 0.7990 (testing phase).
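The two framings compared in the paper can be illustrated with a minimal sketch: word-based tagging assigns a BIO label to every token and decodes contiguous spans afterwards, while span-based classification enumerates candidate spans and classifies each one directly. The toy tag set, entity type, and function names below are illustrative assumptions, not taken from the paper's system.

```python
# Word-based view: each token carries a BIO tag; entities are decoded
# from runs of B-/I- tags. (Toy example, not the paper's implementation.)
def bio_decode(tags):
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue
        else:
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:
        spans.append((start, len(tags), label))
    return spans

# Span-based view: every candidate span up to a maximum length becomes
# an instance for a classifier (which would also see a "not an entity" class).
def enumerate_spans(tokens, max_len=3):
    return [(i, j) for i in range(len(tokens))
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1)]

tokens = ["the", "Hubble", "Space", "Telescope", "observed"]
tags = ["O", "B-Telescope", "I-Telescope", "I-Telescope", "O"]
print(bio_decode(tags))              # [(1, 4, 'Telescope')]
print(len(enumerate_spans(tokens)))  # 12 candidate spans to classify
```

The sketch also shows the practical trade-off the comparison probes: the word-based model makes one decision per token, while the span-based model must score a number of candidates that grows with sentence length and the maximum span width.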
Related papers
- Science Hierarchography: Hierarchical Organization of Science Literature [20.182213614072836]
We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure.
We develop a range of algorithms to achieve the goals of SCIENCE HIERARCHOGRAPHY.
Results show that this structured approach enhances interpretability, supports trend discovery, and offers an alternative pathway for exploring scientific literature beyond traditional search methods.
arXiv Detail & Related papers (2025-04-18T17:59:29Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - pathfinder: A Semantic Framework for Literature Review and Knowledge Discovery in Astronomy [2.6952253149772996]
Pathfinder is a machine learning framework designed to enable literature review and knowledge discovery in astronomy.
Our framework couples advanced retrieval techniques with LLM-based synthesis to search astronomical literature by semantic context.
It addresses complexities of jargon, named entities, and temporal aspects through time-based and citation-based weighting schemes.
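The abstract mentions time-based and citation-based weighting but does not give the formula, so the combination below (exponential recency decay times a logarithmic citation boost applied to a semantic similarity score) is purely an assumed sketch of how such a ranking could look, not pathfinder's actual scheme.

```python
import math

# Hypothetical ranking score: semantic similarity reweighted by paper age
# and citation count. The decay rate and the log1p boost are assumptions.
def rank_score(similarity, age_years, citations, decay=0.1):
    recency = math.exp(-decay * age_years)       # newer papers score higher
    cite_boost = 1.0 + math.log1p(citations)     # diminishing returns on citations
    return similarity * recency * cite_boost
```

Under this toy formula, a fresh uncited paper keeps its raw similarity score, while age discounts it smoothly and citations compensate sublinearly.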
arXiv Detail & Related papers (2024-08-02T20:05:24Z) - Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]
We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, and XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z) - Persian Homograph Disambiguation: Leveraging ParsBERT for Enhanced Sentence Understanding with a Novel Word Disambiguation Dataset [0.0]
We introduce a novel dataset tailored for Persian homograph disambiguation.
Our work encompasses a thorough exploration of various embeddings, evaluated through the cosine similarity method.
We scrutinize the models' performance in terms of Accuracy, Recall, and F1 Score.
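The cosine-similarity evaluation mentioned above can be sketched in a few lines. The vectors here are toy stand-ins, not the ParsBERT embeddings used in the paper.

```python
import math

# Cosine similarity between two embedding vectors: the dot product
# normalized by both vector lengths, giving a value in [-1, 1].
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for two senses of a homograph.
sense_a = [0.9, 0.1, 0.0]
sense_b = [0.8, 0.2, 0.1]
print(round(cosine_similarity(sense_a, sense_b), 3))
```

Comparing a homograph's contextual embedding against each sense embedding this way lets the highest-similarity sense be selected as the disambiguated reading.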
arXiv Detail & Related papers (2024-05-24T14:56:36Z) - Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system that merges the output of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - TERMinator: A system for scientific texts processing [0.0]
This paper is devoted to the extraction of entities and semantic relations between them from scientific texts.
We present a dataset that includes annotations for two tasks and develop a system called TERMinator for the study of the influence of language models on term recognition.
arXiv Detail & Related papers (2022-09-29T15:14:42Z) - CitationIE: Leveraging the Citation Graph for Scientific Information
Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z) - Semantic Analysis for Automated Evaluation of the Potential Impact of
Research Articles [62.997667081978825]
This paper presents a novel method for vector representation of text meaning based on information theory.
We show how this informational semantics is used for text classification on the basis of the Leicester Scientific Corpus.
We show that an informational approach to representing the meaning of a text has offered a way to effectively predict the scientific impact of research papers.
arXiv Detail & Related papers (2021-04-26T20:37:13Z) - On the Impact of Knowledge-based Linguistic Annotations in the Quality
of Scientific Embeddings [0.0]
We conduct a study on the use of explicit linguistic annotations to generate embeddings from a scientific corpus.
Our results show how the effect of such annotations in the embeddings varies depending on the evaluation task.
In general, we observe that learning embeddings using linguistic annotations contributes to achieve better evaluation results.
arXiv Detail & Related papers (2021-04-13T13:51:22Z) - The STEM-ECR Dataset: Grounding Scientific Entity References in STEM
Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources [8.54082916181163]
The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks.
It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform.
arXiv Detail & Related papers (2020-03-02T16:35:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.