Automated Annotation of Scientific Texts for ML-based Keyphrase
Extraction and Validation
- URL: http://arxiv.org/abs/2311.05042v1
- Date: Wed, 8 Nov 2023 22:09:31 GMT
- Title: Automated Annotation of Scientific Texts for ML-based Keyphrase
Extraction and Validation
- Authors: Oluwamayowa O. Amusat, Harshad Hegde, Christopher J. Mungall, Anna
Giannakou, Neil P. Byers, Dan Gunter, Kjiersten Fagnan and Lavanya
Ramakrishnan
- Abstract summary: We present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts.
Our techniques show the potential of two new ways to leverage existing information about the unlabeled texts and the scientific domain.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advanced omics technologies and facilities generate a wealth of valuable data
daily; however, the data often lack the essential metadata required for
researchers to find and search them effectively. The lack of metadata poses a
significant challenge in the utilization of these datasets. Machine
learning-based metadata extraction techniques have emerged as a potentially
viable approach to automatically annotating scientific datasets with the
metadata necessary for enabling effective search. Text labeling, usually
performed manually, plays a crucial role in validating machine-extracted
metadata. However, manual labeling is time-consuming; thus, there is a need to
develop automated text labeling techniques in order to accelerate the process
of scientific innovation. This need is particularly urgent in fields such as
environmental genomics and microbiome science, which have historically received
less attention in terms of metadata curation and creation of gold-standard text
mining datasets.
In this paper, we present two novel automated text labeling approaches for
the validation of ML-generated metadata for unlabeled texts, with specific
applications in environmental genomics. Our techniques show the potential of
two new ways to leverage existing information about the unlabeled texts and the
scientific domain. The first technique exploits relationships between different
types of data sources related to the same research study, such as publications
and proposals. The second technique takes advantage of domain-specific
controlled vocabularies or ontologies. In this paper, we detail applying these
approaches for ML-generated metadata validation. Our results show that the
proposed label assignment approaches can generate both generic and
highly-specific text labels for the unlabeled texts, with up to 44% of the
labels matching those suggested by an ML keyword extraction algorithm.
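The second technique and the reported match rate can be illustrated with a minimal sketch. It assumes that labels are assigned by whole-word matching of controlled-vocabulary terms against the text, and that a label "matches" when it is identical (case-insensitively) to an ML-extracted keyword. The abstract does not state the paper's actual matching criteria, so both choices are assumptions; the function names, the sample text, the vocabulary, and the keyword list are all hypothetical.

```python
import re


def vocabulary_labels(text, vocabulary):
    """Return the vocabulary terms that occur in the text as whole words or phrases."""
    lowered = text.lower()
    found = set()
    for term in vocabulary:
        # Word boundaries prevent partial hits such as "soil" inside "soiled".
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            found.add(term.lower())
    return found


def match_fraction(labels, ml_keywords):
    """Fraction of generated labels that also appear among the ML-extracted keywords."""
    if not labels:
        return 0.0
    keywords = {k.lower() for k in ml_keywords}
    return len(labels & keywords) / len(labels)


# Hypothetical inputs, invented for illustration only (not data from the paper).
text = "We sampled soil microbial communities to study nitrogen fixation in wetland sediment."
vocabulary = ["soil", "wetland", "nitrogen fixation", "marine sediment"]
ml_keywords = ["nitrogen fixation", "microbial communities", "wetland"]

labels = vocabulary_labels(text, vocabulary)
score = match_fraction(labels, ml_keywords)
print(sorted(labels), round(score, 2))
```

Exact string matching is the simplest possible criterion; a real pipeline would likely also need to handle synonyms, plurals, and ontology hierarchy, none of which are modeled here.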
Related papers
- TnT-LLM: Text Mining at Scale with Large Language Models [24.731544646232962]
Large Language Models (LLMs) automate the process of end-to-end label generation and assignment with minimal human effort.
We show that TnT-LLM generates more accurate and relevant labels when compared against state-of-the-art baselines.
We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
arXiv Detail & Related papers (2024-03-18T18:45:28Z)
- Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue [18.325675189960833]
In support of open and reproducible research, there has been a rapidly increasing number of datasets made available for research.
As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them.
This paper proposes to leverage large language models (LLMs) for cost-effective annotation of subject metadata through the LLM-based in-context learning.
arXiv Detail & Related papers (2023-10-17T14:52:33Z)
- The Effect of Metadata on Scientific Literature Tagging: A Cross-Field Cross-Model Study [29.965010251365946]
We systematically study the effect of metadata on scientific literature tagging across 19 fields.
We observe some ubiquitous patterns of metadata's effects across all fields.
arXiv Detail & Related papers (2023-02-07T09:34:41Z)
- Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z)
- Investigation on Data Adaptation Techniques for Neural Named Entity Recognition [51.88382864759973]
A common practice is to utilize large monolingual unlabeled corpora.
Another popular technique is to create synthetic data from the original labeled data.
In this work, we investigate the impact of these two methods on the performance of three different named entity recognition tasks.
arXiv Detail & Related papers (2021-10-12T11:06:03Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
- Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data.
Meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
arXiv Detail & Related papers (2020-10-07T22:29:05Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
- Text Recognition in Real Scenarios with a Few Labeled Samples [55.07859517380136]
Scene text recognition (STR) is still a hot research topic in the computer vision field.
This paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to build sequence adaptation.
Our approach can maximize the character-level confusion between the source domain and the target domain.
arXiv Detail & Related papers (2020-06-22T13:03:01Z)
- GLEAKE: Global and Local Embedding Automatic Keyphrase Extraction [1.0681288493631977]
We introduce Global and Local Embedding Automatic Keyphrase Extractor (GLEAKE) for the task of automatic keyphrase extraction.
GLEAKE uses single and multi-word embedding techniques to explore the syntactic and semantic aspects of the candidate phrases.
It refines the most significant phrases as a final set of keyphrases.
arXiv Detail & Related papers (2020-05-19T20:24:02Z)
- Minimally Supervised Categorization of Text with Metadata [40.13841133991089]
We propose MetaCat, a minimally supervised framework to categorize text with metadata.
We develop a generative process describing the relationships between words, documents, labels, and metadata.
Based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity.
arXiv Detail & Related papers (2020-05-01T21:42:32Z)