MetaEnhance: Metadata Quality Improvement for Electronic Theses and
Dissertations of University Libraries
- URL: http://arxiv.org/abs/2303.17661v1
- Date: Thu, 30 Mar 2023 18:56:42 GMT
- Title: MetaEnhance: Metadata Quality Improvement for Electronic Theses and
Dissertations of University Libraries
- Authors: Muntabir Hasan Choudhury, Lamia Salsabil, Himarsha R. Jayanetti, Jian
Wu, William A. Ingram, Edward A. Fox
- Abstract summary: We investigate methods to automatically detect, correct, and canonicalize scholarly metadata.
We propose MetaEnhance, a framework that utilizes state-of-the-art artificial intelligence methods to improve the quality of these fields.
- Score: 3.5761273302956282
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Metadata quality is crucial for digital objects to be discovered through
digital library interfaces. However, due to various reasons, the metadata of
digital objects often exhibits incomplete, inconsistent, and incorrect values.
We investigate methods to automatically detect, correct, and canonicalize
scholarly metadata, using seven key fields of electronic theses and
dissertations (ETDs) as a case study. We propose MetaEnhance, a framework that
utilizes state-of-the-art artificial intelligence methods to improve the
quality of these fields. To evaluate MetaEnhance, we compiled a metadata
quality evaluation benchmark containing 500 ETDs, by combining subsets sampled
using multiple criteria. We tested MetaEnhance on this benchmark and found that
the proposed methods achieved nearly perfect F1-scores in detecting errors and
F1-scores in correcting errors ranging from 0.85 to 1.00 for five of seven
fields.
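A minimal sketch of the detect-then-canonicalize pipeline the abstract describes. This is not MetaEnhance's actual implementation; the field names (`author`, `year`) and the rules are illustrative assumptions only:

```python
# Hypothetical sketch of a detect -> correct/canonicalize pass over ETD
# metadata fields. Rules and field names are illustrative, not MetaEnhance's.
import re

def detect_error(field, value):
    """Flag obviously malformed values for two sample ETD fields."""
    if field == "year":
        # A valid year is exactly four digits.
        return not re.fullmatch(r"\d{4}", value)
    if field == "author":
        # Empty or all-lowercase names are suspicious.
        return value.strip() == "" or value.islower()
    return False

def canonicalize(field, value):
    """Normalize a value into a single canonical form."""
    if field == "author":
        # "doe, jane" -> "Doe, Jane"
        return ", ".join(part.strip().title() for part in value.split(","))
    if field == "year":
        # Extract the four-digit year from strings like "May 2019".
        match = re.search(r"\d{4}", value)
        return match.group(0) if match else value
    return value.strip()

record = {"author": "doe, jane", "year": "May 2019"}
flagged = [f for f, v in record.items() if detect_error(f, v)]  # both fields
cleaned = {f: canonicalize(f, v) for f, v in record.items()}
# cleaned -> {"author": "Doe, Jane", "year": "2019"}
```

Real systems would replace these regex rules with the learned classifiers and correctors the paper evaluates, but the two-stage structure (flag, then normalize) is the same.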
Related papers
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models [2.186740861187042]
Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets.
This paper investigates the potential of large language models (LLMs) to improve adherence to metadata standards.
We conducted experiments on 200 random data records describing human samples relating to lung cancer from the NCBI BioSample repository.
arXiv Detail & Related papers (2024-04-08T22:29:53Z)
- Enhanced Meta Label Correction for Coping with Label Corruption [3.6804038214708577]
We propose an Enhanced Meta Label Correction approach, abbreviated as EMLC, for the learning-with-noisy-labels problem.
EMLC outperforms prior approaches and achieves state-of-the-art results on all standard benchmarks.
arXiv Detail & Related papers (2023-05-22T12:11:07Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset [71.93633698146002]
The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels.
This study looks at the amount of variance that can be explained in subjective ratings of speech quality from metadata and the distribution imbalances of the dataset.
arXiv Detail & Related papers (2022-09-14T00:45:49Z)
- Multimodal Approach for Metadata Extraction from German Scientific Publications [0.0]
We propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language.
We consider multiple types of input data by combining natural language processing and image vision processing.
Our model for this approach was trained on a dataset consisting of around 8800 documents and is able to obtain an overall F1-score of 0.923.
arXiv Detail & Related papers (2021-11-10T15:19:04Z)
- Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z)
- Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations [3.1354625918296612]
Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks.
Traditional sequence tagging methods mainly rely on text-based features.
We propose a conditional random field (CRF) model that combines text-based and visual features.
arXiv Detail & Related papers (2021-07-01T14:59:18Z)
- Noise-resistant Deep Metric Learning with Ranking-based Instance Selection [59.286567680389766]
We propose a noise-resistant training technique for DML, which we name Probabilistic Ranking-based Instance Selection with Memory (PRISM).
PRISM identifies noisy data in a minibatch using average similarity against image features extracted from several previous versions of the neural network.
To alleviate the high computational cost brought by the memory bank, we introduce an acceleration method that replaces individual data points with the class centers.
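The class-center acceleration described above can be sketched as follows. This is an illustrative assumption, not the authors' code; the function name `flag_noisy` and the similarity threshold are hypothetical:

```python
# Sketch of PRISM's class-center variant: mark a sample as likely noisy when
# its cosine similarity to its own class center falls below a threshold.
import numpy as np

def flag_noisy(embeddings, labels, threshold=0.5):
    """Return a boolean mask over samples; True means likely label noise."""
    # L2-normalize embeddings so dot products are cosine similarities.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # One center per class, itself normalized.
    centers = {}
    for c in np.unique(labels):
        center = embeddings[labels == c].mean(axis=0)
        centers[c] = center / np.linalg.norm(center)
    # Similarity of each sample to the center of its assigned class.
    sims = np.array([embeddings[i] @ centers[labels[i]]
                     for i in range(len(labels))])
    return sims < threshold

# A 2-D toy example: the last sample is labeled class 1 but sits in class 0.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]])
labels = np.array([0, 0, 1, 1, 1])
mask = flag_noisy(emb, labels)
# mask -> [False, False, False, False, True]
```

In the full method the centers stand in for the memory bank of per-sample features from previous network snapshots, which is exactly the cost-saving substitution the abstract mentions.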
arXiv Detail & Related papers (2021-03-30T03:22:17Z)
- Quality Prediction of Open Educational Resources: A Metadata-based Approach [0.0]
Metadata play a key role in offering high quality services such as recommendation and search.
We propose an OER metadata scoring model, and build a metadata-based prediction model to anticipate the quality of OERs.
Based on our data and model, we were able to detect high-quality OERs with the F1 score of 94.6%.
arXiv Detail & Related papers (2020-05-21T09:53:43Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.