Making Metadata More FAIR Using Large Language Models
- URL: http://arxiv.org/abs/2307.13085v1
- Date: Mon, 24 Jul 2023 19:14:38 GMT
- Title: Making Metadata More FAIR Using Large Language Models
- Authors: Sowmya S. Sundaram, Mark A. Musen
- Abstract summary: This work presents a Natural Language Processing (NLP) informed application, called FAIRMetaText, that compares metadata.
Specifically, FAIRMetaText analyzes the natural language descriptions of metadata and provides a mathematical similarity measure between two terms.
This software can drastically reduce the human effort in sifting through various natural language metadata while employing several experimental datasets on the same topic.
- Score: 2.61630828688114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the global increase in experimental data artifacts, harnessing them in a
unified fashion leads to a major stumbling block - bad metadata. To bridge this
gap, this work presents a Natural Language Processing (NLP) informed
application, called FAIRMetaText, that compares metadata. Specifically,
FAIRMetaText analyzes the natural language descriptions of metadata and
provides a mathematical similarity measure between two terms. This measure can
then be utilized for analyzing varied metadata, by suggesting terms for
compliance or grouping similar terms for identification of replaceable terms.
The efficacy of the algorithm is presented qualitatively and quantitatively on
publicly available research artifacts and demonstrates large gains across
metadata related tasks through an in-depth study of a wide variety of Large
Language Models (LLMs). This software can drastically reduce the human effort
in sifting through various natural language metadata while employing several
experimental datasets on the same topic.
Related papers
- Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation [2.9921619703037274]
We propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing.
We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM.
We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages.
arXiv Detail & Related papers (2024-10-01T04:20:14Z) - ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity [0.0]
Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences.
This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLM)
ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity.
arXiv Detail & Related papers (2024-04-18T09:02:45Z) - Utilising a Large Language Model to Annotate Subject Metadata: A Case
Study in an Australian National Research Data Catalogue [18.325675189960833]
In support of open and reproducible research, there has been a rapidly increasing number of datasets made available for research.
As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them.
This paper proposes to leverage large language models (LLMs) for cost-effective annotation of subject metadata through the LLM-based in-context learning.
arXiv Detail & Related papers (2023-10-17T14:52:33Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z) - Always Keep your Target in Mind: Studying Semantics and Improving
Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both old and most recent language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z) - Ground-Truth, Whose Truth? -- Examining the Challenges with Annotating
Toxic Text Datasets [26.486492641924226]
This study examines selected toxic text datasets with the goal of shedding light on some of the inherent issues.
We re-annotate samples from three toxic text datasets and find that a multi-label approach to annotating toxic text samples can help to improve dataset quality.
arXiv Detail & Related papers (2021-12-07T06:58:22Z) - Multimodal Approach for Metadata Extraction from German Scientific
Publications [0.0]
We propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language.
We consider multiple types of input data by combining natural language processing and image vision processing.
Our model for this approach was trained on a dataset consisting of around 8800 documents and is able to obtain an overall F1-score of 0.923.
arXiv Detail & Related papers (2021-11-10T15:19:04Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z) - A Comparative Study of Lexical Substitution Approaches based on Neural
Language Models [117.96628873753123]
We present a large-scale comparative study of popular neural language and masked language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further improved if information about the target word is injected properly.
arXiv Detail & Related papers (2020-05-29T18:43:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.