Making Metadata More FAIR Using Large Language Models
- URL: http://arxiv.org/abs/2307.13085v1
- Date: Mon, 24 Jul 2023 19:14:38 GMT
- Title: Making Metadata More FAIR Using Large Language Models
- Authors: Sowmya S. Sundaram, Mark A. Musen
- Abstract summary: This work presents a Natural Language Processing (NLP) informed application, called FAIRMetaText, that compares metadata.
Specifically, FAIRMetaText analyzes the natural language descriptions of metadata and provides a mathematical similarity measure between two terms.
This software can drastically reduce the human effort in sifting through various natural language metadata while employing several experimental datasets on the same topic.
- Score: 2.61630828688114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the global increase in experimental data artifacts, harnessing them in a
unified fashion leads to a major stumbling block - bad metadata. To bridge this
gap, this work presents a Natural Language Processing (NLP) informed
application, called FAIRMetaText, that compares metadata. Specifically,
FAIRMetaText analyzes the natural language descriptions of metadata and
provides a mathematical similarity measure between two terms. This measure can
then be utilized for analyzing varied metadata, by suggesting terms for
compliance or grouping similar terms for identification of replaceable terms.
The efficacy of the algorithm is presented qualitatively and quantitatively on
publicly available research artifacts and demonstrates large gains across
metadata related tasks through an in-depth study of a wide variety of Large
Language Models (LLMs). This software can drastically reduce the human effort
in sifting through various natural language metadata while employing several
experimental datasets on the same topic.
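
To make the idea of a similarity measure between two metadata terms concrete, here is a minimal sketch in Python. It is not the FAIRMetaText implementation: the abstract does not specify the embedding model or the metric, so this sketch assumes cosine similarity and substitutes a toy bag-of-words vector for an LLM embedding. The function names (`embed`, `cosine_similarity`, `rank_candidates`) and the example term names and descriptions are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an LLM embedding: a bag-of-words count vector.

    Assumption: a real system would call a language model here instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_description: str, candidates: dict) -> list:
    """Rank candidate terms by the similarity of their descriptions to the query."""
    query_vec = embed(query_description)
    scored = [
        (name, cosine_similarity(query_vec, embed(description)))
        for name, description in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical standard terms and their natural language descriptions.
    standard_terms = {
        "collection_date": "date on which the sample was collected",
        "host_age": "age of the host organism at the time of sampling",
    }
    for name, score in rank_candidates("the date the sample was collected", standard_terms):
        print(f"{name}: {score:.3f}")
```

Ranking candidates by such a score is one way to suggest a compliant term for a free-text field, and grouping terms whose pairwise scores exceed a chosen threshold is one way to surface replaceable terms. Swapping `embed` for a real model would leave the rest of the sketch unchanged.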
 
      
Related papers
- Improving Large Vision-Language Models' Understanding for Field Data [62.917026891829025]
 We introduce FieldLVLM, a framework designed to improve large vision-language models' understanding of field data.
FieldLVLM consists of two main components: a field-aware language generation strategy and a data-compressed multimodal model tuning.
Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data.
 arXiv  Detail & Related papers  (2025-07-24T11:28:53Z)
- Metaphor and Large Language Models: When Surface Features Matter More than Deep Understanding [6.0158981171030685]
 This paper presents a comprehensive evaluation of the capabilities of Large Language Models (LLMs) in metaphor interpretation across multiple datasets, tasks, and prompt configurations.
We address these limitations by conducting extensive experiments using diverse publicly available datasets with inference and metaphor annotations.
The results indicate that LLMs' performance is more influenced by features like lexical overlap and sentence length than by metaphorical content.
 arXiv  Detail & Related papers  (2025-07-21T08:09:11Z)
- MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
 MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic.
Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
 arXiv  Detail & Related papers  (2025-05-26T10:31:26Z)
- Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents [8.516310581591426]
 This study evaluates various feature learning and prediction methods, including natural language processing (NLP), computer vision (CV), and multimodal approaches, for extracting metadata from documents with high template variance.
We aim to improve the accessibility of scientific documents and facilitate their wider use.
 arXiv  Detail & Related papers  (2025-01-09T09:03:43Z)
- Scholar Name Disambiguation with Search-enhanced LLM Across Language [0.2302001830524133]
 This paper proposes a novel approach by leveraging search-enhanced language models across multiple languages to improve name disambiguation.
By utilizing the powerful query rewriting, intent recognition, and data indexing capabilities of search engines, our method can gather richer information for distinguishing between entities and extracting profiles.
 arXiv  Detail & Related papers  (2024-11-26T04:39:46Z)
- Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation [2.9921619703037274]
 We propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing.
We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM.
We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages.
 arXiv  Detail & Related papers  (2024-10-01T04:20:14Z)
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
 We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning.
Our key idea involves challenging the model to discern both matching and distinct elements.
We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
 arXiv  Detail & Related papers  (2024-08-08T17:10:16Z)
- ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity [0.0]
 Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences.
This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLMs).
ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity.
 arXiv  Detail & Related papers  (2024-04-18T09:02:45Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
 We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
 arXiv  Detail & Related papers  (2023-08-20T12:43:52Z)
- An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
 The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
 arXiv  Detail & Related papers  (2023-04-17T17:13:42Z)
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
 We present a large-scale comparative study of lexical substitution methods employing both old and most recent language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
 arXiv  Detail & Related papers  (2022-06-07T16:16:19Z)
- Ground-Truth, Whose Truth? -- Examining the Challenges with Annotating Toxic Text Datasets [26.486492641924226]
 This study examines selected toxic text datasets with the goal of shedding light on some of the inherent issues.
We re-annotate samples from three toxic text datasets and find that a multi-label approach to annotating toxic text samples can help to improve dataset quality.
 arXiv  Detail & Related papers  (2021-12-07T06:58:22Z)
- Multimodal Approach for Metadata Extraction from German Scientific Publications [0.0]
 We propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language.
We consider multiple types of input data by combining natural language processing and image vision processing.
Our model for this approach was trained on a dataset consisting of around 8800 documents and is able to obtain an overall F1-score of 0.923.
 arXiv  Detail & Related papers  (2021-11-10T15:19:04Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
 We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
 arXiv  Detail & Related papers  (2021-11-02T15:48:28Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
 We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
 arXiv  Detail & Related papers  (2020-11-13T10:53:27Z)
- A Comparative Study of Lexical Substitution Approaches based on Neural Language Models [117.96628873753123]
 We present a large-scale comparative study of popular neural language and masked language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further improved if information about the target word is injected properly.
 arXiv  Detail & Related papers  (2020-05-29T18:43:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     