Flexible metadata harvesting for ecology using large language models
- URL: http://arxiv.org/abs/2508.20115v1
- Date: Thu, 21 Aug 2025 10:10:29 GMT
- Title: Flexible metadata harvesting for ecology using large language models
- Authors: Zehao Lu, Thijs L van der Plas, Parinaz Rashidi, W Daniel Kissling, Ioannis N Athanasiadis,
- Abstract summary: We develop a large language model (LLM)-based metadata harvester.<n>It flexibly extracts metadata from any dataset's landing page.<n>It converts these to a user-defined, unified format using existing metadata standards.
- Score: 3.4117490081172774
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large, open datasets can accelerate ecological research, particularly by enabling researchers to develop new insights by reusing datasets from multiple sources. However, to find the most suitable datasets to combine and integrate, researchers must navigate diverse ecological and environmental data provider platforms with varying metadata availability and standards. To overcome this obstacle, we have developed a large language model (LLM)-based metadata harvester that flexibly extracts metadata from any dataset's landing page, and converts these to a user-defined, unified format using existing metadata standards. We validate that our tool is able to extract both structured and unstructured metadata with equal accuracy, aided by our LLM post-processing protocol. Furthermore, we utilise LLMs to identify links between datasets, both by calculating embedding similarity and by unifying the formats of extracted metadata to enable rule-based processing. Our tool, which flexibly links the metadata of different datasets, can therefore be used for ontology creation or graph-based queries, for example, to find relevant ecological and environmental datasets in a virtual research environment.
Related papers
- Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them.<n>This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks.<n>We introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning, standardization, error processing, imputation, data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z) - Keywords are not always the key: A metadata field analysis for natural language search on open data portals [3.974422712382188]
We examine how individual metadata fields affect the success of conversational dataset retrieval.<n>We compare existing content of the metadata field 'description' with LLM-generated content.<n>Our findings suggest that dataset descriptions play a central role in aligning with user intent.
arXiv Detail & Related papers (2025-09-17T22:14:27Z) - MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic.<n>Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z) - Making Sense of Data in the Wild: Data Analysis Automation at Scale [0.1747623282473278]
We propose a novel approach that combines intelligent agents with retrieval augmented generation to automate data analysis, dataset curation and indexing at scale.<n>We demonstrate that our approach results in more detailed dataset descriptions, higher hit rates and greater diversity in dataset retrieval tasks.
arXiv Detail & Related papers (2025-01-27T10:04:10Z) - Harmonizing Metadata of Language Resources for Enhanced Querying and Accessibility [0.0]
This paper addresses the harmonization of metadata from diverse repositories of language resources (LRs)<n>Our methodology supports text-based search, faceted browsing, and advanced SPARQL queries through Linghub, a newly developed portal.<n>The study highlights significant metadata issues and advocates for adherence to open vocabularies and standards to enhance metadata harmonization.
arXiv Detail & Related papers (2025-01-09T22:48:43Z) - Metadata-based Data Exploration with Retrieval-Augmented Generation for Large Language Models [3.7685718201378746]
This research introduces a new architecture for data exploration which employs a form of Retrieval-Augmented Generation (RAG) to enhance metadata-based data discovery.
The proposed framework offers a new method for evaluating semantic similarity among heterogeneous data sources.
arXiv Detail & Related papers (2024-10-05T17:11:37Z) - Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning [3.623224034411137]
offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems.
Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results.
We show how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets.
arXiv Detail & Related papers (2024-09-18T14:13:24Z) - Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z) - Making Metadata More FAIR Using Large Language Models [2.61630828688114]
This work presents a Natural Language Processing (NLP) informed application, called FAIRMetaText, that compares metadata.
Specifically, FAIRMetaText analyzes the natural language descriptions of metadata and provides a mathematical similarity measure between two terms.
This software can drastically reduce the human effort in sifting through various natural language metadata while employing several experimental datasets on the same topic.
arXiv Detail & Related papers (2023-07-24T19:14:38Z) - infoVerse: A Universal Framework for Dataset Characterization with
Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual
Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - Metadata Representations for Queryable ML Model Zoos [73.24799582702326]
Machine learning (ML) practitioners and organizations are building model zoos of pre-trained models, containing metadata describing properties of the models.
The metatada is currently not standardised; its expressivity is limited; and there is no way to store and query it.
In this paper, we advocate for standardized ML model meta-data representation and management, proposing a toolkit supported to help practitioners manage and query that metadata.
arXiv Detail & Related papers (2022-07-19T15:04:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.