Metadata Enrichment of Long Text Documents using Large Language Models
- URL: http://arxiv.org/abs/2506.20918v1
- Date: Thu, 26 Jun 2025 00:55:47 GMT
- Title: Metadata Enrichment of Long Text Documents using Large Language Models
- Authors: Manika Lamba, You Peng, Sophie Nikolov, Glen Layne-Worthey, J. Stephen Downie,
- Abstract summary: In this project, we semantically enriched and enhanced the metadata of long text documents (theses and dissertations in English, published from 1920 to 2020) retrieved from the HathiTrust Digital Library. This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science.
- Score: 3.536523762475449
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this project, we semantically enriched and enhanced the metadata of long text documents (theses and dissertations in English, published from 1920 to 2020) retrieved from the HathiTrust Digital Library, through a combination of manual effort and large language models. This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science. Our paper shows that enriching metadata using LLMs is particularly beneficial for digital repositories: it introduces additional metadata access points that may not originally have been foreseen to accommodate various content types. The approach is especially effective for repositories with significant missing data in their existing metadata fields, enhancing search results and improving the accessibility of the digital repository.
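The enrichment workflow described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the record fields, prompt wording, and the `fake_llm` stub (standing in for a real LLM API call) are all assumptions made for demonstration.

```python
import json

def build_enrichment_prompt(record):
    """Compose a prompt asking an LLM to suggest values for missing
    metadata fields. Field names and wording are illustrative."""
    missing = [k for k, v in record.items() if not v]
    return (
        "You are a library cataloguer. Given the thesis record below, "
        f"suggest values for the missing fields {missing} as JSON.\n"
        + json.dumps(record, indent=2)
    )

def enrich_record(record, llm):
    """Merge LLM suggestions into the record, never overwriting
    values that are already present."""
    suggestions = json.loads(llm(build_enrichment_prompt(record)))
    return {k: v if v else suggestions.get(k, v) for k, v in record.items()}

# Stub standing in for a real LLM client; a production system would
# call an actual model here.
def fake_llm(prompt):
    return json.dumps({"discipline": "Sociology",
                       "keywords": ["migration", "urban labor"]})

record = {"title": "Essays on Urban Labor Markets", "year": "1963",
          "discipline": "", "keywords": ""}
enriched = enrich_record(record, fake_llm)
```

The key design point mirrored here is that LLM output only fills gaps: existing catalogue values are kept verbatim, so enrichment adds access points without altering curated metadata.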
Related papers
- Knowledge Graphs for Digitized Manuscripts in Jagiellonian Digital Library Application [8.732274235941974]
Galleries, libraries, archives and museums (GLAM institutions) are actively digitizing their holdings, creating extensive digital collections.
These collections are often enriched with metadata describing the items but not their precise contents.
We explore an integrated methodology combining computer vision (CV), artificial intelligence (AI), and semantic web technologies to enrich metadata and construct knowledge graphs for digitized manuscripts and incunabula.
arXiv Detail & Related papers (2025-05-29T14:49:24Z)
- MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic.
Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z)
- Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents [8.516310581591426]
This study evaluates various feature learning and prediction methods, including natural language processing (NLP), computer vision (CV), and multimodal approaches, for extracting metadata from documents with high template variance.
We aim to improve the accessibility of scientific documents and facilitate their wider use.
arXiv Detail & Related papers (2025-01-09T09:03:43Z)
- Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long-form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments on two datasets from different domains demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [103.0865116794534]
We introduce large models into the data collection pipeline to guide the generation of domain-specific information.
We refer to this approach as Retrieve-from-CC.
It not only collects data related to domain-specific knowledge but also mines data containing potential reasoning procedures from the public corpus.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue [18.325675189960833]
In support of open and reproducible research, there has been a rapidly increasing number of datasets made available for research.
As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them.
This paper proposes to leverage large language models (LLMs) for cost-effective annotation of subject metadata through LLM-based in-context learning.
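In-context learning of the kind this blurb describes typically means packing a few labeled examples into the prompt ahead of the record to annotate. The sketch below shows the general pattern only; the example records, instruction wording, and `subject` label name are illustrative assumptions, not the paper's actual prompt.

```python
def make_few_shot_prompt(examples, new_record, label_name="subject"):
    """Assemble a few-shot (in-context learning) prompt for subject
    annotation: labeled demonstrations first, unlabeled record last."""
    lines = [f"Assign a {label_name} term to each dataset description."]
    for desc, label in examples:
        lines.append(f"Description: {desc}\n{label_name.capitalize()}: {label}")
    # The model is expected to continue the prompt with the missing label.
    lines.append(f"Description: {new_record}\n{label_name.capitalize()}:")
    return "\n\n".join(lines)

# Hypothetical demonstration pairs; a real catalogue would draw these
# from already-annotated records.
examples = [
    ("Soil moisture readings from a sensor network, 2015-2018",
     "Environmental sciences"),
    ("Survey responses on remote work practices", "Social sciences"),
]
prompt = make_few_shot_prompt(examples, "Genome sequences of wheat cultivars")
```

The prompt ends at the empty label slot, so whatever the model generates next is taken as the annotation; no fine-tuning is involved, which is what makes the approach cost-effective.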
arXiv Detail & Related papers (2023-10-17T14:52:33Z)
- DeepShovel: An Online Collaborative Platform for Data Extraction in Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
arXiv Detail & Related papers (2022-02-21T12:18:08Z)
- Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources [17.69148305999049]
We present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative.
We identify a geographically diverse set of target language groups for which to collect metadata on potential data sources.
To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons.
arXiv Detail & Related papers (2022-01-25T03:05:23Z)
- Multimodal Approach for Metadata Extraction from German Scientific Publications [0.0]
We propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language.
We consider multiple types of input data by combining natural language processing and image processing.
Our model for this approach was trained on a dataset consisting of around 8800 documents and is able to obtain an overall F1-score of 0.923.
arXiv Detail & Related papers (2021-11-10T15:19:04Z)
- Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.