Advancing Manuscript Metadata: Work in Progress at the Jagiellonian University
- URL: http://arxiv.org/abs/2407.06976v1
- Date: Tue, 9 Jul 2024 15:52:06 GMT
- Title: Advancing Manuscript Metadata: Work in Progress at the Jagiellonian University
- Authors: Luiz do Valle Miranda, Krzysztof Kutt, Grzegorz J. Nalepa
- Abstract summary: Three Jagiellonian University units are collaborating to digitize cultural heritage documents, describe them in detail, and then integrate these descriptions into a linked data cloud.
We present a report on the current status of the work, in which we outline the most important requirements for the data model under development.
We make a detailed comparison with the two standards most relevant from the point of view of the collections: the Europeana Data Model used in Europeana and the Encoded Archival Description used in Kalliope.
- Score: 7.993453987882035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As part of ongoing research projects, three Jagiellonian University units -- the Jagiellonian University Museum, the Jagiellonian University Archives, and the Jagiellonian Library -- are collaborating to digitize cultural heritage documents, describe them in detail, and then integrate these descriptions into a linked data cloud. Achieving this goal requires, as a first step, the development of a metadata model that complies with existing standards, allows interoperability with other systems, and captures all the elements of description established by the curators of the collections. In this paper, we present a report on the current status of the work, in which we outline the most important requirements for the data model under development and then make a detailed comparison with the two standards most relevant from the point of view of the collections: the Europeana Data Model used in Europeana and the Encoded Archival Description used in Kalliope.
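To make the target concrete, a linked-data description along these lines could use Europeana Data Model (EDM) classes such as edm:ProvidedCHO and ore:Aggregation. The sketch below is a minimal illustration in Python with rdflib; all identifiers, titles, and property choices are invented for illustration and are not the model under development in the paper.

```python
# A minimal sketch of an EDM-style manuscript description, assuming rdflib
# is installed (pip install rdflib). All identifiers and values are invented.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DC

EDM = Namespace("http://www.europeana.eu/schemas/edm/")
ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
g.bind("edm", EDM)
g.bind("ore", ORE)
g.bind("dc", DC)

# The cultural heritage object itself (a hypothetical manuscript identifier).
cho = URIRef("https://example.org/manuscript/BJ-Rkp-001")
g.add((cho, RDF.type, EDM.ProvidedCHO))
g.add((cho, DC.title, Literal("Sample mediaeval codex", lang="en")))
g.add((cho, DC.creator, Literal("Unknown scribe")))

# The aggregation linking the object to its digital representation.
agg = URIRef("https://example.org/aggregation/BJ-Rkp-001")
g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, EDM.aggregatedCHO, cho))
g.add((agg, EDM.isShownBy, URIRef("https://example.org/scans/BJ-Rkp-001.jpg")))

print(g.serialize(format="turtle"))
```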
Related papers
- Metadata Enrichment of Long Text Documents using Large Language Models [3.536523762475449]
In this project, we semantically enriched and enhanced the metadata of long text documents (theses and dissertations) retrieved from the HathiTrust Digital Library, published in English from 1920 to 2020.
This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science.
arXiv Detail & Related papers (2025-06-26T00:55:47Z) - CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation [61.130639734982395]
We introduce CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image.
Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label.
arXiv Detail & Related papers (2025-06-10T17:16:23Z) - Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications [8.732274235941974]
We present the Metadata Enrichment Model (MEM), a conceptual framework designed to enrich metadata for digitized collections.
MEM combines fine-tuned computer vision models, large language models and structured knowledge graphs.
We release a dataset of digitized incunabula from the Jagiellonian Digital Library.
arXiv Detail & Related papers (2025-05-29T15:22:18Z) - From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [57.43233760384488]
Adapting cultural values in Large Language Models (LLMs) presents significant challenges.
Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data.
In this paper, we investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge.
arXiv Detail & Related papers (2025-05-22T09:00:01Z) - A Generative AI-driven Metadata Modelling Approach [1.450405446885067]
This paper proposes a metadata modelling approach based on Generative AI-driven collaboration between humans and Large Language Models (LLMs), which disentangles the entanglement inherent in each representation level and leads to a conceptually disentangled metadata model.
arXiv Detail & Related papers (2024-12-13T09:26:04Z) - Web Archives Metadata Generation with GPT-4o: Challenges and Insights [2.45723043286596]
This paper explores the use of GPT-4o for metadata generation within Web Archive Singapore.
We processed 112 Web ARChive (WARC) files using data reduction techniques, achieving a notable 99.9% reduction in metadata generation costs.
The study identifies key challenges including content inaccuracies, hallucinations, and translation issues, suggesting that Large Language Models (LLMs) should serve as complements rather than replacements for human cataloguers.
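For a sense of the pre-processing such a pipeline needs, the sketch below reads WARC records with warcio, strips HTML, and truncates the text before it would be prompted to an LLM. This is a rough stand-in under stated assumptions: the file name and the 4000-character cap are invented, and the paper's actual data reduction techniques are not detailed here.

```python
# A minimal sketch, assuming warcio and beautifulsoup4 are installed.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def reduced_records(warc_path: str, max_chars: int = 4000):
    """Yield (URI, truncated plain text) pairs from a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type", "")
            if "html" not in ctype:
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
            yield uri, text[:max_chars]  # naive reduction: keep the head only

for uri, text in reduced_records("example.warc.gz"):
    # Here one would build a cataloguing prompt around `text` and send it to
    # GPT-4o; per the paper, the output still needs a human cataloguer's review.
    print(uri, len(text))
```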
arXiv Detail & Related papers (2024-11-08T08:59:40Z) - Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long-form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments on two datasets from different domains demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z) - Microsoft Cloud-based Digitization Workflow with Rich Metadata Acquisition for Cultural Heritage Objects [7.450700594277742]
We have developed a new digitization workflow with the Jagiellonian Library (JL).
The solution is based on easy-to-access technologies -- the Microsoft cloud with MS Excel file interfaces, Office Script for metadata acquisition, and MS 365 for storage -- that allow metadata acquisition by domain experts.
The ultimate goal is to create a knowledge graph that describes the analyzed holdings, linked to general knowledge bases, as well as to other cultural heritage collections.
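To illustrate how such spreadsheet-based acquisition can feed downstream processing, the sketch below reads curator-entered rows from an Excel file in Python with openpyxl. Note the project itself uses Office Script inside MS Excel; the file name and column layout here are invented for illustration.

```python
# A minimal sketch, assuming openpyxl is installed and an invented sheet layout.
from openpyxl import load_workbook

wb = load_workbook("manuscripts.xlsx", read_only=True)
ws = wb.active

# First row holds the column names entered by the curators (assumed layout).
header = [cell.value for cell in next(ws.iter_rows(min_row=1, max_row=1))]

records = []
for row in ws.iter_rows(min_row=2, values_only=True):
    record = dict(zip(header, row))
    if record.get("Shelfmark"):  # skip empty rows ("Shelfmark" is hypothetical)
        records.append(record)

print(f"Loaded {len(records)} metadata records")
```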
arXiv Detail & Related papers (2024-07-09T15:49:47Z) - EUFCC-340K: A Faceted Hierarchical Dataset for Metadata Annotation in GLAM Collections [6.723689308768857]
The EUFCC-340K dataset is organized across multiple facets: Materials, Object Types, Disciplines, and Subjects, following a hierarchical structure based on the Art & Architecture Thesaurus (AAT).
Our experiments to evaluate model robustness and generalization capabilities in two different test scenarios demonstrate the utility of the dataset in improving multi-label classification tools.
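As a minimal illustration of the kind of multi-label evaluation such experiments rely on, the sketch below computes micro- and macro-F1 over binary indicator matrices with scikit-learn. The facet labels and predictions are invented and unrelated to the actual EUFCC-340K results.

```python
# A toy multi-label evaluation sketch, assuming scikit-learn is installed.
import numpy as np
from sklearn.metrics import f1_score

labels = ["parchment", "woodcut", "numismatics", "cartography"]  # invented facets

# Binary indicator matrices: rows are items, columns are facet labels.
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 1, 0]])

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```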
arXiv Detail & Related papers (2024-06-04T14:57:56Z) - Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction [61.998789448260005]
We propose to identify the typical structure of documents within a collection.
We abstract over arbitrary header paraphrases and ground each topic to its respective document locations.
We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
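In the spirit of such a graph-based method (though not the paper's actual algorithm), the toy sketch below builds a similarity graph over section headers: nodes are sections, edges connect sufficiently similar sections across and within documents, and connected components act as candidate "typical" topics. The texts and the 0.3 threshold are invented.

```python
# A toy similarity-graph sketch, assuming networkx and scikit-learn.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sections = [
    ("doc1", "Symptoms and clinical presentation"),
    ("doc2", "Clinical presentation of symptoms"),
    ("doc1", "Treatment and therapy options"),
    ("doc2", "Options for treatment and therapy"),
]

texts = [text for _, text in sections]
sim = cosine_similarity(TfidfVectorizer().fit_transform(texts))

g = nx.Graph()
g.add_nodes_from(range(len(sections)))
for i in range(len(sections)):
    for j in range(i + 1, len(sections)):
        if sim[i, j] > 0.3:  # keep inter- and intra-document edges above threshold
            g.add_edge(i, j)

# Each connected component groups paraphrased headers into one candidate topic.
for component in nx.connected_components(g):
    print([sections[i] for i in component])
```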
arXiv Detail & Related papers (2024-02-21T16:22:21Z) - Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z) - Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents by Not Safe For Work (NSFW) values computed from images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z) - On the Cultural Gap in Text-to-Image Generation [75.69755281031951]
One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data.
There is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images.
We propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture.
arXiv Detail & Related papers (2023-07-06T13:17:55Z) - Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z) - DAC-MR: Data Augmentation Consistency Based Meta-Regularization for Meta-Learning [55.733193075728096]
We propose a meta-knowledge informed meta-learning (MKIML) framework to improve meta-learning.
We preliminarily integrate meta-knowledge into the meta-objective by using an appropriate meta-regularization (MR) objective.
The proposed DAC-MR is expected to learn well-performing meta-models from training tasks with noisy, sparse, or unavailable meta-data.
arXiv Detail & Related papers (2023-05-13T11:01:47Z) - Documenting Data Production Processes: A Participatory Approach for Data Work [4.811554861191618]
The opacity of machine learning data is a significant threat to ethical data work and intelligible systems.
Previous research has proposed standardized checklists to document datasets.
This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
arXiv Detail & Related papers (2022-07-11T15:39:02Z) - Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two real, practical instance-level retrieval tasks.
We then train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z) - Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources [17.69148305999049]
We present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative.
We identify a geographically diverse set of target language groups for which to collect metadata on potential data sources.
To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons.
arXiv Detail & Related papers (2022-01-25T03:05:23Z) - Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval [51.823187647843945]
In this paper, we encode the neighborhood information with a graph-induced Gaussian distribution and propose to integrate it with document semantics using a graph-driven generative model.
Under the approximation, we prove that the training objective can be decomposed into terms involving only singleton or pairwise documents, enabling the model to be trained as efficiently as uncorrelated ones.
arXiv Detail & Related papers (2021-05-27T11:29:03Z) - What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.