BookReconciler: An Open-Source Tool for Metadata Enrichment and Work-Level Clustering
- URL: http://arxiv.org/abs/2512.10165v2
- Date: Wed, 17 Dec 2025 03:07:21 GMT
- Title: BookReconciler: An Open-Source Tool for Metadata Enrichment and Work-Level Clustering
- Authors: Matt Miller, Dan Sinykin, Melanie Walsh,
- Abstract summary: BookReconciler is an open-source tool for enhancing and clustering book data.<n>BookReconciler clustering achieves near-perfect accuracy for U.S. works but lower performance for global texts.<n>BookReconciler supports the reuse of data across domains and applications.
- Score: 0.7282934840192818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present BookReconciler, an open-source tool for enhancing and clustering book data. BookReconciler allows users to take spreadsheets with minimal metadata, such as book title and author, and automatically 1) add authoritative, persistent identifiers like ISBNs 2) and cluster related Expressions and Manifestations of the same Work, e.g., different translations or editions. This enhancement makes it easier to combine related collections and analyze books at scale. The tool is currently designed as an extension for OpenRefine -- a popular software application -- and connects to major bibliographic services including the Library of Congress, VIAF, OCLC, HathiTrust, Google Books, and Wikidata. Our approach prioritizes human judgment. Through an interactive interface, users can manually evaluate matches and define the contours of a Work (e.g., to include translations or not). We evaluate reconciliation performance on datasets of U.S. prize-winning books and contemporary world fiction. BookReconciler achieves near-perfect accuracy for U.S. works but lower performance for global texts, reflecting structural weaknesses in bibliographic infrastructures for non-English and global literature. Overall, BookReconciler supports the reuse of bibliographic data across domains and applications, contributing to ongoing work in digital libraries and digital humanities.
Related papers
- MajinBook: An open catalogue of digital world literature with likes [2.6547708221528987]
MajinBook is an open catalogue designed to facilitate use of shadow libraries.<n>We create a high-precision corpus of over 539,000 references to English-language books spanning three centuries.
arXiv Detail & Related papers (2025-11-14T15:44:27Z) - ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links [57.514511353084565]
We introduce a new domain-agnostic framework for selecting a best-performing approach and annotating cross-document links.<n>We apply our framework in two distinct domains -- peer review and news.<n>The resulting novel datasets lay foundation for numerous cross-document tasks like media framing and peer review.
arXiv Detail & Related papers (2025-09-01T11:32:24Z) - Chatting with Papers: A Hybrid Approach Using LLMs and Knowledge Graphs [3.68389405018277]
This demo paper reports on a new workflow textitGhostWriter that combines the use of Large Language Models and Knowledge Graphs to support navigation through collections.<n>Based on the tool-suite textitEverythingData at the backend, textitGhostWriter provides an interface that enables querying and chatting'' with a collection.
arXiv Detail & Related papers (2025-05-16T18:51:51Z) - Generative Retrieval for Book search [106.67655212825025]
We propose an effective Generative retrieval framework for Book Search.<n>It features two main components: data augmentation and outline-oriented book encoding.<n>Experiments on a proprietary Baidu dataset demonstrate that GBS outperforms strong baselines.
arXiv Detail & Related papers (2025-01-19T12:57:13Z) - Image-text matching for large-scale book collections [10.444851303425589]
We address the problem of detecting and mapping all books in a collection of images to entries in a given book catalogue.
We combine a state-of-the-art segmentation method (SAM) to detect book spines and extract book information using a commercial OCR.
To evaluate our approach, we publish a new dataset of annotated bookshelf images that covers the whole book collection of a public library in Spain.
arXiv Detail & Related papers (2024-07-29T09:05:04Z) - AI Literature Review Suite [0.0]
I present an AI Literature Review Suite that integrates several functionalities to provide a comprehensive literature review.
This tool leverages the power of open access science, large language models (LLMs) and natural language processing to enable the searching, downloading, and organizing of PDF files.
The suite also features integrated programs for organization, interaction and query, and literature review summaries.
arXiv Detail & Related papers (2023-07-27T17:30:31Z) - LAVIS: A Library for Language-Vision Intelligence [98.88477610704938]
LAVIS is an open-source library for LAnguage-VISion research and applications.
It features a unified interface to easily access state-of-the-art image-language, video-language models and common datasets.
arXiv Detail & Related papers (2022-09-15T18:04:10Z) - The Fellowship of the Authors: Disambiguating Names from Social Network
Context [2.3605348648054454]
Authority lists with extensive textual descriptions for each entity are lacking and ambiguous named entities.
We combine BERT-based mention representations with a variety of graph induction strategies and experiment with supervised and unsupervised cluster inference methods.
We find that in-domain language model pretraining can significantly improve mention representations, especially for larger corpora.
arXiv Detail & Related papers (2022-08-31T21:51:55Z) - Assessing the quality of sources in Wikidata across languages: a hybrid
approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z) - Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z) - SacreROUGE: An Open-Source Library for Using and Developing
Summarization Evaluation Metrics [74.28810048824519]
SacreROUGE is an open-source library for using and developing summarization evaluation metrics.
The library provides Python wrappers around the official implementations of existing evaluation metrics.
It provides functionality to evaluate how well any metric implemented in the library correlates to human-annotated judgments.
arXiv Detail & Related papers (2020-07-10T13:26:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.