MajinBook: An open catalogue of digital world literature with likes
- URL: http://arxiv.org/abs/2511.11412v3
- Date: Thu, 20 Nov 2025 07:27:10 GMT
- Title: MajinBook: An open catalogue of digital world literature with likes
- Authors: Antoine Mazières, Thierry Poibeau,
- Abstract summary: MajinBook is an open catalogue designed to facilitate use of shadow libraries.<n>We create a high-precision corpus of over 539,000 references to English-language books spanning three centuries.
- Score: 2.6547708221528987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries--such as Library Genesis and Z-Library--for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.
Related papers
- BookReconciler: An Open-Source Tool for Metadata Enrichment and Work-Level Clustering [0.7282934840192818]
BookReconciler is an open-source tool for enhancing and clustering book data.<n>BookReconciler clustering achieves near-perfect accuracy for U.S. works but lower performance for global texts.<n>BookReconciler supports the reuse of data across domains and applications.
arXiv Detail & Related papers (2025-12-10T23:51:55Z) - Metadata Enrichment of Long Text Documents using Large Language Models [3.536523762475449]
In this project, we semantically enriched and enhanced the metadata of long text documents, theses and dissertations, retrieved from the HathiTrust Digital Library in English published from 1920 to 2020.<n>This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science.
arXiv Detail & Related papers (2025-06-26T00:55:47Z) - Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability [1.3281177137699656]
This report introduces Institutional Books 1.0, a collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006.<n>Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts.<n>This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens.
arXiv Detail & Related papers (2025-06-10T00:11:30Z) - Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [103.0865116794534]
We introduce large models into the data collection pipeline to guide the generation of domain-specific information.<n>We refer to this approach as Retrieve-from-CC.<n>It not only collects data related to domain-specific knowledge but also mines the data containing potential reasoning procedures from the public corpus.
arXiv Detail & Related papers (2024-01-26T03:38:23Z) - American Stories: A Large-Scale Structured Text Dataset of Historical
U.S. Newspapers [7.161822501147275]
This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images.
It applies it to the nearly 20 million scans in Library of Congress's public domain Chronicling America collection.
The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes.
arXiv Detail & Related papers (2023-08-24T00:24:42Z) - The Semantic Scholar Open Data Platform [92.2948743167744]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.<n>We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.<n>The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z) - Deep learning for table detection and structure recognition: A survey [49.09628624903334]
The goal of this survey is to provide a profound comprehension of the major developments in the field of Table Detection.
We provide an analysis of both classic and new applications in the field.
The datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature.
arXiv Detail & Related papers (2022-11-15T19:42:27Z) - Documenting Geographically and Contextually Diverse Data Sources: The
BigScience Catalogue of Language Data and Resources [17.69148305999049]
We present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative.
We identify a geographically diverse set of target language groups for which to collect metadata on potential data sources.
To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons.
arXiv Detail & Related papers (2022-01-25T03:05:23Z) - Assessing the quality of sources in Wikidata across languages: a hybrid
approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z) - Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z) - \textit{StateCensusLaws.org}: A Web Application for Consuming and
Annotating Legal Discourse Learning [89.77347919191774]
We create a web application to highlight the output of NLP models trained to parse and label discourse segments in law text.
We focus on state-level law that uses U.S. Census population numbers to allocate resources and organize government.
arXiv Detail & Related papers (2021-04-20T22:00:54Z) - SacreROUGE: An Open-Source Library for Using and Developing
Summarization Evaluation Metrics [74.28810048824519]
SacreROUGE is an open-source library for using and developing summarization evaluation metrics.
The library provides Python wrappers around the official implementations of existing evaluation metrics.
It provides functionality to evaluate how well any metric implemented in the library correlates to human-annotated judgments.
arXiv Detail & Related papers (2020-07-10T13:26:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.