Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability
- URL: http://arxiv.org/abs/2506.08300v1
- Date: Tue, 10 Jun 2025 00:11:30 GMT
- Title: Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability
- Authors: Matteo Cargnelutti, Catherine Brobston, John Hess, Jack Cushman, Kristi Mukk, Aristana Scourtas, Kyle Courtney, Greg Leppert, Amanda Watson, Martha Whitehead, Jonathan Zittrain
- Abstract summary: This report introduces Institutional Books 1.0, a collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.
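The release described above reduces the full scanned collection (1,075,899 volumes, ~250B tokens) to the 983,004 volumes (242B tokens) identified as public domain. That filtering-and-tallying step can be sketched as a simple metadata pass; the field names below (`rights`, `token_count`) are illustrative assumptions, not the dataset's actual schema.

```python
# Sketch of the public-domain filtering and token-tally step described above.
# Field names ("rights", "token_count") are hypothetical, not the real schema.

def filter_public_domain(volumes):
    """Keep only volumes whose rights metadata marks them public domain."""
    return [v for v in volumes if v.get("rights") == "public-domain"]

def total_tokens(volumes):
    """Sum per-volume token counts into a corpus-level total."""
    return sum(v.get("token_count", 0) for v in volumes)

if __name__ == "__main__":
    catalog = [
        {"id": "vol-001", "rights": "public-domain", "token_count": 120_000},
        {"id": "vol-002", "rights": "in-copyright", "token_count": 95_000},
        {"id": "vol-003", "rights": "public-domain", "token_count": 80_000},
    ]
    pd_volumes = filter_public_domain(catalog)
    print(len(pd_volumes), total_tokens(pd_volumes))
```

In the real pipeline the rights determination comes from bibliographic and source metadata rather than a single flag, but the shape of the operation is the same.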
Related papers
- Metadata Enrichment of Long Text Documents using Large Language Models [3.536523762475449]
In this project, we semantically enriched and enhanced the metadata of long text documents (theses and dissertations) retrieved from the HathiTrust Digital Library, published in English from 1920 to 2020. This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science.
arXiv Detail & Related papers (2025-06-26T00:55:47Z)
- Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training [6.00143998001152]
We introduce Common Corpus, the largest open dataset for language model pre-training. The dataset contains a wide variety of languages, ranging from the main European languages to low-resource ones rarely present in pre-training datasets.
arXiv Detail & Related papers (2025-06-02T14:43:15Z)
- Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora [2.3251886193174114]
We present an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material. We validate our approach using five strategically selected datasets: EPFL PhD manuscripts, a private collection of historical records, two sets of Wikipedia articles on related topics, and a synthetic baseline dataset.
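The gain measure this abstract describes — LLM accuracy on generated MCQs with versus without the source text — reduces to a simple difference of accuracies. A minimal sketch, with correct-answer counts supplied as inputs rather than coming from an actual model run:

```python
def information_potential(correct_with_context, correct_without_context, n_questions):
    """Difference in MCQ accuracy with vs. without access to the source text.

    A large positive gap suggests the collection holds information the model
    does not already know; a gap near zero suggests little to gain from it.
    """
    if n_questions <= 0:
        raise ValueError("need at least one question")
    acc_with = correct_with_context / n_questions
    acc_without = correct_without_context / n_questions
    return acc_with - acc_without

# e.g. 90/100 correct with the text but 40/100 without suggests high potential
```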
arXiv Detail & Related papers (2025-02-19T13:03:06Z)
- Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM [47.64519989743434]
Steel-LLM is a Chinese-centric language model developed from scratch with the goal of creating a high-quality, open-source model. This paper provides a comprehensive summary of the project's key contributions, including data collection, model design, training methodologies, and the challenges encountered along the way.
arXiv Detail & Related papers (2025-02-10T16:31:37Z) - Insights from Publishing Open Data in Industry-Academia Collaboration [3.458783333044753]
This paper explores the motivations and lessons learned from publishing open data sets in such collaborations. We surveyed participants in a European research project that published 13 data sets. We found that planning the data collection is essential, and that only a few datasets had accompanying scripts for improved reuse.
arXiv Detail & Related papers (2024-10-29T04:14:23Z)
- A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document.
Large Language Models (LLMs) with their deep reasoning capabilities and ability to maintain long-range textual associations offer a promising alternative.
Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors.
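A Bayesian treatment of authorship attribution, as the title suggests, amounts to turning per-author likelihoods of the document into a posterior over candidate authors. A generic sketch of that step, not the paper's specific method; the likelihoods here would in practice come from an LLM scoring the document under each author's style:

```python
import math

def author_posterior(log_likelihoods, priors=None):
    """Posterior P(author | document) from per-author log-likelihoods.

    log_likelihoods: dict author -> log P(document | author), e.g. scores a
    model assigns to the document under each candidate author.
    priors: optional dict author -> prior probability (uniform if None).
    """
    authors = list(log_likelihoods)
    if priors is None:
        priors = {a: 1.0 / len(authors) for a in authors}
    log_joint = {a: log_likelihoods[a] + math.log(priors[a]) for a in authors}
    # Log-sum-exp normalization for numerical stability.
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return {a: math.exp(log_joint[a] - m) / z for a in authors}
```

The attributed author is then simply the argmax of the returned posterior.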
arXiv Detail & Related papers (2024-09-19T08:41:21Z)
- InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning [58.7966588457529]
InfiMM-WebMath-40B is a high-quality dataset of interleaved image-text documents.
It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl.
Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model.
Our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [103.0865116794534]
We introduce large models into the data collection pipeline to guide the generation of domain-specific information. We refer to this approach as Retrieve-from-CC. It not only collects data related to domain-specific knowledge but also mines the data containing potential reasoning procedures from the public corpus.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
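The macro-average F1 reported above weights every language equally, so performance on the many low-resource languages among the 201 counts as much as performance on English. A minimal sketch of how that metric is computed from per-language confusion counts:

```python
def macro_f1(per_language_counts):
    """Macro-averaged F1: per-language F1 scores averaged with equal weight.

    per_language_counts: dict lang -> (true_positives, false_positives,
                                       false_negatives).
    """
    f1s = []
    for tp, fp, fn in per_language_counts.values():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

With a micro average instead, high-resource languages would dominate the score, which is why LID work typically reports the macro variant.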
arXiv Detail & Related papers (2021-09-07T03:59:22Z)
- Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences arising from their use.