CCpdf: Building a High Quality Corpus for Visually Rich Documents from
Web Crawl Data
- URL: http://arxiv.org/abs/2304.14953v2
- Date: Tue, 6 Jun 2023 07:35:17 GMT
- Title: CCpdf: Building a High Quality Corpus for Visually Rich Documents from
Web Crawl Data
- Authors: Michał Turski, Tomasz Stanisławek, Karol Kaczmarek, Paweł Dyda,
and Filip Graliński
- Abstract summary: This paper proposes an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl.
We also share the CCpdf corpus in the form of an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining.
- Score: 2.7843134136364265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the field of document understanding has
progressed considerably. A significant part of this progress has been possible
thanks to the use of language models pretrained on large amounts of documents.
However, the pretraining corpora used in the domain of document understanding
are single-domain, monolingual, or non-public. Our goal in this paper is to
propose an efficient pipeline for creating a large-scale, diverse,
multilingual corpus of PDF files from all over the Internet using Common
Crawl, as PDF files are the most canonical type of document considered in
document understanding. We extensively analysed all of the steps of the
pipeline and propose a solution which is a trade-off between data quality and
processing time. We also share the CCpdf corpus in the form of an index of PDF
files, along with a script for downloading them, which produces a collection
useful for language model pretraining. The dataset and tools published with
this paper offer researchers the opportunity to develop even better
multilingual language models.
Related papers
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow, OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language-aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Towards a Cleaner Document-Oriented Multilingual Crawled Corpus [2.1028463367241033]
This paper takes the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level.
We propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR that could prove more suitable to pre-train large generative language models.
arXiv Detail & Related papers (2022-01-17T22:12:59Z)
- Documenting the English Colossal Clean Crawled Corpus [28.008953329187648]
This work provides the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl.
We begin with a high-level summary of the data, including distributions of where the text came from and when it was written.
We then give more detailed analysis on salient parts of this data, including the most frequent sources of text.
arXiv Detail & Related papers (2021-04-18T07:42:52Z)
- Robust PDF Document Conversion Using Recurrent Neural Networks [0.0]
We present a novel approach to document structure recovery in PDF using recurrent neural networks.
We show how a sequence of PDF printing commands can be used as input into a neural network.
We implement a model that yields a weighted average F1 score of 97% across 17 distinct structural labels.
arXiv Detail & Related papers (2021-02-18T14:39:54Z)
- Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies [0.0]
This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora.
The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels.
Experiments performed on the English, Spanish and French editions of the JRC-Acquis corpus reveal promising results on classifying and sorting documents by similar content.
arXiv Detail & Related papers (2020-12-15T10:42:40Z)