Related papers: olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

URL: http://arxiv.org/abs/2502.18443v1
Date: Tue, 25 Feb 2025 18:38:38 GMT
Title: olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Authors: Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, Luca Soldaini,
Abstract summary: olmOCR is an open-source Python toolkit for processing PDFs into clean, linearized plain text in natural reading order.<n>Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on a sample of 260,000 pages from over 100,000 crawled PDFs.<n> olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and convert a million PDF pages for only $190 USD.
Score: 17.018144344175973
License: http://creativecommons.org/licenses/by/4.0/
Abstract: PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. We present olmOCR, an open-source Python toolkit for processing PDFs into clean, linearized plain text in natural reading order while preserving structured content like sections, tables, lists, equations, and more. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on a sample of 260,000 pages from over 100,000 crawled PDFs with diverse properties, including graphics, handwritten text and poor quality scans. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and convert a million PDF pages for only $190 USD. We release all components of olmOCR including VLM weights, data and training code, as well as inference code built on serving frameworks including vLLM and SGLang.

Related papers

PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task to process and comprehend large amounts of textual and visual information.<n>Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.<n>We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. Our model is capable of processing images with resolutions up to $1008times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Our dataset has 15 times larger scales while maintaining good data quality. We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document [60.01330653769726]
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions. By expanding its capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability.
arXiv Detail & Related papers (2024-03-07T13:16:24Z)
EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package. It meets both the computational and sample efficiency requirements for liberating texts at scale. EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model [108.85584502396182]
We propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM) By leveraging the shallow text recognition ability of the MLLM, we only finetuned 1.2% parameters. Our single model achieves state-of-the-art ocr-free performance in 8 out of 10 visually-situated language understanding tasks.
arXiv Detail & Related papers (2023-10-08T11:33:09Z)
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extract, analyze and comprehend information from digital documents, such as a web page. Existing Multi-model Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data [2.7843134136364265]
This paper proposes an efficient pipeline for creating a big-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl. We also share a CCpdf corpus in a form or an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining.
arXiv Detail & Related papers (2023-04-28T16:12:18Z)
Robust PDF Document Conversion Using Recurrent Neural Networks [0.0]
We present a novel approach to document structure recovery in PDF using recurrent neural networks. We show how a sequence of PDF printing commands can be used as input into a neural network. We implement a model that yields a weighted average F1 score of 97% across 17 distinct structural labels.
arXiv Detail & Related papers (2021-02-18T14:39:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.