SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
- URL: http://arxiv.org/abs/2503.11576v1
- Date: Fri, 14 Mar 2025 16:44:14 GMT
- Title: SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
- Authors: Ahmed Nassar, Andres Marafioti, Matteo Omenetti, Maksym Lysak, Nikolaos Livathinos, Christoph Auer, Lucas Morin, Rafael Teixeira de Lima, Yusik Kim, A. Said Gurbuz, Michele Dolfi, Miquel Farré, Peter W. J. Staar
- Abstract summary: We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more.
- Score: 9.198920557312865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms -- significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available; the datasets will be made publicly available soon.
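The abstract does not spell out how the released model is invoked, but since SmolDocling is an ordinary 256M-parameter vision-language model, a minimal inference sketch using the Hugging Face transformers API might look like the following. The checkpoint name (ds4sd/SmolDocling-256M-preview) and the prompt wording are illustrative assumptions, not details confirmed by the abstract.

```python
# Minimal sketch of page-to-DocTags conversion with SmolDocling.
# Assumptions: the checkpoint name and the instruction text are illustrative,
# not taken from the paper; the model loads like any transformers VLM.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "ds4sd/SmolDocling-256M-preview"  # assumed Hugging Face repo name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# A rendered document page as an RGB image.
page = Image.open("page.png").convert("RGB")

# Chat-style prompt with one image placeholder and a conversion instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},  # assumed instruction
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[page], return_tensors="pt")

# Generate the DocTags sequence for the whole page and decode only the new tokens.
output_ids = model.generate(**inputs, max_new_tokens=4096)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
doctags = processor.batch_decode(new_tokens, skip_special_tokens=False)[0]
print(doctags)  # markup with element tags and location tokens
```

The exact DocTags vocabulary (element tags and location tokens) is defined by the model release itself, so the decoded string above should be treated as opaque markup to be parsed by the accompanying tooling.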
Related papers
- Relation-Rich Visual Document Generator for Visual Information Extraction [12.4941229258054]
We propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach.
Our method significantly enhances the performance of document understanding models on various VIE benchmarks.
arXiv Detail & Related papers (2025-04-14T19:19:26Z) - M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in the document.
Existing document understanding benchmarks often assess LVLMs using question-answer formats.
We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench)
M-DocSum-Bench comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z) - Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence [88.74800617923083]
We introduce Granite Vision, a lightweight large language model with vision capabilities. Our model is trained on a comprehensive instruction-following dataset. Granite Vision achieves strong results in standard benchmarks related to visual document understanding.
arXiv Detail & Related papers (2025-02-14T05:36:32Z) - Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction [23.47150047875133]
Document parsing is essential for converting unstructured and semi-structured documents into machine-readable data.
Document parsing plays an indispensable role in both knowledge base construction and training data generation.
This paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts.
arXiv Detail & Related papers (2024-10-28T16:11:35Z) - PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z) - mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens.
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
arXiv Detail & Related papers (2024-09-05T11:09:00Z) - DocLLM: A layout-aware generative language model for multimodal document understanding [12.093889265216205]
We present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents.
Our model focuses exclusively on bounding box information to incorporate the spatial layout structure.
We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
arXiv Detail & Related papers (2023-12-31T22:37:52Z) - A Multi-Modal Multilingual Benchmark for Document Image Classification [21.7518357653137]
We introduce two newly curated multilingual datasets WIKI-DOC and MULTIEUR-DOCLEX.
We study popular visually-rich document understanding (Document AI) models in a previously untested setting, document image classification.
Experimental results show limitations of multilingual Document AI models on cross-lingual transfer across typologically distant languages.
arXiv Detail & Related papers (2023-10-25T04:35:06Z) - Enhancing Visually-Rich Document Understanding via Layout Structure Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z) - mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z) - DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z) - XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z) - DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis [2.9923891863939938]
Document layout analysis is a key requirement for high-quality PDF document conversion.
Deep-learning models have proven to be very effective at layout detection and segmentation.
We present DocLayNet, a new, publicly available, document-layout annotation dataset.
arXiv Detail & Related papers (2022-06-02T14:25:12Z)