Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion
- URL: http://arxiv.org/abs/2501.17887v1
- Date: Mon, 27 Jan 2025 19:40:00 GMT
- Title: Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion
- Authors: Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, Shubham Gupta, Rafael Teixeira de Lima, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, Peter W. J. Staar
- Abstract summary: Docling is an easy-to-use, self-contained, MIT-licensed, open-source toolkit for document conversion.
It can parse several types of popular document formats into a unified, richly structured representation.
Docling is released as a Python package and can be used as a Python API or as a CLI tool.
- Abstract: We introduce Docling, an easy-to-use, self-contained, MIT-licensed, open-source toolkit for document conversion, that can parse several types of popular document formats into a unified, richly structured representation. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. Docling is released as a Python package and can be used as a Python API or as a CLI tool. Docling's modular architecture and efficient document representation make it easy to implement extensions, new features, models, and customizations. Docling has been already integrated in other popular open-source frameworks (e.g., LangChain, LlamaIndex, spaCy), making it a natural fit for the processing of documents and the development of high-end applications. The open-source community has fully engaged in using, promoting, and developing for Docling, which gathered 10k stars on GitHub in less than a month and was reported as the No. 1 trending repository in GitHub worldwide in November 2024.
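The abstract describes parsing heterogeneous formats into a "unified, richly structured representation" that other tools can consume. The sketch below illustrates that idea in plain Python: a format-independent tree of typed items with a Markdown export. All names here (`DocItem`, `export_to_markdown`) are illustrative assumptions, not Docling's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a "unified, richly structured representation":
# a document is a tree of typed items (headings, paragraphs, tables, ...)
# that is independent of the input format (PDF, DOCX, HTML, ...).

@dataclass
class DocItem:
    kind: str                       # "root", "heading", "paragraph", ...
    text: str = ""
    level: int = 0                  # heading level, if applicable
    children: list["DocItem"] = field(default_factory=list)

def export_to_markdown(item: DocItem) -> str:
    """Serialize the unified tree to Markdown, one common export target."""
    if item.kind == "heading":
        lines = ["#" * max(item.level, 1) + " " + item.text]
    elif item.kind == "paragraph":
        lines = [item.text]
    else:  # container/root node: no text of its own
        lines = []
    for child in item.children:
        lines.append(export_to_markdown(child))
    return "\n\n".join(lines)

doc = DocItem(kind="root", children=[
    DocItem(kind="heading", text="Results", level=2),
    DocItem(kind="paragraph", text="Tables were recognized by TableFormer."),
])
print(export_to_markdown(doc))
```

Because every input format lands in the same tree, exporters (Markdown, JSON, plain text) and downstream consumers such as LangChain or LlamaIndex only need to handle one representation.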
Related papers
- BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks [55.61185100263898]
We introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks.
We also introduce BigDocs-Bench, a benchmark suite with 10 novel tasks.
Our experiments show that training with BigDocs-Bench improves average performance by up to 25.8% over closed-source GPT-4o.
arXiv Detail & Related papers (2024-12-05T21:41:20Z) - Docling Technical Report [19.80268711310715]
Docling is an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion.
It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer)
arXiv Detail & Related papers (2024-08-19T10:20:06Z) - pyvene: A Library for Understanding and Improving PyTorch Models via Interventions [79.72930339711478]
pyvene is an open-source library that supports customizable interventions on a range of different PyTorch modules.
We show how pyvene provides a unified framework for performing interventions on neural models and sharing the intervened-upon models with others.
arXiv Detail & Related papers (2024-03-12T16:46:54Z) - DocXChain: A Powerful Open-Source Toolchain for Document Parsing and Beyond [17.853066545805554]
DocXChain is a powerful open-source toolchain for document parsing.
It automatically converts the rich information embodied in unstructured documents, such as text, tables and charts, into structured representations.
arXiv Detail & Related papers (2023-10-19T02:49:09Z) - Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
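The two-step GenRead pipeline described above can be sketched as follows. The `generate_context` and `read_answer` functions here are stubs standing in for real LLM calls; their names and the canned lookup are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of generate-then-read (GenRead): instead of retrieving
# documents, step 1 asks a language model to *generate* a context document,
# and step 2 reads that generated context to produce the answer.

def generate_context(question: str) -> str:
    # Stub: a real system would prompt an LLM with the question here.
    canned = {
        "capital of france": "Paris is the capital and largest city of France.",
    }
    for key, doc in canned.items():
        if key in question.lower():
            return doc
    return ""

def read_answer(question: str, context: str) -> str:
    # Stub "reader": return the first capitalized token of the context.
    for token in context.split():
        if token[:1].isupper():
            return token.rstrip(".,")
    return "unknown"

def genread(question: str) -> str:
    context = generate_context(question)    # step 1: generate, not retrieve
    return read_answer(question, context)   # step 2: read generated context

print(genread("What is the capital of France?"))  # → Paris
```

The key design point is that the retriever is replaced entirely: the only "corpus" consulted at answer time is text the model itself produced.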
arXiv Detail & Related papers (2022-09-21T01:30:59Z) - DocCoder: Generating Code by Retrieving and Reading Docs [87.88474546826913]
We introduce DocCoder, an approach that explicitly leverages code manuals and documentation.
Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model.
arXiv Detail & Related papers (2022-07-13T06:47:51Z) - You Only Write Thrice: Creating Documents, Computational Notebooks and Presentations From a Single Source [11.472707084860875]
The academic trade requires juggling multiple variants of the same content published in different formats.
We propose to significantly reduce this burden by maintaining a single source document in a version-controlled environment.
We offer a proof-of-concept workflow that composes Jupyter Book (an online document), Jupyter Notebook (a computational narrative) and reveal.js slides from a single markdown source file.
arXiv Detail & Related papers (2021-07-02T21:02:09Z) - Doc2Dict: Information Extraction as Text Generation [0.0]
Doc2Dict is a pipeline for extracting document-level information.
We train a language model on existing database records to generate structured spans.
We use checkpointing and chunked encoding to apply our method to sequences of up to 32,000 tokens on a single GPU.
arXiv Detail & Related papers (2021-05-16T20:46:29Z) - DocOIE: A Document-level Context-Aware Dataset for OpenIE [22.544165148622422]
Open Information Extraction (OpenIE) aims to extract structured relational tuples from sentences.
Existing solutions perform extraction at sentence level, without referring to any additional contextual information.
We propose DocIE, a novel document-level context-aware OpenIE model.
arXiv Detail & Related papers (2021-05-10T11:14:30Z) - DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.