Arctic-TILT. Business Document Understanding at Sub-Billion Scale
- URL: http://arxiv.org/abs/2408.04632v1
- Date: Thu, 8 Aug 2024 17:59:46 GMT
- Title: Arctic-TILT. Business Document Understanding at Sub-Billion Scale
- Authors: Łukasz Borchmann, Michał Pietruszka, Wojciech Jaśkowski, Dawid Jurkiewicz, Piotr Halama, Paweł Józiak, Łukasz Garncarek, Paweł Liskowski, Karolina Szyndler, Andrzej Gretkowski, Julita Ołtusek, Gabriela Nowakowska, Artur Zawłocki, Łukasz Duhr, Paweł Dyda, Michał Turski
- Abstract summary: We introduce Arctic-TILT, which achieves accuracy on par with models 1000$\times$ its size on these use cases.
It can be fine-tuned and deployed on a single 24GB GPU, lowering operational costs while processing Visually Rich Documents with up to 400k tokens.
The model establishes state-of-the-art results on seven diverse Document Understanding benchmarks, and provides reliable confidence scores and quick inference.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The vast portion of workloads employing LLMs involves answering questions grounded on PDF or scan content. We introduce the Arctic-TILT achieving accuracy on par with models 1000$\times$ its size on these use cases. It can be fine-tuned and deployed on a single 24GB GPU, lowering operational costs while processing Visually Rich Documents with up to 400k tokens. The model establishes state-of-the-art results on seven diverse Document Understanding benchmarks, as well as provides reliable confidence scores and quick inference, which are essential for processing files in large-scale or time-sensitive enterprise environments.
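Reliable confidence scores are what let a pipeline decide which extractions to auto-accept. As a minimal sketch (not Arctic-TILT's actual calibration method), one can length-normalize the token log-probabilities of a generated answer and route on a threshold; the helper names and threshold below are hypothetical.

```python
import math

def answer_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean probability of the generated answer tokens.

    A common length-normalized heuristic; real systems calibrate this
    against held-out accuracy (e.g. with isotonic regression).
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def route(answer: str, token_logprobs: list[float], threshold: float = 0.85):
    # Hypothetical routing rule: auto-accept confident answers,
    # send the rest to a human-in-the-loop review queue.
    conf = answer_confidence(token_logprobs)
    return ("auto_accept" if conf >= threshold else "human_review", conf)
```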
Related papers
- Multi-Stage Field Extraction of Financial Documents with OCR and Compact Vision-Language Models
Financial documents are essential sources of information for regulators, auditors, and financial institutions. These documents tend to be heterogeneous, mixing narratives, tables, figures, and multilingual content within the same report. We propose a multistage pipeline that leverages traditional image processing models and OCR extraction, together with compact VLMs for structured field extraction.
arXiv Detail & Related papers (2025-10-27T06:56:08Z)
- Advanced Layout Analysis Models for Docling
We introduce five new document layout models achieving a 20.6-23.9% mAP improvement over Docling's previous baseline. Our best model, "heron-101", attains 78% mAP with 28 ms/image inference time on a single NVIDIA A100 GPU. All trained checkpoints, code, and documentation are released under a permissive license on HuggingFace.
arXiv Detail & Related papers (2025-09-15T09:20:11Z)
- Low-Resource Fine-Tuning for Multi-Task Structured Information Extraction with a Billion-Parameter Instruction-Tuned Model
Deploying large language models (LLMs) for structured data extraction in domains such as financial compliance reporting is often impractical for smaller teams due to the high cost of running large architectures and the difficulty of preparing large, high-quality datasets. This work presents a billion-parameter LLaMA-based model fine-tuned with low-rank adaptation on only a few hundred samples per task for extraction, knowledge graph extraction, and named entity recognition. These findings demonstrate that well-tuned small models can deliver stable and accurate structured outputs at a fraction of the computational cost, enabling cost-effective and reliable information extraction pipelines in resource-constrained settings.
arXiv Detail & Related papers (2025-09-10T08:19:07Z)
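A hedged sketch of the recipe the entry above describes: low-rank adaptation of a roughly billion-parameter causal LM with the Hugging Face `peft` library. The checkpoint name, rank, and target modules are illustrative assumptions, not the paper's reported configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "meta-llama/Llama-3.2-1B"  # assumed ~1B checkpoint, not the paper's
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters on the attention projections; the base model stays
# frozen, which keeps memory low enough for a single consumer GPU.
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically <1% of parameters train
```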
- Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark
Document Haystack is a benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long documents. It features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image "needles" at various depths within the documents.
arXiv Detail & Related papers (2025-07-18T19:33:15Z)
- WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts
This paper introduces WikiMixQA, a benchmark for evaluating cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve 70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required.
arXiv Detail & Related papers (2025-06-18T16:09:18Z)
- WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. Evaluation of state-of-the-art MLLMs on WildDoc exposes substantial performance declines and underscores the models' inadequate robustness compared to traditional benchmarks.
arXiv Detail & Related papers (2025-05-16T09:09:46Z)
- M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in documents.
Existing document understanding benchmarks often assess LVLMs using question-answer formats.
We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench)
M-DocSum-Bench comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z)
- Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
We propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup.
Experiments show our approach outperforms both RAG and task-agnostic compression methods.
A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad knowledge tasks.
arXiv Detail & Related papers (2025-03-06T21:07:41Z)
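As a toy illustration of the general principle behind task-aware KV cache compression (not the paper's algorithm): score cached positions against a task-derived query and keep only the most relevant fraction.

```python
import numpy as np

def prune_kv_cache(keys, values, task_query, keep_ratio=0.25):
    """Keep the cached positions most attended by a task-specific query.

    keys, values: (seq_len, d) arrays; task_query: (d,) array.
    """
    scores = keys @ task_query / np.sqrt(keys.shape[-1])
    k = max(1, int(len(keys) * keep_ratio))
    idx = np.sort(np.argpartition(scores, -k)[-k:])  # top-k, original order
    return keys[idx], values[idx]

# Usage: compress a 1000-position cache down to 250 positions.
keys, values = np.random.randn(1000, 64), np.random.randn(1000, 64)
small_k, small_v = prune_kv_cache(keys, values, np.random.randn(64))
```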
- M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
We introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models.
We propose a retrieval-aware tuning approach for efficient and effective multimodal document reading.
arXiv Detail & Related papers (2024-11-09T13:30:38Z)
- LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding
Large multimodal models (LMMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents.
We present a novel framework named LoRA-Contextualizing Adaptation of Large multimodal models (LoCAL) which broadens the capabilities of any LMM to support long-document understanding.
arXiv Detail & Related papers (2024-11-02T02:09:01Z)
- MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many vision-language tasks.
Existing benchmarks either contain limited fine-grained evaluation samples mixed with other data, or are confined to object-level assessments in natural images.
We propose using document images with multi-granularity and multi-modal information to supplement natural images.
arXiv Detail & Related papers (2024-10-25T16:00:55Z)
- CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets.
We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
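The similarity-based retrieval step the CRAFT entry above mentions is commonly implemented with embedding search. A minimal sketch using `sentence-transformers` follows; the model name and toy corpus are placeholders, and it covers only the retrieval stage, not CRAFT's full augmentation loop.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

corpus = [
    "Quarterly revenue grew 12% driven by cloud services.",
    "The recipe calls for two cups of flour and one egg.",
    "Net income guidance was revised downward for FY2025.",
]
few_shot_seeds = ["Extract financial performance figures from reports."]

# Embed once, then rank corpus documents by cosine similarity to the seeds.
corpus_emb = model.encode(corpus, convert_to_tensor=True)
seed_emb = model.encode(few_shot_seeds, convert_to_tensor=True)
hits = util.semantic_search(seed_emb, corpus_emb, top_k=2)[0]
relevant = [corpus[h["corpus_id"]] for h in hits]  # candidates for augmentation
```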
- MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions.
It is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens.
Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models.
arXiv Detail & Related papers (2024-07-01T17:59:26Z)
- $\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens
There is currently a lack of a standardized benchmark to evaluate this long-context capability.
$\infty$Bench is the first benchmark featuring an average data length surpassing 100K tokens.
The results indicate that existing long context LLMs still require significant advancements to effectively process 100K+ context.
arXiv Detail & Related papers (2024-02-21T11:30:29Z)
- Drilling Down into the Discourse Structure with LLMs for Long Document Question Answering
We propose a suite of techniques that exploit the discourse structure commonly found in documents.
We show how our approach can be combined with a self-ask reasoning agent to achieve the best zero-shot performance in complex multi-hop question answering.
arXiv Detail & Related papers (2023-11-22T18:22:56Z)
- Multimodal Document Analytics for Banking Process Automation
The paper contributes original empirical evidence on the effectiveness and efficiency of multimodal models for document processing in the banking business.
It offers practical guidance on how to unlock this potential in day-to-day operations.
arXiv Detail & Related papers (2023-07-21T18:29:04Z)
- Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
EVAPORATE is a prototype system powered by large language models (LLMs).
Code synthesis is cheap, but far less accurate than directly processing each document with the LLM.
We propose an extended code implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
arXiv Detail & Related papers (2023-04-19T06:00:26Z)
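The cost/accuracy trade-off the EVAPORATE entry above describes can be seen in miniature: synthesizing one reusable extraction function costs a single LLM call, while direct extraction calls the model once per document. In this hedged sketch, `llm()` is a stub standing in for any completion API, and the documents and regex are toy data.

```python
import re

def llm(prompt: str) -> str:
    """Stub for a real completion API (assumption); returns a canned regex."""
    return r"Total:\s*\$([\d,]+\.\d{2})"

# Code synthesis: ONE llm call yields an extractor reused across the lake.
docs = ["Invoice 17\nTotal: $1,204.50", "Invoice 18\nTotal: $88.00"]
pattern = llm(f"Write a regex extracting the total from docs like:\n{docs[0]}")
totals = [m.group(1) for d in docs if (m := re.search(pattern, d))]
print(totals)  # ['1,204.50', '88.00']

# Direct extraction would instead call llm() once per document:
# more accurate, but cost grows linearly with corpus size.
```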
- HADES: Homologous Automated Document Exploration and Summarization
HADES is designed to streamline the work of professionals dealing with large volumes of documents.
The tool employs a multi-step pipeline that begins with processing PDF documents using topic modeling, summarization, and analysis of the most important words for each topic.
arXiv Detail & Related papers (2023-02-25T15:16:10Z)
- The Law of Large Documents: Understanding the Structure of Legal Contracts Using Visual Cues
We measure the impact of incorporating visual cues, obtained via computer vision methods, on the accuracy of document understanding tasks.
Our method of segmenting documents based on structural metadata out-performs existing methods on four long-document understanding tasks.
arXiv Detail & Related papers (2021-07-16T21:21:50Z)