Related papers: Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness

URL: http://arxiv.org/abs/2206.00785v1
Date: Wed, 1 Jun 2022 22:30:30 GMT
Title: Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness
Authors: Christoph Auer (1), Michele Dolfi (1), André Carvalho (2), Cesar Berrospi Ramis (1), Peter W. J. Staar (1) ((1) IBM Research, (2) SoftINSA Lda.)
Abstract summary: We outline the requirements, design, and implementation choices of our document conversion service and reflect on the challenges we faced. Our best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as Optical Character Recognition (OCR), layout analysis, table-structure recovery, figure understanding, etc. We observe the adoption of such methods in document understanding solutions offered by all major cloud providers. Yet, publications outlining how such services are designed and optimized to scale in the cloud are scarce. In this paper, we focus on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with a strong reliance on machine-learning methods on cloud infrastructure. Our key objective is to achieve high scalability and responsiveness for different workload profiles in a well-defined resource budget. We outline the requirements, design, and implementation choices of our document conversion service and reflect on the challenges we faced. Evidence for the scaling behavior and resource efficiency is provided for two alternative workload distribution strategies and deployment configurations. Our best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.
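The headline throughput figure can be broken down per core and per node with simple arithmetic. The sketch below uses only the three numbers quoted in the abstract and treats "over one million" as a lower bound of exactly 1,000,000 pages per hour, so the derived values are approximate bounds rather than measurements reported by the authors. The variable names are illustrative.

```python
# Figures quoted in the abstract; "over one million" is taken as a lower bound.
PAGES_PER_HOUR = 1_000_000
CPU_CORES = 3072
NODES = 192

# Cores available on each node, assuming cores are spread evenly across nodes.
cores_per_node = CPU_CORES // NODES

# Lower bounds on sustained throughput per core and per node.
pages_per_core_hour = PAGES_PER_HOUR / CPU_CORES
pages_per_node_hour = PAGES_PER_HOUR / NODES

# Upper bound on the average core time spent on one page.
core_seconds_per_page = 3600 * CPU_CORES / PAGES_PER_HOUR

print(f"cores per node:          {cores_per_node}")
print(f"pages per core per hour: {pages_per_core_hour:.1f}")
print(f"pages per node per hour: {pages_per_node_hour:.1f}")
print(f"core-seconds per page:   {core_seconds_per_page:.2f}")
```

Under these assumptions each node contributes 16 cores, and the average cost comes out at roughly 11 core-seconds per page.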

Related papers

URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding [55.45331924836242]
We present URaG, a framework that Unifies Retrieval and Generation within a single MLLM. We show that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.
arXiv Detail & Related papers (2025-11-13T17:54:09Z)
Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task [11.672798725644121]
This work strategically combines OCR engines with Large Language Models (LLMs) to optimize the accuracy-efficiency trade-off inherent in repetitive document extraction tasks. We implement and evaluate 25 configurations across three extraction paradigms (direct, replacement, and table-based) on identity documents spanning four formats.
arXiv Detail & Related papers (2025-10-11T09:40:34Z)
Transformer-Gather, Fuzzy-Reconsider: A Scalable Hybrid Framework for Entity Resolution [0.0]
We introduce a scalable hybrid framework, which is designed to address several important problems. We utilize a pre-trained language model to encode each structured data into corresponding semantic embedding vectors. After retrieving a semantically relevant subset of candidates, we apply a syntactic verification stage.
arXiv Detail & Related papers (2025-09-22T08:05:44Z)
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks including data analysis, annotation, and foundation model post-training. It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs). This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z)
Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation [49.36436704082436]
How-to questions are integral to decision-making processes and require dynamic, step-by-step answers. We propose Thread, a novel data organization paradigm aimed at enabling current systems to handle how-to questions more effectively.
arXiv Detail & Related papers (2024-06-19T09:14:41Z)
Information Extraction from Unstructured data using Augmented-AI and Computer Vision [0.0]
This paper presents a framework for information extraction that combines Augmented Intelligence (A2I) with computer vision and natural language processing techniques. Our approach addresses the limitations of conventional methods by leveraging deep learning architectures for object detection. The proposed methodology demonstrates improved accuracy and efficiency in extracting structured information from diverse document formats.
arXiv Detail & Related papers (2023-12-15T15:27:41Z)
On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications. FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER. We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
Data Efficient Training of a U-Net Based Architecture for Structured Documents Localization [0.0]
We propose SDL-Net: a novel U-Net like encoder-decoder architecture for the localization of structured documents. Our approach allows pre-training the encoder of SDL-Net on a generic dataset containing samples of various document classes.
arXiv Detail & Related papers (2023-10-02T07:05:19Z)
Data-Efficient Information Extraction from Form-Like Documents [14.567098292973075]
A key challenge is that form-like documents can be laid out in virtually infinitely many ways. Data efficiency is critical to enable information extraction systems to scale to hundreds of different document types.
arXiv Detail & Related papers (2022-01-07T19:16:49Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using a basic cross-platform tensor framework and script language engines. This approach, however, does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
Donut: Document Understanding Transformer without OCR [17.397447819420695]
We propose a novel VDU model that is end-to-end trainable without an underpinning OCR framework. Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
arXiv Detail & Related papers (2021-11-30T18:55:19Z)
One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios. Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents. We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
A Privacy-Preserving Distributed Architecture for Deep-Learning-as-a-Service [68.84245063902908]
This paper introduces a novel distributed architecture for deep-learning-as-a-service. It is able to preserve the user's sensitive data while providing Cloud-based machine and deep learning services.
arXiv Detail & Related papers (2020-03-30T15:12:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.