Delivering Document Conversion as a Cloud Service with High Throughput
and Responsiveness
- URL: http://arxiv.org/abs/2206.00785v1
- Date: Wed, 1 Jun 2022 22:30:30 GMT
- Title: Delivering Document Conversion as a Cloud Service with High Throughput
and Responsiveness
- Authors: Christoph Auer (1), Michele Dolfi (1), André Carvalho (2), Cesar
Berrospi Ramis (1), Peter W. J. Staar (1) ((1) IBM Research, (2) SoftINSA
Lda.)
- Abstract summary: We outline the requirements, design, and implementation choices of our document conversion service and reflect on the challenges we faced.
Our best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document understanding is a key business process in the data-driven economy
since documents are central to knowledge discovery and business insights.
Converting documents into a machine-processable format is a particular
challenge here due to their huge variability in formats and complex structure.
Accordingly, many algorithms and machine-learning methods emerged to solve
particular tasks such as Optical Character Recognition (OCR), layout analysis,
table-structure recovery, figure understanding, etc. We observe the adoption of
such methods in document understanding solutions offered by all major cloud
providers. Yet, publications outlining how such services are designed and
optimized to scale in the cloud are scarce. In this paper, we focus on the case
of document conversion to illustrate the particular challenges of scaling a
complex data processing pipeline with a strong reliance on machine-learning
methods on cloud infrastructure. Our key objective is to achieve high
scalability and responsiveness for different workload profiles in a
well-defined resource budget. We outline the requirements, design, and
implementation choices of our document conversion service and reflect on the
challenges we faced. Evidence for the scaling behavior and resource efficiency
is provided for two alternative workload distribution strategies and deployment
configurations. Our best-performing method achieves sustained throughput of
over one million PDF pages per hour on 3072 CPU cores across 192 nodes.
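
To put the headline figure in perspective, the aggregate throughput can be broken down into per-core and per-node rates. The sketch below is back-of-the-envelope arithmetic only, not a result reported by the authors. It assumes the cores are spread evenly across the nodes and treats "over one million" as exactly one million pages per hour, so the derived rates are lower bounds and the per-page cost is an upper bound. The function name `throughput_breakdown` and the dictionary keys are hypothetical, chosen for illustration.

```python
# Figures quoted in the abstract. "Over one million" is treated as a lower bound.
PAGES_PER_HOUR = 1_000_000
CPU_CORES = 3072
NODES = 192


def throughput_breakdown(pages_per_hour: int, cores: int, nodes: int) -> dict:
    """Derive per-unit rates from an aggregate throughput figure.

    Assumes cores are distributed evenly across nodes (an assumption,
    not something the abstract states explicitly).
    """
    return {
        "cores_per_node": cores / nodes,
        "pages_per_second": pages_per_hour / 3600,
        "pages_per_core_hour": pages_per_hour / cores,
        "pages_per_node_hour": pages_per_hour / nodes,
        # Average CPU time budget consumed per page across the cluster.
        "core_seconds_per_page": cores * 3600 / pages_per_hour,
    }


if __name__ == "__main__":
    breakdown = throughput_breakdown(PAGES_PER_HOUR, CPU_CORES, NODES)
    for name, value in breakdown.items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the cluster has 16 cores per node, each core converts roughly 325 pages per hour, and a page costs at most about 11 core-seconds on average.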