AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
- URL: http://arxiv.org/abs/2505.01435v1
- Date: Wed, 23 Apr 2025 18:38:41 GMT
- Title: AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
- Authors: Carlo Siebenschuh, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Arham Khan, Khalid Hossain, Yadu Babuji, Nicholas Chia, Venkatram Vishwanath, Rick Stevens, Arvind Ramanathan, Ian Foster, Robert Underwood
- Abstract summary: We introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We show that AdaParse, when compared to state-of-the-art parsers, improves throughput by $17\times$ while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents.
- Score: 33.22885510488797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by $17\times$ while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at https://github.com/7shoe/AdaParse/
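To make the selection policy concrete, the sketch below shows one way a cost/accuracy-aware parser assignment could look. It is an illustration, not AdaParse's implementation: the parser names, costs, threshold, and accuracy predictors are invented for the example, and in AdaParse itself the accuracy prediction comes from the DPO-aligned model described above.

```python
# Illustrative sketch of adaptive parser selection (not AdaParse's actual code):
# pick the cheapest parser whose predicted accuracy clears a quality threshold,
# falling back to the most accurate parser otherwise.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ParserProfile:
    name: str
    cost: float                                # relative compute cost per document
    predict_accuracy: Callable[[Dict], float]  # accuracy predictor (DPO-aligned in AdaParse)

def select_parser(doc_features: Dict, parsers: List[ParserProfile],
                  threshold: float = 0.95) -> str:
    """Return the name of the cheapest parser expected to meet the threshold."""
    viable = [p for p in parsers if p.predict_accuracy(doc_features) >= threshold]
    if viable:
        return min(viable, key=lambda p: p.cost).name
    # No inexpensive parser is predicted to suffice: use the most accurate one.
    return max(parsers, key=lambda p: p.predict_accuracy(doc_features)).name

# Hypothetical profiles: a fast heuristic parser vs. a slow ML-driven parser.
parsers = [
    ParserProfile("heuristic", cost=1.0,
                  predict_accuracy=lambda f: 0.97 if f["simple_layout"] else 0.70),
    ParserProfile("ml_parser", cost=50.0, predict_accuracy=lambda f: 0.96),
]
print(select_parser({"simple_layout": True}, parsers))   # -> heuristic
print(select_parser({"simple_layout": False}, parsers))  # -> ml_parser
```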
Related papers
- AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing [82.33075210051129]
We introduce AceParse, the first comprehensive dataset designed to support the parsing of structured texts. Based on AceParse, we fine-tuned a multimodal model, named Ace, which accurately parses various structured texts. This model outperforms the previous state-of-the-art by 4.1% in F1 score and by 5% in Jaccard similarity.
arXiv Detail & Related papers (2024-09-16T06:06:34Z)
- Automatic Prediction of the Performance of Every Parser [0.0]
We present a new parser performance prediction (PPP) model built on a machine translation performance prediction system (MTPPS).
This new system, MTPPS-PPP, can predict parser performance for any language and can be useful for estimating the grammatical difficulty of understanding a text.
arXiv Detail & Related papers (2024-07-06T15:49:24Z)
- Deepparse: An Extendable, and Fine-Tunable State-Of-The-Art Library for Parsing Multinational Street Addresses [0.0]
This paper presents Deepparse, an open-source, extendable, and fine-tunable Python address-parsing library released under the LGPL-3.0 licence.
It can parse addresses written in any language and following any address standard.
The library supports fine-tuning with new data to generate a custom address parser.
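As a Python library, Deepparse exposes a small programmatic interface; the snippet below is a minimal usage sketch following the pattern in its README (the model name and call signature are assumptions taken from that documentation, and pretrained weights are downloaded on first use):

```python
# Minimal usage sketch based on Deepparse's documented quick start; verify the
# exact arguments against the library's docs, as they are assumed here.
from deepparse.parser import AddressParser

# "bpemb" selects the subword-embedding pretrained model.
address_parser = AddressParser(model_type="bpemb")

parsed = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
print(parsed)  # tagged components such as street number, street name, municipality
```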
arXiv Detail & Related papers (2023-11-20T15:37:33Z)
- Evaluating the Impact of Source Code Parsers on ML4SE Models [3.699097874146491]
We evaluate two models, namely Supernorm2Seq and TreeLSTM, on the name prediction task.
We show that trees built by different parsers vary in their structure and content.
We then analyze how this diversity affects the models' quality.
arXiv Detail & Related papers (2022-06-17T12:10:04Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- Strongly Incremental Constituency Parsing with Graph Neural Networks [70.16880251349093]
Parsing sentences into syntax trees can benefit downstream applications in NLP.
Transition-based parsers build trees by executing actions in a state transition system.
Existing transition-based parsers are predominantly based on the shift-reduce transition system.
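To make "executing actions in a state transition system" concrete, the toy sketch below runs a shift-reduce derivation with a hard-coded action sequence. It illustrates the general mechanism only; the paper's parser predicts actions with a graph neural network rather than following a fixed script.

```python
# Toy shift-reduce constituency parsing: SHIFT moves the next word from the
# buffer onto the stack; REDUCE(label, n) pops n items and pushes a labeled
# constituent. The action sequence here is hard-coded for illustration.
def shift(stack, buffer):
    stack.append(buffer.pop(0))

def reduce_(stack, label, n_children):
    children = [stack.pop() for _ in range(n_children)][::-1]
    stack.append((label, children))

stack, buffer = [], ["the", "cat", "sleeps"]
shift(stack, buffer)     # stack: ['the']
shift(stack, buffer)     # stack: ['the', 'cat']
reduce_(stack, "NP", 2)  # stack: [('NP', ['the', 'cat'])]
shift(stack, buffer)     # stack: [('NP', ...), 'sleeps']
reduce_(stack, "S", 2)   # stack: [('S', [('NP', ['the', 'cat']), 'sleeps'])]
print(stack[0])
```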
arXiv Detail & Related papers (2020-10-27T19:19:38Z)
- A Practical Chinese Dependency Parser Based on A Large-scale Dataset [21.359679124869402]
Dependency parsing is a longstanding natural language processing task, with its outputs crucial to various downstream tasks.
Recently, neural-network-based (NN-based) dependency parsing has achieved significant progress and obtained state-of-the-art results.
However, NN-based approaches require massive amounts of labeled training data, which is expensive because it requires human annotation by experts.
arXiv Detail & Related papers (2020-09-02T08:41:46Z)
- Towards Instance-Level Parser Selection for Cross-Lingual Transfer of Dependency Parsers [59.345145623931636]
We argue for a novel cross-lingual transfer paradigm: instance-level parser selection (ILPS).
We present a proof-of-concept study focused on instance-level parser selection in the framework of delexicalized transfer.
arXiv Detail & Related papers (2020-04-16T13:18:55Z)
- Bootstrapping a Crosslingual Semantic Parser [74.99223099702157]
We adapt a semantic parser trained on a single language, such as English, to new languages and multiple domains with minimal annotation.
We query whether machine translation is an adequate substitute for training data, and extend this to investigate bootstrapping using joint training with English, paraphrasing, and multilingual pre-trained models.
arXiv Detail & Related papers (2020-04-06T12:05:02Z)