AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing
- URL: http://arxiv.org/abs/2409.10016v2
- Date: Sun, 02 Feb 2025 01:48:03 GMT
- Title: AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing
- Authors: Huawei Ji, Cheng Deng, Bo Xue, Zhouyang Jin, Jiaxin Ding, Xiaoying Gan, Luoyi Fu, Xinbing Wang, Chenghu Zhou
- Abstract summary: We introduce AceParse, the first comprehensive dataset designed to support the parsing of diverse structured texts.
Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts.
This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity.
- Abstract: With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, one of the most crucial data types, is predominantly stored in PDF format and must be parsed into text before further processing. However, parsing the diverse structured texts in academic literature remains challenging due to the lack of datasets covering various text structures. In this paper, we introduce AceParse, the first comprehensive dataset designed to support the parsing of a wide range of structured texts, including formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions. Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts within academic literature. This model outperforms the previous state-of-the-art by 4.1% in F1 score and by 5% in Jaccard similarity, demonstrating the potential of multimodal models in academic literature parsing. Our dataset is available at https://github.com/JHW5981/AceParse.
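The abstract does not define the two evaluation metrics precisely; the sketch below shows one common reading, token-level F1 and set-based Jaccard similarity over whitespace-separated tokens of the predicted and reference parses. The tokenization scheme is an assumption, not the paper's specification.

```python
# Minimal sketch of the two metrics, assuming whitespace tokenization;
# the paper's exact tokenization and matching rules may differ.
from collections import Counter

def token_f1(pred, ref):
    """F1 over token multisets (overlap counted with multiplicity)."""
    pred_tokens, ref_tokens = pred.split(), ref.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def jaccard_similarity(pred, ref):
    """Jaccard index over the sets of unique tokens."""
    a, b = set(pred.split()), set(ref.split())
    return len(a & b) / len(a | b) if a | b else 1.0

# Example: comparing a predicted LaTeX parse against the ground truth.
pred = r"E = m c ^ 2"
ref = r"E = m c ^ { 2 }"
print(round(token_f1(pred, ref), 3), round(jaccard_similarity(pred, ref), 3))
```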
Related papers
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features (arXiv, 2024-12-17)
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
- VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models (arXiv, 2024-11-07)
VTechAGP is the first academic-to-general-audience text paraphrase dataset.
We also propose a novel dynamic soft prompt generative language model, DSPT5.
For training, we leverage a contrastive-generative loss function to learn the keyword in the dynamic prompt.
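The summary names a contrastive-generative loss but gives no formula; below is a minimal schematic of such an objective, combining decoder cross-entropy with an InfoNCE-style contrastive term. The temperature, weighting, and tensor names are illustrative assumptions, not DSPT5's actual implementation.

```python
# Schematic of a combined contrastive-generative objective; the paper's
# exact formulation may differ from this sketch.
import torch
import torch.nn.functional as F

def contrastive_generative_loss(dec_logits, target_ids, anchor, positive,
                                negatives, temperature=0.1, weight=0.5):
    # Generative term: token-level cross-entropy for the paraphrase decoder.
    gen = F.cross_entropy(dec_logits.view(-1, dec_logits.size(-1)),
                          target_ids.view(-1))
    # Contrastive term (InfoNCE-style): pull the dynamic-prompt keyword
    # embedding toward its positive and away from in-batch negatives.
    anchor = F.normalize(anchor, dim=-1)                        # (d,)
    cands = F.normalize(torch.cat([positive.unsqueeze(0), negatives]), dim=-1)
    sims = (cands @ anchor / temperature).unsqueeze(0)          # (1, 1+k)
    con = F.cross_entropy(sims, torch.zeros(1, dtype=torch.long))  # class 0 = positive
    return gen + weight * con

# Example shapes: vocab 32k, batch 2, length 16, embedding dim 256, 7 negatives.
loss = contrastive_generative_loss(torch.randn(2, 16, 32000),
                                   torch.randint(0, 32000, (2, 16)),
                                   torch.randn(256), torch.randn(256),
                                   torch.randn(7, 256))
print(loss.item())
```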
- A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts (arXiv, 2024-07-21)
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents.
The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian.
This is the first study to apply large language models (LLMs) to this dataset, which is sourced from prominent literary periodicals of the era.
- OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition (arXiv, 2024-03-28)
We propose a unified paradigm for parsing visually-situated text across diverse scenarios.
Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks.
In OmniParser, all tasks share a unified encoder-decoder architecture, a unified objective of point-conditioned text generation, and a unified input representation.
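As a rough illustration of how point-conditioned text generation could cover all three tasks with one decoder, the sketch below serializes each task into a shared target-sequence format; the token layout and coordinate values are hypothetical, not OmniParser's actual vocabulary.

```python
# Illustrative only: one way a single decoder could emit point-conditioned
# text for all three tasks. The special tokens and the anchor-point scheme
# are assumptions made for this sketch.
def serialize(task, items):
    parts = [f"<task:{task}>"]
    for item in items:
        x, y = item["point"]  # the anchor point that conditions the text
        parts.append(f"<pt_{x}_{y}> {item['text']}")
    return " ".join(parts)

# The same sequence format covers spotting, key information extraction,
# and table recognition:
print(serialize("spotting", [{"point": (112, 430), "text": "Total"}]))
print(serialize("kie", [{"point": (112, 455), "text": "total_amount: 12.50"}]))
print(serialize("table", [{"point": (300, 210), "text": "<cell 0 1> Qty"}]))
```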
- Improving Text Embeddings with Large Language Models (arXiv, 2023-12-31)
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
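A minimal sketch of the two-stage prompting this summary suggests, first brainstorming task definitions and then generating labeled examples for each task; the prompt wording and JSON fields are approximations, not the paper's exact templates.

```python
# Sketch of two-stage synthetic-data generation for embedding training:
# an LLM brainstorms tasks, then writes one labeled example per prompt.
def brainstorm_tasks_prompt(language):
    return (f"Brainstorm a list of potentially useful text retrieval tasks "
            f"in {language}. Output one short task description per line.")

def generate_example_prompt(task):
    return (f"You have been assigned a retrieval task: {task}\n"
            f"Write a JSON object with keys 'user_query', 'positive_document', "
            f"and 'hard_negative_document' that instantiates this task.")

# Each (query, positive, hard negative) triple then becomes a contrastive
# training example for the embedding model; no human labels are required.
print(generate_example_prompt("Find research papers that support a claim"))
```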
- unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network (arXiv, 2023-03-27)
We propose a new version of the data set unarXive.
The resulting data set comprises 1.9 M publications spanning multiple disciplines and 32 years.
In addition to the data set, we provide ready-to-use training/test data for citation recommendation and IMRaD classification.
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation (arXiv, 2021-09-02)
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks (arXiv, 2021-02-23)
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
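A schematic of the joint training described in the entry above, with two stand-in modules fitting the same seed-derived labels; the module architectures are placeholders, and the pseudo-label exchange between modules in the actual framework is omitted for brevity.

```python
# Two modules with different inductive biases trained jointly on the
# same labels; all shapes and modules here are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

text_module = nn.Linear(768, 5)   # stand-in for a text encoder + classifier
net_module = nn.Linear(128, 5)    # stand-in for a graph/network learner
optimizer = torch.optim.Adam(
    list(text_module.parameters()) + list(net_module.parameters()), lr=1e-3)

def joint_step(doc_feats, net_feats, labels):
    # Both modules fit the seed-derived labels; their losses are summed so
    # one optimizer step updates both sets of parameters.
    loss = (F.cross_entropy(text_module(doc_feats), labels)
            + F.cross_entropy(net_module(net_feats), labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: 15 documents over 5 classes (the class count is arbitrary here).
print(joint_step(torch.randn(15, 768), torch.randn(15, 128),
                 torch.arange(15) % 5))
```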
This list is automatically generated from the titles and abstracts of the papers on this site.