appjsonify: An Academic Paper PDF-to-JSON Conversion Toolkit
- URL: http://arxiv.org/abs/2310.01206v2
- Date: Tue, 3 Oct 2023 13:19:40 GMT
- Title: appjsonify: An Academic Paper PDF-to-JSON Conversion Toolkit
- Authors: Atsuki Yamaguchi, Terufumi Morishita
- Abstract summary: appjsonify is a Python-based PDF-to-JSON conversion toolkit for academic papers.
It parses a PDF file using several visual-based document layout analysis models and rule-based text processing approaches.
- Score: 9.66954231321555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present appjsonify, a Python-based PDF-to-JSON conversion toolkit for
academic papers. It parses a PDF file using several visual-based document
layout analysis models and rule-based text processing approaches. appjsonify is
a flexible tool that allows users to easily configure the processing pipeline
to handle a specific format of a paper they wish to process. We are publicly
releasing appjsonify as an easy-to-install toolkit available via PyPI and
GitHub.
Related papers
- A Comparative Study of PDF Parsing Tools Across Diverse Document Categories [0.0]
We compare 10 popular PDF parsing tools across 6 document categories using the DocLayNet dataset.
For text extraction, PyMuPDF and pypdfium generally outperformed the others, but all tools struggled with Scientific and Patent documents.
In table detection, TATR excelled in the Financial, Patent, Law & Regulations, and Scientific categories.
arXiv Detail & Related papers (2024-10-13T15:11:31Z)
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z) - PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction [0.0]
Extracting information from documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages.
Camelot and pdfplumber can extract tables only from digital PDFs.
PP-OCRV2 can comprehensively extract image-based PDFs and tables from pictures.
arXiv Detail & Related papers (2024-09-08T15:08:51Z) - Concise and Precise Context Compression for Tool-Using Language Models [60.606281074373136]
We propose two strategies for compressing tool documentation into concise and precise summary sequences for tool-using language models.
Results on API-Bank and APIBench show that our approach reaches a performance comparable to the upper-bound baseline under up to 16x compression ratio.
arXiv Detail & Related papers (2024-07-02T08:17:00Z) - pyvene: A Library for Understanding and Improving PyTorch Models via
Interventions [79.72930339711478]
pyvene is an open-source library that supports customizable interventions on a range of different PyTorch modules.
We show how pyvene provides a unified framework for performing interventions on neural models and sharing the intervened-upon models with others.
arXiv Detail & Related papers (2024-03-12T16:46:54Z) - Tool Documentation Enables Zero-Shot Tool-Usage with Large Language
Models [90.96816639172464]
Large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage.
We advocate the use of tool documentation, descriptions for the individual tool usage, over demonstrations.
arXiv Detail & Related papers (2023-08-01T17:21:38Z) - CCpdf: Building a High Quality Corpus for Visually Rich Documents from
Web Crawl Data [2.7843134136364265]
This paper proposes an efficient pipeline for creating a big-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl.
We also share the CCpdf corpus in the form of an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining.
arXiv Detail & Related papers (2023-04-28T16:12:18Z) - XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z) - textless-lib: a Library for Textless Spoken Language Processing [50.070693765984075]
We introduce textless-lib, a PyTorch-based library that aims to facilitate research in textless spoken language processing.
We describe the building blocks that the library provides and demonstrate its usability.
arXiv Detail & Related papers (2022-02-15T12:39:42Z) - Mill.jl and JsonGrinder.jl: automated differentiable feature extraction
for learning from raw JSON data [0.0]
Learning from raw data input is one of the key components of successful applications of machine learning methods.
arXiv Detail & Related papers (2021-05-19T13:02:10Z) - PAWLS: PDF Annotation With Labels and Structure [4.984601297028257]
We present PDF Annotation with Labels and Structure (PAWLS), a new annotation tool for the PDF document format.
PAWLS supports span-based textual annotation, N-ary relations and freeform, non-textual bounding boxes.
A read-only PAWLS server is available at https://pawls.apps.allenai.org/.
arXiv Detail & Related papers (2021-01-25T18:02:43Z)