Related papers: appjsonify: An Academic Paper PDF-to-JSON Conversion Toolkit

URL: http://arxiv.org/abs/2310.01206v2
Date: Tue, 3 Oct 2023 13:19:40 GMT
Title: appjsonify: An Academic Paper PDF-to-JSON Conversion Toolkit
Authors: Atsuki Yamaguchi, Terufumi Morishita
Abstract summary: appjsonify is a Python-based PDF-to-JSON conversion toolkit for academic papers. It parses a PDF file using several visual-based document layout analysis models and rule-based text processing approaches.
Score: 9.66954231321555
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present appjsonify, a Python-based PDF-to-JSON conversion toolkit for academic papers. It parses a PDF file using several visual-based document layout analysis models and rule-based text processing approaches. appjsonify is a flexible tool that allows users to easily configure the processing pipeline to handle a specific format of a paper they wish to process. We are publicly releasing appjsonify as an easy-to-install toolkit available via PyPI and GitHub.

Related papers

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models [17.018144344175973]
olmOCR is an open-source Python toolkit for processing PDFs into clean, linearized plain text in natural reading order. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on a sample of 260,000 pages from over 100,000 crawled PDFs. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and convert a million PDF pages for only $190 USD.
arXiv Detail & Related papers (2025-02-25T18:38:38Z)
PyPulse: A Python Library for Biosignal Imputation [58.35269251730328]
We introduce PyPulse, a Python package for imputation of biosignals in both clinical and wearable sensor settings. PyPulse provides a modular and extendable framework with high ease of use for a broad user base, including non-machine-learning bioresearchers. We released PyPulse under the MIT License on GitHub and PyPI.
arXiv Detail & Related papers (2024-12-09T11:00:55Z)
A Comparative Study of PDF Parsing Tools Across Diverse Document Categories [0.0]
We compare 10 popular PDF parsing tools across 6 document categories using the DocLayNet dataset. For text extraction, PyMuPDF and pypdfium generally outperformed others, but all text extractions struggled with Scientific and Patent documents. In table detection, TATR excelled in the Financial, Patent, Law & Regulations, and Scientific categories.
arXiv Detail & Related papers (2024-10-13T15:11:31Z)
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved performance on this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction [0.0]
Extracting information from documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages. Camelot and pdfnumber can solely extract tables from digital PDFs. PP-OCRV2 can comprehensively extract image-based PDFs and tables from pictures.
arXiv Detail & Related papers (2024-09-08T15:08:51Z)
Concise and Precise Context Compression for Tool-Using Language Models [60.606281074373136]
We propose two strategies for compressing tool documentation into concise and precise summary sequences for tool-using language models. Results on API-Bank and APIBench show that our approach reaches a performance comparable to the upper-bound baseline under up to 16x compression ratio.
arXiv Detail & Related papers (2024-07-02T08:17:00Z)
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions [79.72930339711478]
pyvene is an open-source library that supports customizable interventions on a range of different PyTorch modules. We show how pyvene provides a unified framework for performing interventions on neural models and sharing the intervened-upon models with others.
arXiv Detail & Related papers (2024-03-12T16:46:54Z)
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models [90.96816639172464]
Large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage. We advocate the use of tool documentation, descriptions for the individual tool usage, over demonstrations.
arXiv Detail & Related papers (2023-08-01T17:21:38Z)
CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data [2.7843134136364265]
This paper proposes an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl. We also share a CCpdf corpus in the form of an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining.
arXiv Detail & Related papers (2023-04-28T16:12:18Z)
XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model. XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z)
textless-lib: a Library for Textless Spoken Language Processing [50.070693765984075]
We introduce textless-lib, a PyTorch-based library aimed at facilitating research in textless spoken language processing. We describe the building blocks that the library provides and demonstrate its usability.
arXiv Detail & Related papers (2022-02-15T12:39:42Z)
Mill.jl and JsonGrinder.jl: automated differentiable feature extraction for learning from raw JSON data [0.0]
Learning from raw data input is one of the key components of successful applications of machine learning methods.
arXiv Detail & Related papers (2021-05-19T13:02:10Z)
PAWLS: PDF Annotation With Labels and Structure [4.984601297028257]
We present PDF Annotation with Labels and Structure (PAWLS), a new annotation tool for the PDF document format. PAWLS supports span-based textual annotation, N-ary relations and freeform, non-textual bounding boxes. A read-only PAWLS server is available at https://pawls.apps.allenai.org/.
arXiv Detail & Related papers (2021-01-25T18:02:43Z)
