Related papers: PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction

URL: http://arxiv.org/abs/2409.05125v1
Date: Sun, 8 Sep 2024 15:08:51 GMT
Title: PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction
Authors: Lei Sheng, Shuai-Shuai Xu
Abstract summary: Extracting information from documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages. Camelot and pdfplumber can extract tables only from digital PDFs. PP-StructureV2 can extract tables from image-based PDFs and from pictures.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Currently, a substantial volume of document data exists in an unstructured format, encompassing Portable Document Format (PDF) files and images. Extracting information from these documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages. Several open-source toolkits, such as Camelot, Plumb a PDF (pdfplumber), and PaddlePaddle Structure V2 (PP-StructureV2), have been developed to facilitate table extraction from PDFs or images. However, each toolkit has its limitations. Camelot and pdfplumber can extract tables only from digital PDFs and cannot handle image-based PDFs or pictures. PP-StructureV2, on the other hand, can extract tables from image-based PDFs and from pictures. Nevertheless, it cannot differentiate between application scenarios, such as wired versus wireless tables or digital versus image-based PDFs. To address these issues, we introduce the PDF table extraction (PdfTable) toolkit. This toolkit integrates numerous open-source models, including seven table recognition models, four optical character recognition (OCR) tools, and three layout analysis models. By refining the PDF table extraction process, PdfTable adapts to a variety of application scenarios. We substantiate the efficacy of the PdfTable toolkit through verification on a self-labeled wired table dataset and the open-source wireless table recognition dataset PubTabNet. The PdfTable code will be available on GitHub: https://github.com/CycloneBoy/pdf_table.

Related papers

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models [17.018144344175973]
olmOCR is an open-source Python toolkit for processing PDFs into clean, linearized plain text in natural reading order. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on a sample of 260,000 pages from over 100,000 crawled PDFs. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and convert a million PDF pages for only $190 USD.
arXiv Detail & Related papers (2025-02-25T18:38:38Z)
A Comparative Study of PDF Parsing Tools Across Diverse Document Categories [0.0]
We compare 10 popular PDF parsing tools across 6 document categories using the DocLayNet dataset. For text extraction, PyMuPDF and pypdfium generally outperformed others, but all text extractions struggled with Scientific and Patent documents. In table detection, TATR excelled in the Financial, Patent, Law & Regulations, and Scientific categories.
arXiv Detail & Related papers (2024-10-13T15:11:31Z)
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding is a challenging task to process and comprehend large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition [55.153629718464565]
We introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model. UniTabNet employs a "divide-and-conquer" strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure.
arXiv Detail & Related papers (2024-09-20T01:26:32Z)
tabulapdf: An R Package to Extract Tables from PDF Documents [0.0]
tabulapdf is an R package that utilizes the Tabula Java library to import tables from PDF files directly into R. It can reduce time and effort in data extraction processes in fields like investigative journalism.
arXiv Detail & Related papers (2024-08-25T22:02:05Z)
LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets. LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets. We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
appjsonify: An Academic Paper PDF-to-JSON Conversion Toolkit [9.66954231321555]
appjsonify is a Python-based PDF-to-JSON conversion toolkit for academic papers. It parses a PDF file using several visual-based document layout analysis models and rule-based text processing approaches.
arXiv Detail & Related papers (2023-10-02T13:48:16Z)
SEMv2: Table Separation Line Detection Based on Instance Segmentation [96.36188168694781]
We propose an accurate table structure recognizer, termed SEMv2 (SEM: Split, Embed and Merge). We address the table separation line instance-level discrimination problem and introduce a table separation line detection strategy based on conditional convolution. To comprehensively evaluate SEMv2, we also present a more challenging dataset for table structure recognition, dubbed iFLYTAB.
arXiv Detail & Related papers (2023-03-08T05:15:01Z)
TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets [5.5347995556789105]
We devise a system capable of parsing tables in both native PDFs and scanned images with high precision. We also create TableAnnotator and ExcelAnnotator, which constitute a spreadsheet-based weak supervision mechanism.
arXiv Detail & Related papers (2022-01-05T15:21:06Z)
Multi-Type-TD-TSR -- Extracting Tables from Document Images using a Multi-stage Pipeline for Table Detection and Table Structure Recognition: from OCR to Structured Table Representations [63.98463053292982]
The recognition of tables consists of two main tasks, namely table detection and table structure recognition. Recent work shows a clear trend towards deep learning approaches coupled with the use of transfer learning for the task of table structure recognition. We present a multistage pipeline named Multi-Type-TD-TSR, which offers an end-to-end solution for the problem of table recognition.
arXiv Detail & Related papers (2021-05-23T21:17:18Z)
PAWLS: PDF Annotation With Labels and Structure [4.984601297028257]
We present PDF Annotation with Labels and Structure (PAWLS), a new annotation tool for the PDF document format. PAWLS supports span-based textual annotation, N-ary relations and freeform, non-textual bounding boxes. A read-only PAWLS server is available at https://pawls.apps.allenai.org/.
arXiv Detail & Related papers (2021-01-25T18:02:43Z)
GFTE: Graph-based Financial Table Extraction [66.26206038522339]
In the financial industry and many other fields, tables are often disclosed in unstructured digital files, e.g. Portable Document Format (PDF) files and images. We publish a standard Chinese dataset named FinTab, which contains more than 1,600 financial tables of diverse kinds. We propose a novel graph-based convolutional network model named GFTE as a baseline for future comparison.
arXiv Detail & Related papers (2020-03-17T07:10:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.