PAWLS: PDF Annotation With Labels and Structure
- URL: http://arxiv.org/abs/2101.10281v1
- Date: Mon, 25 Jan 2021 18:02:43 GMT
- Title: PAWLS: PDF Annotation With Labels and Structure
- Authors: Mark Neumann, Zejiang Shen, Sam Skjonsberg
- Abstract summary: We present PDF Annotation With Labels and Structure (PAWLS), a new annotation tool for the PDF document format.
PAWLS supports span-based textual annotation, N-ary relations and freeform, non-textual bounding boxes.
A read-only PAWLS server is available at https://pawls.apps.allenai.org/.
- Score: 4.984601297028257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adobe's Portable Document Format (PDF) is a popular way of distributing
view-only documents with a rich visual markup. This presents a challenge to NLP
practitioners who wish to use the information contained within PDF documents
for training models or data analysis, because annotating these documents is
difficult. In this paper, we present PDF Annotation with Labels and Structure
(PAWLS), a new annotation tool designed specifically for the PDF document
format. PAWLS is particularly suited for mixed-mode annotation and scenarios in
which annotators require extended context to annotate accurately. PAWLS
supports span-based textual annotation, N-ary relations and freeform,
non-textual bounding boxes, all of which can be exported in convenient formats
for training multi-modal machine learning models. A read-only PAWLS server is
available at https://pawls.apps.allenai.org/ and the source code is available
at https://github.com/allenai/pawls.
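As an illustration of consuming a PAWLS export, the sketch below walks the labeled bounding boxes in an exported JSON file. The field names used here are assumptions for illustration only; the actual export schema is documented in the GitHub repository.

```python
import json

# Minimal sketch of reading a PAWLS-style JSON export.
# The field names ("annotations", "bounds", "label", "page") are
# assumptions for illustration; see https://github.com/allenai/pawls
# for the actual export schema.
with open("pawls_export.json") as f:
    doc = json.load(f)

for ann in doc.get("annotations", []):
    label = ann["label"]   # e.g. "Title", "Figure", "Equation"
    page = ann["page"]     # page index the box belongs to
    x0, y0, x1, y1 = (
        ann["bounds"]["left"],
        ann["bounds"]["top"],
        ann["bounds"]["right"],
        ann["bounds"]["bottom"],
    )
    print(f"page {page}: {label} at ({x0:.1f}, {y0:.1f}, {x1:.1f}, {y1:.1f})")
```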
Related papers
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering [36.40110520952274]
This paper introduces a unified platform designed for annotating PDF documents, model training, and inference, tailored to document question-answering.
The annotation interface enables users to input questions and highlight text spans within the PDF file as answers, saving layout information and text spans accordingly.
The platform has been instrumental in driving several research prototypes in document analysis, such as the AI assistant used by the University of California San Diego's (UCSD) International Services and Engagement Office (ISEO) to process a substantial volume of PDF documents.
arXiv Detail & Related papers (2024-03-30T18:11:39Z)
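To make the annotation record concrete, here is a rough sketch of the kind of entry such an interface might save: a question paired with a highlighted answer span plus the layout information needed to locate it on the page. The field names are hypothetical, not DOCMASTER's actual schema.

```python
from dataclasses import dataclass

@dataclass
class QAAnnotation:
    """One QA annotation over a PDF: the question, the highlighted
    answer span, and the layout info needed to locate it on the page.
    Field names are hypothetical, not DOCMASTER's actual schema."""
    question: str
    answer_text: str   # the highlighted span
    page: int          # page containing the span
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) on that page
    char_start: int    # span offsets into the page's extracted text
    char_end: int

ann = QAAnnotation(
    question="What is the invoice total?",
    answer_text="$42.00",
    page=0,
    bbox=(150.0, 500.0, 230.0, 520.0),
    char_start=812,
    char_end=818,
)
```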
- appjsonify: An Academic Paper PDF-to-JSON Conversion Toolkit [9.66954231321555]
appjsonify is a Python-based PDF-to-JSON conversion toolkit for academic papers.
It parses a PDF file using several visual-based document layout analysis models and rule-based text processing approaches.
arXiv Detail & Related papers (2023-10-02T13:48:16Z)
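appjsonify's own pipeline combines layout-analysis models with rule-based processing (see its repository for the real CLI); as a much simpler stand-in, the sketch below converts a PDF to page-level JSON with pypdf.

```python
import json
from pypdf import PdfReader  # pip install pypdf

# A deliberately simple stand-in for the kind of PDF-to-JSON conversion
# appjsonify automates; the real toolkit additionally runs visual layout
# analysis models and rule-based cleanup.
reader = PdfReader("paper.pdf")
doc = {
    "num_pages": len(reader.pages),
    "pages": [
        {"page": i, "text": page.extract_text() or ""}
        for i, page in enumerate(reader.pages)
    ],
}
with open("paper.json", "w") as f:
    json.dump(doc, f, ensure_ascii=False, indent=2)
```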
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with a user's mental model of these documents, which have rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
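A schematic of the retrieve-by-structure-or-content idea: the model first selects a retrieval action, which the system then executes against the parsed document. All names below are hypothetical illustrations, not the paper's API.

```python
# Schematic of PDFTriage-style retrieval: the model picks a retrieval
# action, and the system executes it against the parsed PDF.
# All names here are hypothetical illustrations, not the paper's API.
def retrieve(doc: dict, action: str, arg: str) -> str:
    if action == "fetch_page":     # structure-based: a whole page
        return doc["pages"][int(arg)]["text"]
    if action == "fetch_section":  # structure-based: a named section
        return doc["sections"][arg]
    if action == "search":         # content-based: keyword lookup
        return "\n".join(
            p["text"] for p in doc["pages"] if arg.lower() in p["text"].lower()
        )
    raise ValueError(f"unknown action: {action}")

doc = {
    "pages": [{"text": "Q3 revenue grew 12% year over year."}],
    "sections": {"Results": "Q3 revenue grew 12% year over year."},
}
print(retrieve(doc, "search", "revenue"))
```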
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data [2.7843134136364265]
This paper proposes an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from across the Internet using Common Crawl.
We also share the CCpdf corpus in the form of an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining.
arXiv Detail & Related papers (2023-04-28T16:12:18Z)
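To make the pipeline concrete, the sketch below queries a Common Crawl CDX index for PDF captures under a domain. The crawl ID is only an example, and the filter syntax follows the public cdx-api, which may change between index releases.

```python
import json
import requests  # pip install requests

# Query a Common Crawl CDX index for PDF captures under a domain.
# The crawl ID below is just an example; the filter syntax follows the
# public cdx-api and may differ between index releases.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-14-index"
resp = requests.get(
    INDEX,
    params={
        "url": "*.example.com",
        "output": "json",
        "filter": "mime:application/pdf",
        "limit": "10",
    },
    timeout=60,
)
for line in resp.text.splitlines():
    rec = json.loads(line)
    # each record points into a WARC file: filename + offset + length
    print(rec["url"], rec["filename"], rec["offset"], rec["length"])
```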
- DoSA: A System to Accelerate Annotations on Business Documents with Human-in-the-Loop [0.0]
DoSA (Document Specific Automated Annotations) helps annotators in generating initial annotations automatically using our novel bootstrap approach.
An open-source ready-to-use implementation is made available on GitHub.
arXiv Detail & Related papers (2022-11-09T15:04:07Z)
- XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding [35.35388421383703]
Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU).
We propose MarkupLM for document understanding tasks with markup languages as the backbone.
Experiment results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks.
arXiv Detail & Related papers (2021-10-16T09:17:28Z)
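A minimal usage sketch, assuming the MarkupLM checkpoint and processor released through HuggingFace transformers; exact API details may vary across library versions.

```python
import torch
from transformers import MarkupLMModel, MarkupLMProcessor

# Minimal MarkupLM sketch: the processor parses HTML into text nodes and
# their XPaths, which the model consumes alongside the token ids.
# Assumes the checkpoint released via HuggingFace transformers; API
# details may vary across library versions.
processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

html = "<html><body><h1>PAWLS</h1><p>A PDF annotation tool.</p></body></html>"
encoding = processor(html, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```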
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding [108.12766816023783]
We propose LayoutLM to jointly model interactions between text and layout information across scanned document images.
This is the first time that text and layout are jointly learned in a single framework for document-level pre-training.
It achieves new state-of-the-art results on several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42).
arXiv Detail & Related papers (2019-12-31T14:31:29Z)
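A minimal sketch of feeding words plus layout boxes to the LayoutLM checkpoint distributed through HuggingFace transformers. The words and coordinates below are invented, and boxes must be normalized to a 0-1000 page grid.

```python
import torch
from transformers import LayoutLMModel, LayoutLMTokenizer

# Minimal LayoutLM sketch: each word carries a bounding box normalized to
# a 0-1000 grid, and wordpiece tokens inherit the box of their word.
# The words and boxes are invented examples.
tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

words = ["Invoice", "Total:", "$42.00"]
boxes = [[57, 50, 203, 74], [57, 500, 141, 524], [150, 500, 230, 524]]

token_boxes = []
for word, box in zip(words, boxes):
    token_boxes.extend([box] * len(tokenizer.tokenize(word)))
# [CLS] and [SEP] get conventional dummy boxes
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(" ".join(words), return_tensors="pt")
with torch.no_grad():
    outputs = model(
        input_ids=encoding["input_ids"],
        bbox=torch.tensor([token_boxes]),
        attention_mask=encoding["attention_mask"],
    )
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```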
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.