Related papers: DocFusion: A Unified Framework for Document Parsing Tasks

Related papers

Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding.<n>We propose SLEUTH, a multi agent framework that orchestrates a retriever and four collaborative agents in a coarse to fine process.<n>The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z)
ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links [57.514511353084565]
We introduce a new domain-agnostic framework for selecting a best-performing approach and annotating cross-document links.<n>We apply our framework in two distinct domains -- peer review and news.<n>The resulting novel datasets lay foundation for numerous cross-document tasks like media framing and peer review.
arXiv Detail & Related papers (2025-09-01T11:32:24Z)
Uni-DocDiff: A Unified Document Restoration Model Based on Diffusion [7.1593100791258335]
We propose Uni-DocDiff, a Unified and highly scalable Document restoration model based on Diffusion.<n>Uni-DocDiff develops a learnable task prompt design, ensuring exceptional scalability across diverse tasks.<n>We also devise a novel textbfPrior textbfPool, a simple yet comprehensive mechanism that combines both local high-frequency features and global low-frequency features.
arXiv Detail & Related papers (2025-08-06T03:30:39Z)
Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents.<n>This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents.<n>Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z)
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce Omni V2, a universal model that unifies VsTP typical tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z)
Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking [58.69615583599489]
Deliberate Thinking based Retriever (Debater) is a novel approach that enhances document representations by incorporating a step-by-step thinking process.<n>Debater significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations [2.9798896492745537]
We present a unified dataset for document Question-Answering (QA) We reformulate existing Document AI tasks, such as Information Extraction (IE), into a Question-Answering task. On the other hand, we release the OCR of all the documents and include the exact position of the answer to be found in the document image as a bounding box.
arXiv Detail & Related papers (2025-01-06T21:46:22Z)
Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$2$) GR$2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training. Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z)
SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding [23.910783272007407]
This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU) Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset. Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc's data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies.
arXiv Detail & Related papers (2024-08-27T03:31:24Z)
On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot document entity retrieval (VDER) is an important topic in industrial NLP applications. FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER. We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
Retrieval-Generation Synergy Augmented Large Language Models [30.53260173572783]
We propose an iterative retrieval-generation collaborative framework. We conduct experiments on four question answering datasets, including single-hop QA and multi-hop QA tasks.
arXiv Detail & Related papers (2023-10-08T12:50:57Z)
UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding [93.92313947913831]
We introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities. To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
arXiv Detail & Related papers (2023-08-19T17:32:34Z)
Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios. Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents. We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.