Document Intelligence in the Era of Large Language Models: A Survey
- URL: http://arxiv.org/abs/2510.13366v1
- Date: Wed, 15 Oct 2025 09:57:03 GMT
- Title: Document Intelligence in the Era of Large Language Models: A Survey
- Authors: Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, Daniel Dahlmeier
- Abstract summary: Document AI (DAI) has emerged as a vital application area, and is significantly transformed by the advent of large language models (LLMs). This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.
- Score: 3.7711554135888288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document AI (DAI) has emerged as a vital application area, and is significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.
Related papers
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery.
Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations.
Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
- Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey [40.20905051575087]
In AI for Science, multimodal emotion recognition and reasoning has become a rapidly growing frontier.
This paper is the first attempt to comprehensively survey the intersection of MLLMs with multimodal emotion recognition and reasoning.
arXiv Detail & Related papers (2025-09-29T06:13:14Z)
- The Evolution of Video Anomaly Detection: A Unified Framework from DNN to MLLM [27.800308082023285]
Video anomaly detection (VAD) aims to identify and ground anomalous behaviors or events in videos.
The continuous evolution of deep model architectures has driven innovation in VAD methodologies.
The rapid development of multimodal large language models (MLLMs) and large language models (LLMs) has introduced new opportunities and challenges to the VAD field.
arXiv Detail & Related papers (2025-07-29T10:07:24Z)
- Large Language Models in Argument Mining: A Survey [15.041650203089057]
Argument Mining (AM) focuses on extracting argumentative structures from text.
The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning.
This survey systematically synthesizes recent advancements in LLM-driven AM.
arXiv Detail & Related papers (2025-06-19T15:12:58Z)
- How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective [64.00022624183781]
Large language models (LLMs) can assess relevance and support information retrieval (IR) tasks.
We investigate how different LLM modules contribute to relevance judgment through the lens of mechanistic interpretability.
arXiv Detail & Related papers (2025-04-10T16:14:55Z)
- A Survey of Model Architectures in Information Retrieval [59.61734783818073]
The period from 2019 to the present has represented one of the biggest paradigm shifts in information retrieval (IR) and natural language processing (NLP).
We trace the development from traditional term-based methods to modern neural approaches, particularly highlighting the impact of transformer-based models and subsequent large language models (LLMs).
We conclude with a forward-looking discussion of emerging challenges and future directions.
arXiv Detail & Related papers (2025-02-20T18:42:58Z)
- A Survey on Multimodal Benchmarks: In the Era of Large AI Models [13.299775710527962]
Multimodal Large Language Models (MLLMs) have brought substantial advancements in artificial intelligence.
This survey systematically reviews 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application.
arXiv Detail & Related papers (2024-09-21T15:22:26Z)
- From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models [56.9134620424985]
Cross-modal reasoning (CMR) is increasingly recognized as a crucial capability in the progression toward more sophisticated artificial intelligence systems.
The recent trend of deploying Large Language Models (LLMs) to tackle CMR tasks has marked a new mainstream of approaches for enhancing their effectiveness.
This survey offers a nuanced exposition of current methodologies applied in CMR using LLMs, classifying these into a detailed three-tiered taxonomy.
arXiv Detail & Related papers (2024-09-19T02:51:54Z)
- Retrieval-Enhanced Machine Learning: Synthesis and Opportunities [60.34182805429511]
Retrieval-enhancement can be extended to a broader spectrum of machine learning (ML).
This work introduces a formal framework of this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature in various domains in ML with consistent notations, which are missing from the current literature.
The goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
arXiv Detail & Related papers (2024-07-17T20:01:21Z)
- Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation.
We present an extensive overview by categorizing these works in terms of various IE subtasks and techniques.
We empirically analyze the most advanced methods and discover the emerging trend of IE tasks with LLMs.
arXiv Detail & Related papers (2023-12-29T14:25:22Z)
- Document Automation Architectures: Updated Survey in Light of Large Language Models [2.990411348977783]
This paper surveys the current state of the art in document automation (DA).
The objective of DA is to reduce the manual effort during the generation of documents by automatically creating and integrating input from different sources and assembling documents conforming to defined templates.
There have been reviews of commercial solutions of DA, particularly in the legal domain, but to date there has been no comprehensive review of the academic research on DA architectures and technologies.
arXiv Detail & Related papers (2023-08-18T06:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.