eSapiens: A Real-World NLP Framework for Multimodal Document Understanding and Enterprise Knowledge Processing
- URL: http://arxiv.org/abs/2506.16768v1
- Date: Fri, 20 Jun 2025 06:07:20 GMT
- Title: eSapiens: A Real-World NLP Framework for Multimodal Document Understanding and Enterprise Knowledge Processing
- Authors: Isaac Shi, Zeyuan Li, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi
- Abstract summary: We introduce eSapiens, a unified question-answering system designed for enterprise settings. eSapiens bridges structured databases and unstructured corpora via a dual-module architecture. We evaluate eSapiens on the RAGTruth benchmark, analyzing performance across key dimensions such as completeness, hallucination, and context utilization.
- Score: 6.450269621190948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce eSapiens, a unified question-answering system designed for enterprise settings, which bridges structured databases and unstructured textual corpora via a dual-module architecture. The system combines a Text-to-SQL planner with a hybrid Retrieval-Augmented Generation (RAG) pipeline, enabling natural language access to both relational data and free-form documents. To enhance answer faithfulness, the RAG module integrates dense and sparse retrieval, commercial reranking, and a citation verification loop that ensures grounding consistency. We evaluate eSapiens on the RAGTruth benchmark across five leading large language models (LLMs), analyzing performance across key dimensions such as completeness, hallucination, and context utilization. Results demonstrate that eSapiens outperforms a FAISS baseline in contextual relevance and generation quality, with optional strict-grounding controls for high-stakes scenarios. This work provides a deployable framework for robust, citation-aware question answering in real-world enterprise applications.
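The abstract describes a concrete data flow: route structured questions to a Text-to-SQL planner, send the rest through hybrid dense-plus-sparse retrieval, rerank the fused candidates, and verify citations before answering. The paper does not publish this code, so the sketch below is purely illustrative: every class, method, and parameter name (Passage, reciprocal_rank_fusion, planner.is_structured, max_repairs, and so on) is a hypothetical stand-in, and reciprocal rank fusion is just one common way to combine dense and sparse rankings, not necessarily the one eSapiens uses.

```python
# Illustrative sketch only -- none of these names come from the eSapiens paper.
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str
    text: str
    score: float = 0.0


def reciprocal_rank_fusion(dense, sparse, k=60):
    """Merge two ranked lists; RRF is an assumed fusion method, not confirmed."""
    fused, texts = {}, {}
    for ranking in (dense, sparse):
        for rank, p in enumerate(ranking):
            fused[p.doc_id] = fused.get(p.doc_id, 0.0) + 1.0 / (k + rank + 1)
            texts[p.doc_id] = p.text
    return sorted(
        (Passage(d, texts[d], s) for d, s in fused.items()),
        key=lambda p: p.score,
        reverse=True,
    )


def answer(query, planner, dense, sparse, reranker, llm, max_repairs=2):
    """Dual-module routing: SQL for relational questions, RAG for the rest."""
    if planner.is_structured(query):        # hypothetical routing predicate
        return planner.execute(planner.to_sql(query))

    # Hybrid retrieval: fuse dense and sparse candidates, then rerank
    # (the paper mentions a commercial reranker; `reranker` is a stand-in).
    candidates = reciprocal_rank_fusion(
        dense.search(query, top_k=50), sparse.search(query, top_k=50)
    )
    context = reranker.rerank(query, candidates)[:5]

    draft = llm.generate(query, context)
    # Citation-verification loop: re-prompt while any cited passage is
    # missing from the retrieved context, up to a fixed repair budget.
    grounded_ids = {p.doc_id for p in context}
    for _ in range(max_repairs):
        ungrounded = [c for c in draft.citations if c not in grounded_ids]
        if not ungrounded:
            break
        draft = llm.generate(query, context, repair=ungrounded)
    return draft.text
```

Under the strict-grounding control mentioned in the abstract, a deployment would presumably refuse to answer, rather than return the last draft, once the repair budget is exhausted; the abstract does not specify the fallback behavior.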
Related papers
- Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks [56.350173737493215]
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. MMESGBench is a first-of-its-kind benchmark dataset to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories.
arXiv Detail & Related papers (2025-07-25T03:58:07Z)
- Large Language Models are Good Relational Learners [55.40941576497973]
We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)-based encoder to generate structured relational prompts for large language models (LLMs). Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to process and reason over complex entity relationships.
arXiv Detail & Related papers (2025-06-06T04:07:55Z)
- HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation [11.53083922927901]
HM-RAG is a novel Hierarchical Multi-agent Multimodal RAG framework. It pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data.
arXiv Detail & Related papers (2025-04-13T06:55:33Z)
- Mixture of Structural-and-Textual Retrieval over Text-rich Graph Knowledge Bases [78.62158923194153]
Text-rich Graph Knowledge Bases (TG-KBs) have become increasingly crucial for answering queries by providing textual and structural knowledge. We propose a Mixture of Structural-and-Textual Retrieval (MoR) to retrieve these two types of knowledge via a Planning-Reasoning-Organizing framework.
arXiv Detail & Related papers (2025-02-27T17:42:52Z)
- OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z)
- Transparent NLP: Using RAG and LLM Alignment for Privacy Q&A [15.86510147965235]
The General Data Protection Regulation requires precise processing information to be clear and accessible. This paper examines state-of-the-art Retrieval-Augmented Generation (RAG) systems enhanced with alignment techniques to fulfill these obligations.
arXiv Detail & Related papers (2025-02-10T16:42:00Z)
- Adaptive Two-Phase Finetuning LLMs for Japanese Legal Text Retrieval [6.058427379240698]
We introduce a new dataset specifically designed for Japanese legal contexts. In the first phase, the model learns a broad understanding of global contexts, enhancing its generalization. In the second phase, the model is fine-tuned to address complex queries specific to legal scenarios. Our pipeline proves effective in English contexts, surpassing comparable baselines on the MS MARCO dataset.
arXiv Detail & Related papers (2024-12-03T10:52:49Z)
- ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
- SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding [55.48936731641802]
We present SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset covers eight languages: English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z)
- STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STaRK, a large-scale semi-structured retrieval benchmark on Textual and Relational Knowledge Bases.
Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine.
We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z)