Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding
- URL: http://arxiv.org/abs/2506.21604v1
- Date: Thu, 19 Jun 2025 18:05:00 GMT
- Title: Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding
- Authors: Varun Mannam, Fang Wang, Xin Chen
- Abstract summary: We introduce a systematic, quantitative benchmarking framework to measure the trustworthiness of integrating cross-modal inputs. Our approach establishes quantitative relationships between technical metrics and user-centric trust measures. This work advances responsible AI deployment by providing a rigorous framework for quantifying and enhancing trustworthiness in multimodal RAG for critical enterprise applications.
- Score: 5.861057085203687
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Current evaluation frameworks for multimodal generative AI struggle to establish trustworthiness, hindering enterprise adoption where reliability is paramount. We introduce a systematic, quantitative benchmarking framework to measure the trustworthiness of progressively integrating cross-modal inputs such as text, images, captions, and OCR within VisualRAG systems for enterprise document intelligence. Our approach establishes quantitative relationships between technical metrics and user-centric trust measures. Evaluation reveals that optimal modality weighting with weights of 30% text, 15% image, 25% caption, and 30% OCR improves performance by 57.3% over text-only baselines while maintaining computational efficiency. We provide comparative assessments of foundation models, demonstrating their differential impact on trustworthiness in caption generation and OCR extraction, a vital consideration for reliable enterprise AI. This work advances responsible AI deployment by providing a rigorous framework for quantifying and enhancing trustworthiness in multimodal RAG for critical enterprise applications.
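The abstract specifies a concrete late-fusion recipe: per-modality scores weighted 30/15/25/30 for text/image/caption/OCR. A minimal sketch of such a weighted fusion follows; the function shape, the renormalization over missing modalities, and the placeholder similarity scores are illustrative assumptions, since the paper's implementation details are not given here.

```python
# Illustrative late-fusion scorer using the modality weights reported in
# the abstract (30% text, 15% image, 25% caption, 30% OCR). The per-modality
# similarity inputs are placeholders; the paper's actual scoring internals
# are not specified in this listing.

MODALITY_WEIGHTS = {"text": 0.30, "image": 0.15, "caption": 0.25, "ocr": 0.30}

def fused_score(modality_scores: dict) -> float:
    """Combine per-modality similarity scores (each in [0, 1]) into one
    retrieval score. Weights of absent modalities are renormalized so
    the fused score stays in [0, 1]."""
    present = {m: w for m, w in MODALITY_WEIGHTS.items() if m in modality_scores}
    total_weight = sum(present.values())
    if total_weight == 0.0:
        return 0.0
    return sum(modality_scores[m] * w for m, w in present.items()) / total_weight

# Example: a chunk with strong text and OCR matches but a weak image match.
print(fused_score({"text": 0.8, "image": 0.2, "caption": 0.4, "ocr": 0.7}))  # ~0.58
```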
Related papers
- Decision Quality Evaluation Framework at Pinterest [0.36944296923226316]
The framework is centered on a high-trust Golden Set (GDS) curated by subject matter experts (SMEs). We introduce an automated intelligent sampling pipeline that uses propensity scores to efficiently expand dataset coverage. The framework enables a shift from subjective assessments to a data-driven and quantitative practice for managing content safety systems.
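As an illustration of the sampling idea mentioned above, here is a minimal sketch of propensity-guided sampling for expanding a curated evaluation set; the inverse-propensity weighting, the names, and the sampling-with-replacement shortcut are assumptions for illustration, not the Pinterest pipeline.

```python
import random

# Hypothetical sketch: 'propensity' is an item's estimated probability of
# already being covered by the curated Golden Set; weighting draws by its
# inverse favors under-covered items, so coverage expands efficiently.
# The propensity model itself (not shown) would be fit separately.

def expand_sample(items, propensities, k, seed=0):
    """Draw k items for SME review, weighted by 1 / propensity."""
    rng = random.Random(seed)
    weights = [1.0 / max(p, 1e-6) for p in propensities]
    return rng.choices(items, weights=weights, k=k)  # samples with replacement

candidates = ["pin_a", "pin_b", "pin_c", "pin_d"]
coverage_propensity = [0.90, 0.50, 0.10, 0.05]  # pin_d is least covered
print(expand_sample(candidates, coverage_propensity, k=2))
```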
arXiv Detail & Related papers (2026-02-17T18:45:55Z)
- Red Teaming Large Reasoning Models [26.720095252284818]
Large Reasoning Models (LRMs) have emerged as a powerful advancement in multi-step reasoning tasks. LRMs introduce novel safety and reliability risks, such as CoT-hijacking and prompt-induced inefficiencies. We propose RT-LRM, a unified benchmark designed to assess the trustworthiness of LRMs.
arXiv Detail & Related papers (2025-11-29T09:45:03Z)
- ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search [69.60882125603133]
We present ReliabilityRAG, a framework for adversarial robustness that explicitly leverages reliability information of retrieved documents. Our work is a significant step towards more effective, provably robust defenses against retrieved-corpus corruption in RAG.
arXiv Detail & Related papers (2025-09-27T22:36:42Z)
- Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation [51.19622266249408]
MultiTrust-X is a benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets. Our experiments reveal significant vulnerabilities in current models.
arXiv Detail & Related papers (2025-08-21T09:00:01Z)
- Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions. We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Our code and all baselines used in the paper are available on GitHub.
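The summary does not describe FineCE's mechanism; as a generic baseline for fine-grained confidence during generation, one can aggregate the decoder's per-token log-probabilities over each generated segment, as in this sketch (segment boundaries and numbers are illustrative):

```python
import math

# Common baseline for fine-grained confidence, not FineCE's method:
# score each generated segment by the geometric mean of its token
# probabilities, computed from per-token log-probabilities.

def segment_confidence(token_logprobs):
    """Geometric-mean token probability for one generated segment."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

answer_span = [-0.10, -0.20, -0.05]       # high-probability tokens
speculative_span = [-1.80, -2.30, -1.10]  # low-probability tokens
print(segment_confidence(answer_span))       # ~0.89, likely reliable
print(segment_confidence(speculative_span))  # ~0.18, flag for review
```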
arXiv Detail & Related papers (2025-08-16T13:29:35Z)
- A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models [6.62851757612838]
Current confidence estimation methods for large language models (LLMs) neglect the relevance between responses and contextual information. We propose CRUX, which integrates context faithfulness and consistency for confidence estimation via two novel metrics. Experiments across three benchmark datasets demonstrate CRUX's effectiveness, achieving higher AUROC than existing baselines.
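Since the claim is framed in terms of AUROC, a compact reminder of how such confidence scores are evaluated may help; this exact pairwise AUROC implementation is a generic illustration of the protocol, not code from the CRUX paper.

```python
# Exact AUROC over confidence scores and binary correctness labels: the
# probability that a correct answer receives higher confidence than an
# incorrect one, with ties counting half.

def auroc(confidences, correct):
    """Pairwise AUROC of confidences against correctness labels."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect examples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc([0.9, 0.8, 0.3, 0.2], [True, True, False, True]))  # ~0.67
```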
arXiv Detail & Related papers (2025-08-01T12:58:34Z)
- Semantic Chain-of-Trust: Autonomous Trust Orchestration for Collaborator Selection via Hypergraph-Aided Agentic AI [57.58120823855315]
This paper proposes an autonomous trust orchestration method based on a new concept of semantic chain-of-trust. Our technique employs agentic AI and hypergraphs to establish and maintain trust relationships among devices. Experimental results demonstrate that the proposed method achieves resource-efficient trust evaluation.
arXiv Detail & Related papers (2025-07-31T13:53:25Z)
- Structured Relevance Assessment for Robust Retrieval-Augmented Language Models [0.0]
We introduce a framework for structured relevance assessment that enhances RALM robustness. Our approach employs a multi-dimensional scoring system that considers both semantic matching and source reliability. Preliminary evaluations demonstrate significant reductions in hallucination rates and improved transparency in reasoning processes.
arXiv Detail & Related papers (2025-07-28T19:20:04Z)
- AI Agents-as-Judge: Automated Assessment of Accuracy, Consistency, Completeness and Clarity for Enterprise Documents [0.0]
This study presents a modular, multi-agent system for the automated review of highly structured enterprise business documents using AI agents. It uses modern orchestration tools such as LangChain, CrewAI, TruLens, and Guidance to enable section-by-section evaluation of documents. It achieves 99% information consistency (vs. 92% for humans), halving error and bias rates, and reducing average review time from 30 to 2.5 minutes per document.
arXiv Detail & Related papers (2025-06-23T17:46:15Z)
- CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision [15.604947362541415]
CrEst is a weakly supervised framework for assessing the credibility of context documents during inference. Experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-06-17T18:44:21Z)
- On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective [334.48358909967845]
Generative Foundation Models (GenFMs) have emerged as transformative tools. Their widespread adoption raises critical concerns regarding trustworthiness across multiple dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions.
arXiv Detail & Related papers (2025-02-20T06:20:36Z)
- Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey [92.36487127683053]
Retrieval-Augmented Generation (RAG) is an advanced technique designed to address the challenges of Artificial Intelligence-Generated Content (AIGC). RAG provides reliable and up-to-date external knowledge, reduces hallucinations, and ensures relevant context across a wide range of tasks. Despite RAG's success and potential, recent studies have shown that the RAG paradigm also introduces new risks, including privacy concerns, adversarial attacks, and accountability issues.
arXiv Detail & Related papers (2025-02-08T06:50:47Z)
- MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation [22.19073789961769]
Recent advances in generative Large Language Models (LLMs) have been remarkable; however, the quality of the text generated by these models often reveals persistent issues.
We propose MATEval, a multi-agent text evaluation framework.
Our framework incorporates self-reflection and Chain-of-Thought strategies, along with feedback mechanisms, to enhance the depth and breadth of the evaluation process.
arXiv Detail & Related papers (2024-03-28T10:41:47Z)
- AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements, as well as a comprehensive evaluation toolkit. This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
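A progress-rate metric of this kind can be as simple as the fraction of annotated subgoals an agent has satisfied so far; the sketch below illustrates that idea with an assumed subgoal representation, not AgentBoard's actual schema.

```python
# Minimal progress-rate sketch: credit incremental advancement by the
# share of required subgoals completed, instead of binary task success.
# The subgoal names and set representation are illustrative assumptions.

def progress_rate(achieved, subgoals):
    """Fraction of required subgoals completed, in [0, 1]."""
    if not subgoals:
        return 0.0
    return sum(goal in achieved for goal in subgoals) / len(subgoals)

task_subgoals = ["find_recipe", "gather_ingredients", "cook_dish"]
print(progress_rate({"find_recipe", "gather_ingredients"}, task_subgoals))  # ~0.67
```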
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
- DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models [4.953092503184905]
This work proposes DCR, an automated framework for evaluating and improving the consistency of texts generated by Large Language Models (LLMs).
We introduce an automatic metric converter (AMC) that translates the output of the divide-conquer evaluator (DCE) into an interpretable numeric score.
Our approach also reduces output inconsistencies by nearly 90%, showing promise for effective hallucination mitigation.
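The divide-then-score pattern described above can be illustrated in a few lines: split the output into sentences, judge each against a reference, and convert the verdicts into one numeric score (the role attributed to AMC). The trivial substring judge and all names below are stand-in assumptions, not the DCR implementation.

```python
# Hedged sketch of divide-conquer consistency scoring. The stand-in
# judge uses substring matching; a real system would likely query an LLM.

def judge_sentence(sentence, reference):
    """Stand-in consistency judge for one sentence."""
    return sentence.lower() in reference.lower()

def consistency_score(answer, reference):
    """Fraction of answer sentences judged consistent with the reference."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    return sum(judge_sentence(s, reference) for s in sentences) / len(sentences)

reference = "The model was trained on 10k documents. Training took two days."
answer = "The model was trained on 10k documents. Training took a week."
print(consistency_score(answer, reference))  # 0.5: one of two sentences holds
```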
arXiv Detail & Related papers (2024-01-04T08:34:16Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios, and absolute error rates of up to 19% in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
- Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
We propose CLIPBERTScore, which combines an image-summary metric with a document-summary metric to leverage their robustness and strong factuality detection performance.
We show that this simple combination of the two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
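Given that the metric is described as a simple combination of two component scores, a minimal sketch looks like the following; the equal weighting and placeholder inputs are assumptions rather than the paper's tuned configuration.

```python
# Sketch of combining an image-summary score with a document-summary
# score into one factuality metric, per the "simple combination of two
# metrics" described above. Component scorers are placeholders.

def clipbertscore(image_summary_score, document_summary_score, alpha=0.5):
    """Convex combination of the two component factuality scores."""
    return alpha * image_summary_score + (1.0 - alpha) * document_summary_score

# A summary well grounded in the document but weakly tied to the image:
print(clipbertscore(image_summary_score=0.4, document_summary_score=0.9))  # ~0.65
```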
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
- Towards a multi-stakeholder value-based assessment framework for algorithmic systems [76.79703106646967]
We develop a value-based assessment framework that visualizes closeness and tensions between values.
We give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
arXiv Detail & Related papers (2022-05-09T19:28:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.