Evaluating Retrieval Augmented Generative Models for Document Queries in Transportation Safety
- URL: http://arxiv.org/abs/2504.07022v1
- Date: Wed, 09 Apr 2025 16:37:03 GMT
- Title: Evaluating Retrieval Augmented Generative Models for Document Queries in Transportation Safety
- Authors: Chad Melton, Alex Sorokine, Steve Peterson,
- Abstract summary: This study evaluates the performance of three fine-tuned generative models, ChatGPT, Google's Vertex AI, and ORNL Retrieval Augmented Generation augmented LLaMA 2 and LLaMA.<n>We developed 100 realistic queries relevant to route planning and permitting requirements.<n>Results demonstrated that the RAG-augmented LLaMA models significantly outperformed Vertex AI and ChatGPT, providing more detailed and generally accurate information.
- Score: 0.7373617024876725
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Applications of generative Large Language Models LLMs are rapidly expanding across various domains, promising significant improvements in workflow efficiency and information retrieval. However, their implementation in specialized, high-stakes domains such as hazardous materials transportation is challenging due to accuracy and reliability concerns. This study evaluates the performance of three fine-tuned generative models, ChatGPT, Google's Vertex AI, and ORNL Retrieval Augmented Generation augmented LLaMA 2 and LLaMA in retrieving regulatory information essential for hazardous material transportation compliance in the United States. Utilizing approximately 40 publicly available federal and state regulatory documents, we developed 100 realistic queries relevant to route planning and permitting requirements. Responses were qualitatively rated based on accuracy, detail, and relevance, complemented by quantitative assessments of semantic similarity between model outputs. Results demonstrated that the RAG-augmented LLaMA models significantly outperformed Vertex AI and ChatGPT, providing more detailed and generally accurate information, despite occasional inconsistencies. This research introduces the first known application of RAG in transportation safety, emphasizing the need for domain-specific fine-tuning and rigorous evaluation methodologies to ensure reliability and minimize the risk of inaccuracies in high-stakes environments.
Related papers
- The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility.<n>This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.<n>We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z) - Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and text generation.<n>LLMs can inadvertently generate unsafe or biased responses when prompted with problematic inputs.<n>This research addresses the critical challenge of developing language models that generate both helpful and harmless content.
arXiv Detail & Related papers (2024-11-26T06:52:22Z) - Context Awareness Gate For Retrieval Augmented Generation [2.749898166276854]
Retrieval Augmented Generation (RAG) has emerged as a widely adopted approach to mitigate the limitations of large language models (LLMs) in answering domain-specific questions.<n>Previous research has predominantly focused on improving the accuracy and quality of retrieved data chunks to enhance the overall performance of the generation pipeline.<n>We investigate the impact of retrieving irrelevant information in open-domain question answering, highlighting its significant detrimental effect on the quality of LLM outputs.
arXiv Detail & Related papers (2024-11-25T06:48:38Z) - SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z) - FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios.
FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework.
Our study reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness.
arXiv Detail & Related papers (2024-09-30T06:27:53Z) - VERA: Validation and Enhancement for Retrieval Augmented systems [0.0]
We propose textbfVERA (textbfValidation and textbfEnhancement for textbfRetrieval textbfAugmented systems), a system designed to evaluate and enhance the retrieved context before response generation.
VERA employs an evaluator-cum-enhancer LLM that first checks if external retrieval is necessary, evaluates the relevance and redundancy of the retrieved context, and refines it to eliminate non-essential information.
arXiv Detail & Related papers (2024-09-18T16:10:47Z) - Trustworthiness in Retrieval-Augmented Generation Systems: A Survey [59.26328612791924]
Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs)
We propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy.
arXiv Detail & Related papers (2024-09-16T09:06:44Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.<n>With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.<n> Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - Benchmark Data Contamination of Large Language Models: A Survey [5.806534973464769]
This paper reviews the complex challenge of Benchmark Data Contamination (BDC) in Large Language Models (LLMs) evaluation.
It explores alternative assessment methods to mitigate the risks associated with traditional benchmarks.
arXiv Detail & Related papers (2024-06-06T16:41:39Z) - Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models [54.55088169443828]
Chain-of-Noting (CoN) is a novel approach aimed at improving the robustness of RALMs in facing noisy, irrelevant documents and in handling unknown scenarios.
CoN achieves an average improvement of +7.9 in EM score given entirely noisy retrieved documents and +10.5 in rejection rates for real-time questions that fall outside the pre-training knowledge scope.
arXiv Detail & Related papers (2023-11-15T18:54:53Z) - Safe Model-Based Multi-Agent Mean-Field Reinforcement Learning [48.667697255912614]
Mean-field reinforcement learning addresses the policy of a representative agent interacting with the infinite population of identical agents.
We propose Safe-M$3$-UCRL, the first model-based mean-field reinforcement learning algorithm that attains safe policies even in the case of unknown transitions.
Our algorithm effectively meets the demand in critical areas while ensuring service accessibility in regions with low demand.
arXiv Detail & Related papers (2023-06-29T15:57:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.