Benchmarking LLMs for Environmental Review and Permitting
        - URL: http://arxiv.org/abs/2407.07321v3
 - Date: Thu, 12 Jun 2025 03:39:58 GMT
 - Title: Benchmarking LLMs for Environmental Review and Permitting
 - Authors: Rounak Meyur, Hung Phan, Koby Hayashi, Ian Stewart, Shivam Sharma, Sarthak Chaturvedi, Mike Parker, Dan Nally, Sadie Montgomery, Karl Pazdernik, Ali Jannesari, Mahantesh Halappanavar, Sai Munikoti, Sameera Horawalavithana, Anurag Acharya
 - Abstract summary: The National Environmental Policy Act (NEPA) requires federal agencies to consider the environmental impacts of proposed actions. The effectiveness of Large Language Models (LLMs) in specialized domains like NEPA remains untested for adoption in federal decision-making processes. We present NEPAQuAD, the first comprehensive benchmark derived from EIS documents.
 - Score: 10.214978239010849
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract: The National Environmental Policy Act (NEPA) stands as a foundational piece of environmental legislation in the United States, requiring federal agencies to consider the environmental impacts of their proposed actions. The primary mechanism for achieving this is the preparation of Environmental Assessments (EAs) and, for significant impacts, comprehensive Environmental Impact Statements (EIS). The effectiveness of Large Language Models (LLMs) in specialized domains like NEPA remains untested for adoption in federal decision-making processes. To address this gap, we present the NEPA Question and Answering Dataset (NEPAQuAD), the first comprehensive benchmark derived from EIS documents, along with a modular and transparent evaluation pipeline, MAPLE, to assess LLM performance on NEPA-focused regulatory reasoning tasks. Our benchmark leverages actual EIS documents to create diverse question types, ranging from factual questions to complex problem-solving ones. The pipeline tests both closed- and open-source models in zero-shot and context-driven QA settings. We evaluate five state-of-the-art LLMs using our framework to assess both their prior knowledge and their ability to process NEPA-specific information. The experimental results reveal that all models consistently achieve their highest performance when provided with the gold passage as context. Comparing the other context-driven approaches for each model, Retrieval Augmented Generation (RAG)-based approaches substantially outperform full PDF document contexts, indicating that none of the models is well suited for long-context question answering. Our analysis suggests that NEPA-focused regulatory reasoning poses a significant challenge for LLMs, particularly in understanding complex regulatory semantics and effectively processing lengthy regulatory documents.
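
To make the evaluation setup concrete, the sketch below shows a context-conditioned QA loop in the spirit of the comparison described above (zero-shot, gold-passage, RAG, and full-document contexts). It is an illustrative sketch, not the authors' MAPLE implementation: `ask_model`, `retrieve_passages`, and `exact_match` are hypothetical placeholders standing in for whatever model client, retriever, and answer-quality metric a reader has available.

```python
# Illustrative sketch of a context-conditioned QA evaluation; placeholder
# functions below are hypothetical, not part of NEPAQuAD or MAPLE.
from dataclasses import dataclass

@dataclass
class QAExample:
    question: str
    gold_answer: str
    gold_passage: str    # excerpt from the source EIS document
    full_document: str   # entire (long) EIS text

def ask_model(question: str, context: str | None) -> str:
    """Placeholder for an LLM call; swap in any closed- or open-source model client."""
    prompt = question if context is None else f"Context:\n{context}\n\nQuestion: {question}"
    return "<model answer for: " + prompt[:40] + "...>"

def retrieve_passages(question: str, document: str, k: int = 3) -> str:
    """Placeholder retriever; a real RAG setup would embed and rank document chunks."""
    chunks = [document[i:i + 500] for i in range(0, len(document), 500)]
    return "\n".join(chunks[:k])

def exact_match(prediction: str, gold: str) -> float:
    """Toy scorer; a real pipeline would use a richer answer-quality metric."""
    return float(prediction.strip().lower() == gold.strip().lower())

def evaluate(examples: list[QAExample]) -> dict[str, float]:
    # The four context conditions compared in the abstract.
    conditions = {
        "zero_shot": lambda ex: None,                                        # no context
        "gold_passage": lambda ex: ex.gold_passage,                          # oracle context
        "rag": lambda ex: retrieve_passages(ex.question, ex.full_document),  # retrieved chunks
        "full_document": lambda ex: ex.full_document,                        # whole PDF text
    }
    scores = {name: 0.0 for name in conditions}
    for ex in examples:
        for name, make_context in conditions.items():
            answer = ask_model(ex.question, make_context(ex))
            scores[name] += exact_match(answer, ex.gold_answer)
    return {name: total / len(examples) for name, total in scores.items()}
```

Under a setup like this, the abstract's findings correspond to the gold_passage condition scoring highest for every model and the rag condition outperforming full_document.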
 
       
      
        Related papers
- Audit, Alignment, and Optimization of LM-Powered Subroutines with Application to Public Comment Processing [2.0417058495510374]
We propose a framework for declaring LM-powered subroutines for use within conventional asynchronous code. We use this framework to develop "CommentNEPA," an application that compiles, organizes, and summarizes a corpus of public commentary submitted in response to a project requiring environmental review.
arXiv  Detail & Related papers  (2025-07-10T18:52:09Z)
- LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements [26.88382777632026]
HSE-Bench is the first benchmark dataset designed to evaluate the HSE compliance assessment capabilities of large language models. It comprises over 1,000 manually curated questions drawn from regulations, court cases, safety exams, and fieldwork videos. We conduct evaluations on different prompting strategies and more than 10 LLMs, including foundation models, reasoning models, and multimodal vision models.
arXiv  Detail & Related papers  (2025-05-29T01:02:53Z)
- Towards Contamination Resistant Benchmarks [0.6906005491572401]
Evaluating large language models (LLMs) properly is crucial for understanding their potential and addressing concerns such as safety. Contamination stands out as a key issue that undermines the reliability of evaluations. We propose a benchmark based on Caesar ciphers (e.g., "ab" to "bc" when the shift is 1), which, despite its simplicity, is an excellent example of a contamination resistant benchmark (a minimal code illustration of this cipher mechanism appears after this list).
arXiv  Detail & Related papers  (2025-05-13T09:35:40Z)
- Sustainability via LLM Right-sizing [21.17523328451591]
Large language models (LLMs) have become increasingly embedded in organizational workflows. This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint.
arXiv  Detail & Related papers  (2025-04-17T04:00:40Z)
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv  Detail & Related papers  (2025-01-24T06:39:38Z)
- The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance. We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv  Detail & Related papers  (2025-01-20T06:35:01Z)
- Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective [5.769786334333616]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) based applications including automated text generation, question answering, and others.
They face a significant challenge: hallucinations, where models produce plausible-sounding but factually incorrect responses.
This paper discusses these open challenges covering state-of-the-art datasets and benchmarks as well as methods for knowledge integration and evaluating hallucinations.
arXiv  Detail & Related papers  (2024-11-21T16:09:05Z)
- DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [89.51834016940153]
We introduce DetectiveQA, a narrative reasoning benchmark with an average context length of over 100K tokens.
We use detective novels as data sources, which naturally have various reasoning elements.
We manually annotated 600 questions in Chinese and then also provided an English edition of the context information and questions.
arXiv  Detail & Related papers  (2024-09-04T06:28:22Z)
- Prompting Large Language Models with Knowledge Graphs for Question Answering Involving Long-tail Facts [50.06633829833144]
Large Language Models (LLMs) are effective in performing various NLP tasks, but struggle to handle tasks that require extensive, real-world knowledge.
We propose a benchmark that requires knowledge of long-tail facts for answering the involved questions.
Our experiments show that LLMs alone struggle with answering these questions, especially when the long-tail level is high or rich knowledge is required.
arXiv  Detail & Related papers  (2024-05-10T15:10:20Z)
- A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models [71.25225058845324]
Large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation.
Retrieval-Augmented Generation (RAG) can offer reliable and up-to-date external knowledge.
RA-LLMs have emerged to harness external and authoritative knowledge bases, rather than relying on the model's internal knowledge.
arXiv  Detail & Related papers  (2024-05-10T02:48:45Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv  Detail & Related papers  (2024-02-23T01:30:39Z)
- Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv  Detail & Related papers  (2024-01-13T15:59:09Z)
- Can LLMs Grade Short-Answer Reading Comprehension Questions: An Empirical Study with a Novel Dataset [0.0]
This paper investigates the potential for the newest version of Large Language Models (LLMs) to be used in short answer questions for formative assessments.
It introduces a novel dataset of short answer reading comprehension questions, drawn from a set of reading assessments conducted with over 150 students in Ghana.
The paper empirically evaluates how well various configurations of generative LLMs grade student short answer responses compared to expert human raters.
arXiv  Detail & Related papers  (2023-10-26T17:05:40Z)
- A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction [60.70089334782383]
Large language models (LLMs) have demonstrated great potential for domain-specific applications.
Recent disputes over GPT-4's law evaluation raise questions concerning their performance in real-world legal tasks.
We design practical baseline solutions based on LLMs and test on the task of legal judgment prediction.
arXiv  Detail & Related papers  (2023-10-18T07:38:04Z)
- NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain [0.0]
NuclearQA is a human-made benchmark of 100 questions to evaluate language models in the nuclear domain.
We show how the mix of several types of questions makes our benchmark uniquely capable of evaluating models in the nuclear domain.
arXiv  Detail & Related papers  (2023-10-17T01:27:20Z)
- Mastering the Task of Open Information Extraction with Large Language Models and Consistent Reasoning Environment [52.592199835286394]
Open Information Extraction (OIE) aims to extract objective structured knowledge from natural texts.
Large language models (LLMs) have exhibited remarkable in-context learning capabilities.
arXiv  Detail & Related papers  (2023-10-16T17:11:42Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
 Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv  Detail & Related papers  (2023-10-09T07:27:15Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv  Detail & Related papers  (2023-10-05T00:04:12Z)
- Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv  Detail & Related papers  (2023-09-07T01:35:24Z)
- Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation [109.8527403904657]
We show that large language models (LLMs) possess unwavering confidence in their knowledge and cannot handle the conflict between internal and external knowledge well.
Retrieval augmentation proves to be an effective approach in enhancing LLMs' awareness of knowledge boundaries.
We propose a simple method to dynamically utilize supporting documents with our judgement strategy.
arXiv  Detail & Related papers  (2023-07-20T16:46:10Z)
- When Giant Language Brains Just Aren't Enough! Domain Pizzazz with Knowledge Sparkle Dust [15.484175299150904]
This paper presents an empirical analysis aimed at bridging the gap in adapting large language models to practical use cases.
We select the question answering (QA) task of insurance as a case study due to its challenge of reasoning.
Based on this task, we design a new model that relies on LLMs empowered by additional knowledge extracted from insurance policy rulebooks and DBPedia.
arXiv  Detail & Related papers  (2023-05-12T03:49:59Z) 
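
As referenced in the "Towards Contamination Resistant Benchmarks" entry above, the Caesar-cipher mechanism behind that benchmark is simple enough to sketch directly. The snippet below is a generic, hypothetical illustration (the `caesar_shift` helper is not that paper's released code); it shows why such probes resist contamination: fresh plaintext/shift pairs, and their exact expected answers, can be generated on demand rather than memorized.

```python
# Generic sketch of a Caesar-cipher probe in the spirit of the entry above
# (e.g., "ab" -> "bc" at shift 1); not the benchmark's actual implementation.

def caesar_shift(text: str, shift: int) -> str:
    """Shift each lowercase letter by `shift` positions, wrapping around 'z'."""
    out = []
    for ch in text:
        if ch.islower():
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)  # leave spaces, digits, and punctuation untouched
    return "".join(out)

assert caesar_shift("ab", 1) == "bc"  # the example given in the abstract summary

# Fresh question/answer pairs can be generated endlessly, so a model cannot have
# memorized the expected outputs from its training data.
plaintext, shift = "hello world", 3
prompt = f'Apply a Caesar cipher with shift {shift} to "{plaintext}".'
expected = caesar_shift(plaintext, shift)  # "khoor zruog"
```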
        This list is automatically generated from the titles and abstracts of the papers in this site.
       
     