Audit, Alignment, and Optimization of LM-Powered Subroutines with Application to Public Comment Processing
- URL: http://arxiv.org/abs/2507.08109v1
- Date: Thu, 10 Jul 2025 18:52:09 GMT
- Title: Audit, Alignment, and Optimization of LM-Powered Subroutines with Application to Public Comment Processing
- Authors: Reilly Raab, Mike Parker, Dan Nally, Sadie Montgomery, Anastasia Bernat, Sai Munikoti, Sameera Horawalavithana,
- Abstract summary: We propose a framework for declaring LM-powered subroutines for use within conventional asynchronous code. We use this framework to develop "CommentNEPA," an application that compiles, organizes, and summarizes a corpus of public commentary submitted in response to a project requiring environmental review.
- Score: 2.0417058495510374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of language models (LMs) has the potential to dramatically accelerate tasks that can be cast as text processing; however, real-world adoption is hindered by concerns regarding safety, explainability, and bias. How can we responsibly leverage LMs in a transparent, auditable manner -- minimizing risk and allowing human experts to focus on informed decision-making rather than data-processing or prompt engineering? In this work, we propose a framework for declaring statically typed, LM-powered subroutines (i.e., callable, function-like procedures) for use within conventional asynchronous code -- such that sparse feedback from human experts is used to improve the performance of each subroutine online (i.e., during use). In our implementation, all LM-produced artifacts (i.e., prompts, inputs, outputs, and data-dependencies) are recorded and exposed to audit on demand. We package this framework as a library to support its adoption and continued development. While this framework may be applicable across several real-world decision workflows (e.g., in healthcare and legal fields), we evaluate it in the context of public comment processing as mandated by the National Environmental Policy Act of 1969 (NEPA): specifically, we use this framework to develop "CommentNEPA," an application that compiles, organizes, and summarizes a corpus of public commentary submitted in response to a project requiring environmental review. We quantitatively evaluate the application by comparing its outputs (when operating without human feedback) to historical "ground-truth" data as labelled by human annotators during the preparation of official environmental impact statements.
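The abstract describes the core abstraction only at a high level. Below is a minimal, hypothetical sketch of what a statically typed, LM-powered subroutine with an on-demand audit trail might look like in asynchronous Python. The names (`lm_subroutine`, `AuditLog`, `call_model`) and the triage prompt are illustrative assumptions, not the API of the paper's released library, and the online-learning loop driven by sparse expert feedback is omitted.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable


@dataclass
class AuditRecord:
    # One record per LM invocation: the artifacts exposed to audit.
    prompt: str
    raw_input: str
    raw_output: str


@dataclass
class AuditLog:
    records: list = field(default_factory=list)


async def call_model(prompt: str) -> str:
    # Placeholder for a real LM call (e.g., an HTTP request to a hosted model).
    await asyncio.sleep(0)
    return "substantive: raises wetland-impact concerns"


def lm_subroutine(prompt_template: str, log: AuditLog) -> Callable[[str], Awaitable[str]]:
    # Declare a callable, function-like LM-powered procedure. Every invocation
    # records its prompt, input, and output so the artifacts can be audited later.
    async def subroutine(text: str) -> str:
        prompt = prompt_template.format(text=text)
        output = await call_model(prompt)
        log.records.append(AuditRecord(prompt=prompt, raw_input=text, raw_output=output))
        return output
    return subroutine


async def main() -> None:
    log = AuditLog()
    triage_comment = lm_subroutine(
        "Classify this public comment as substantive or non-substantive "
        "and name the issue it raises:\n{text}",
        log,
    )
    label = await triage_comment("The proposed road will fragment local wetlands.")
    print(label)
    print(len(log.records), "artifact(s) recorded for audit")


if __name__ == "__main__":
    asyncio.run(main())
```

The point of the sketch is that the LM call sits behind an ordinary typed async function usable from conventional code, while every prompt, input, and output is retained so a human reviewer can audit the data dependencies of any downstream decision.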
Related papers
- ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness [2.5967788365637103]
Large language models (LLMs) are increasingly valuable to corporate data management due to their ability to process text from various document formats. This work establishes a foundation for sensitivity-aware language models and provides insights to enhance privacy-centric AI systems in corporate environments.
arXiv Detail & Related papers (2025-06-01T11:24:23Z) - FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks [52.47895046206854]
FieldWorkArena is a benchmark for agentic AI targeting real-world field work. This paper defines a new action space that agentic AI should possess for real-world work environment benchmarks.
arXiv Detail & Related papers (2025-05-26T08:21:46Z) - Can AI automatically analyze public opinion? A LLM agents-based agentic pipeline for timely public opinion analysis [3.1894345568992346]
This study proposes and implements the first LLM-agent-based agentic pipeline for multi-task public opinion analysis. Unlike traditional methods, it offers an end-to-end, fully automated analytical workflow without requiring domain-specific training data. It enables timely, integrated public opinion analysis through a single natural language query.
arXiv Detail & Related papers (2025-05-16T16:09:28Z) - DICE: A Framework for Dimensional and Contextual Evaluation of Language Models [1.534667887016089]
Language models (LMs) are increasingly being integrated into a wide range of applications. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. We propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions.
arXiv Detail & Related papers (2025-04-14T16:08:13Z) - From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research [13.818244562506138]
Large Language Models (LLMs) provide a cost-effective and efficient alternative to human annotation. This paper introduces the "SILICON" (Systematic Inference with LLMs for Information Classification and Notation) workflow. The workflow integrates established principles of human annotation with systematic prompt optimization and model selection.
arXiv Detail & Related papers (2024-12-19T02:21:41Z) - Benchmarking LLMs for Environmental Review and Permitting [10.214978239010849]
The National Environmental Policy Act (NEPA) requires federal agencies to consider the environmental impacts of proposed actions. The effectiveness of Large Language Models (LLMs) in specialized domains like NEPA remains untested for adoption in federal decision-making processes. We present NEPAQuAD, the first comprehensive benchmark derived from EIS documents.
arXiv Detail & Related papers (2024-07-10T02:33:09Z) - CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z) - LEARN: Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application [54.984348122105516]
We propose an Llm-driven knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world knowledge with collaborative knowledge.
arXiv Detail & Related papers (2024-05-07T04:00:30Z) - Bayesian Preference Elicitation with Language Models [82.58230273253939]
We introduce OPEN, a framework that uses BOED to guide the choice of informative questions and an LM to extract features.
In user studies, we find that OPEN outperforms existing LM- and BOED-based methods for preference elicitation.
arXiv Detail & Related papers (2024-03-08T18:57:52Z) - LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
arXiv Detail & Related papers (2023-11-13T15:08:59Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)