An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software
- URL: http://arxiv.org/abs/2509.13471v1
- Date: Tue, 16 Sep 2025 19:13:26 GMT
- Title: An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software
- Authors: Sina Gogani-Khiabani, Ashutosh Trivedi, Diptikalyan Saha, Saeid Tizpaz-Niari
- Abstract summary: Large language models (LLMs) show promise for translating natural-language statutes into executable logic. We present an agentic approach for developing legal-critical software using U.S. federal tax preparation as a case study.
- Score: 7.672965856139587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) show promise for translating natural-language statutes into executable logic, but reliability in legally critical settings remains challenging due to ambiguity and hallucinations. We present an agentic approach for developing legal-critical software, using U.S. federal tax preparation as a case study. The key challenge is test-case generation under the oracle problem, where correct outputs require interpreting law. Building on metamorphic testing, we introduce higher-order metamorphic relations that compare system outputs across structured shifts among similar individuals. Because authoring such relations is tedious and error-prone, we use an LLM-driven, role-based framework to automate test generation and code synthesis. We implement a multi-agent system that translates tax code into executable software and incorporates a metamorphic-testing agent that searches for counterexamples. In experiments, our framework using a smaller model (GPT-4o-mini) achieves a worst-case pass rate of 45%, outperforming frontier models (GPT-4o and Claude 3.5, 9-15%) on complex tax-code tasks. These results support agentic LLM methodologies as a path to robust, trustworthy legal-critical software from natural-language specifications.
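To make the higher-order relations concrete, the sketch below is a minimal Python illustration rather than the paper's artifact: `toy_liability`, the `Taxpayer` fields, and both relations (wage monotonicity, and preservation of the sign of the liability gap between two similar filers under a shared wage shift) are our own assumptions for exposition, not the paper's exact relations.

```python
# Minimal sketch (not the paper's implementation) of metamorphic checks in the
# spirit of its higher-order relations. `toy_liability` is a hypothetical
# stand-in for the synthesized tax-prep code under test.
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Taxpayer:
    wages: float
    filing_status: str      # "single" or "married_joint" (assumed fields)
    num_dependents: int

def toy_liability(t: Taxpayer) -> float:
    """Toy progressive schedule standing in for the system under test."""
    brackets = [(11_000, 0.10), (44_725, 0.12), (float("inf"), 0.22)]
    taxable, tax, lower = max(t.wages, 0.0), 0.0, 0.0
    for upper, rate in brackets:
        tax += rate * max(min(taxable, upper) - lower, 0.0)
        lower = upper
    return tax

def first_order(t: Taxpayer, delta: float) -> bool:
    """MR1: raising wages, all else equal, should never lower liability."""
    return toy_liability(replace(t, wages=t.wages + delta)) >= toy_liability(t)

def higher_order(a: Taxpayer, b: Taxpayer, delta: float) -> bool:
    """MR2 (higher-order): for two similar filers given the same wage shift,
    the sign of their liability gap should be preserved."""
    gap_before = toy_liability(a) - toy_liability(b)
    gap_after = (toy_liability(replace(a, wages=a.wages + delta))
                 - toy_liability(replace(b, wages=b.wages + delta)))
    return (gap_before >= 0) == (gap_after >= 0)

def search_counterexamples(trials: int = 1_000) -> list:
    """Randomized search loop, loosely analogous to the metamorphic-testing agent."""
    failures = []
    for _ in range(trials):
        base = Taxpayer(random.uniform(0, 200_000),
                        random.choice(["single", "married_joint"]),
                        random.randint(0, 3))
        peer = replace(base, wages=base.wages + random.uniform(0, 5_000))
        delta = random.uniform(0, 10_000)
        if not (first_order(base, delta) and higher_order(peer, base, delta)):
            failures.append((base, peer, delta))
    return failures

if __name__ == "__main__":
    print(f"counterexamples found: {len(search_counterexamples())}")
```

The search loop flags inputs where a relation is violated without needing a ground-truth tax oracle, which is the practical appeal of metamorphic relations under the oracle problem.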
Related papers
- Code Fingerprints: Disentangled Attribution of LLM-Generated Code [7.515488307576106]
We study the problem of model-level code attribution, which aims to determine the source LLM responsible for generated code. We propose the Disentangled Code Attribution Network (DCAN), which separates Source-Agnostic semantic information from Source-Specific stylistic representations. We construct the first large-scale benchmark dataset comprising code generated by four widely used Large Language Models (LLMs) across four programming languages.
arXiv Detail & Related papers (2026-03-04T15:58:36Z) - SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [59.90381306452982]
Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developers. We introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests.
arXiv Detail & Related papers (2025-11-07T18:01:32Z) - Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: the System Shell Layer, the Prompt Orchestration Layer, and the LLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z) - IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z) - Augmenting Large Language Models with Static Code Analysis for Automated Code Quality Improvements [0.36832029288386137]
This study examined code issue detection and revision automation by integrating Large Language Models (LLMs) into software development. A static code analysis framework detects issues such as bugs, vulnerabilities, and code smells within a large-scale software project. Retrieval-augmented generation (RAG) is implemented to enhance the relevance and precision of the revisions.
arXiv Detail & Related papers (2025-06-12T03:39:25Z) - Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study in which we investigate GPT-4's effectiveness in recommending and suggesting idiomatic actions. Our findings underscore the potential of LLMs to achieve tasks where, in the past, implementing recommenders based on complex code analyses was required.
arXiv Detail & Related papers (2025-01-28T15:41:54Z) - On the Potential and Limitations of Few-Shot In-Context Learning to Generate Metamorphic Specifications for Tax Preparation Software [12.071874385139395]
Nearly 50% of taxpayers filed their individual income taxes using tax software in the U.S. in FY22.
This paper formulates the task of generating metamorphic specifications as a translation task between properties extracted from tax documents (in natural language) and formal metamorphic specifications.
arXiv Detail & Related papers (2023-11-20T18:12:28Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Leveraging Large Language Models for Automated Proof Synthesis in Rust [6.202137610101939]
Large Language Models (LLMs) have shown success in code analysis and synthesis.
We present a combination of LLMs and static analysis to synthesize invariants, assertions, and other proof structures for a Rust-based formal verification framework called Verus.
Our prototype decomposes the verification task into multiple smaller ones, iteratively queries GPT-4, and combines its output with lightweight static analysis.
arXiv Detail & Related papers (2023-11-07T05:47:47Z) - LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society.
In this work, we reformulate such tasks as modular neurosymbolic programming, which we call LINC.
We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z) - Metamorphic Testing and Debugging of Tax Preparation Software [2.185694185279913]
We focus on open-source tax preparation software for our case study.
We develop a randomized test-case generation strategy to systematically validate the correctness of tax preparation software.
arXiv Detail & Related papers (2022-05-10T16:10:10Z)
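As a companion to the entry above, the sketch below illustrates one way randomized test-case generation can be paired with oracle-free properties. It uses the `hypothesis` library as our own tool choice, and `compute_tax` plus the specific properties are hypothetical stand-ins, not the cited paper's test suite.

```python
# Illustrative only: a property-based variant of randomized test-case
# generation for tax software. `compute_tax` is a hypothetical stand-in
# for the open-source software under test (flat toy rule for the demo).
from hypothesis import given, settings, strategies as st

def compute_tax(wages: float, filing_status: str, dependents: int) -> float:
    """Stand-in for the system under test: toy deduction plus a flat rate."""
    return 0.12 * max(wages - 13_850 - 2_000 * dependents, 0.0)

statuses = st.sampled_from(["single", "married_joint", "head_of_household"])

@settings(max_examples=500)
@given(wages=st.floats(min_value=0, max_value=1_000_000),
       status=statuses,
       dependents=st.integers(min_value=0, max_value=5))
def test_oracle_free_properties(wages, status, dependents):
    tax = compute_tax(wages, status, dependents)
    # Sanity properties that need no ground-truth oracle (illustrative):
    assert tax >= 0.0            # liability is never negative
    assert tax <= wages          # wage-only liability cannot exceed wages
    # Metamorphic check: adding a dependent should not raise the liability.
    assert compute_tax(wages, status, dependents + 1) <= tax + 1e-6

if __name__ == "__main__":
    test_oracle_free_properties()
    print("no counterexamples found")
```

Running the module executes 500 generated scenarios; any assertion failure is reported by `hypothesis` together with a shrunk counterexample, which is where debugging of the tax logic would start.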