SteuerLLM: Local specialized large language model for German tax law analysis
- URL: http://arxiv.org/abs/2602.11081v1
- Date: Wed, 11 Feb 2026 17:46:01 GMT
- Title: SteuerLLM: Local specialized large language model for German tax law analysis
- Authors: Sebastian Wind, Jeta Sopa, Laurin Schmid, Quirin Jackl, Sebastian Kiefer, Fei Wu, Martin Mayr, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh
- Abstract summary: Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules. We generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. We present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset.
- Score: 8.82402339973647
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at https://steuerllm.i5.ai.fau.de.
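The abstract credits SteuerEx with a statement-level, partial-credit evaluation framework but does not spell out the scoring rules. The sketch below illustrates one plausible reading, assuming each reference answer is decomposed into weighted statements whose points a model answer earns individually; the `Statement` fields, the verbatim-substring matching, and the example rubric are illustrative assumptions, not the benchmark's actual implementation.

```python
# Minimal sketch of a statement-level, partial-credit grader.
# The matching rule and weights below are assumptions, not the
# paper's actual scoring scheme.
from dataclasses import dataclass

@dataclass
class Statement:
    text: str      # one gradable claim, e.g. a statutory citation or a computed amount
    weight: float  # points awarded if the model's answer contains this claim

def grade_answer(answer: str, rubric: list[Statement]) -> float:
    """Return the fraction of rubric points earned by `answer`.

    A statement counts as matched if its text occurs verbatim in the
    answer -- a deliberately crude stand-in for whatever semantic
    matching the benchmark actually performs.
    """
    earned = sum(s.weight for s in rubric if s.text.lower() in answer.lower())
    total = sum(s.weight for s in rubric)
    return earned / total if total else 0.0

# Usage: citing the correct provision earns 2 points, the correct amount 1 point.
rubric = [
    Statement("§ 15 EStG", 2.0),
    Statement("12.500 EUR", 1.0),
]
print(grade_answer("Gewinn nach § 15 EStG: 12.500 EUR", rubric))  # 1.0
```

Scoring per statement rather than per answer is what enables partial credit: an answer that cites the correct provision but miscomputes the amount would still earn two of the three points in this toy rubric.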
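The abstract likewise describes a controlled retrieval-augmented pipeline for generating the synthetic training data from authentic examination material, without giving interface details. The following is a minimal sketch of such a loop; `retrieve_statutes`, `llm_generate`, and the prompt wording are hypothetical placeholders standing in for whatever retriever and teacher model the authors actually used.

```python
# Minimal sketch of a controlled retrieval-augmented generation loop
# for synthetic training data. All names and the prompt are
# hypothetical placeholders, not the paper's implementation.
from typing import Callable

def build_synthetic_example(
    exam_question: str,
    retrieve_statutes: Callable[[str, int], list[str]],
    llm_generate: Callable[[str], str],
    k: int = 3,
) -> dict:
    """Turn one authentic exam question into a grounded training pair."""
    # 1. Retrieve the k statute passages most relevant to the question,
    #    so the generated answer stays anchored in the actual law.
    passages = retrieve_statutes(exam_question, k)
    context = "\n\n".join(passages)
    # 2. Ask the teacher model for an answer constrained to that context.
    prompt = (
        "Answer the following German tax-law exam question using only "
        f"the statutes provided.\n\nStatutes:\n{context}\n\n"
        f"Question: {exam_question}\nAnswer with exact citations:"
    )
    answer = llm_generate(prompt)
    # 3. Keep the retrieved context alongside the pair for later filtering.
    return {"question": exam_question, "context": passages, "answer": answer}
```

Constraining the teacher model to retrieved statute passages is the "controlled" part of this sketch: it keeps generated answers anchored to citable law rather than the model's parametric memory.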
Related papers
- LegalOne: A Family of Foundation Models for Reliable Legal Reasoning [54.57434222018289]
We present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI.
arXiv Detail & Related papers (2026-01-31T10:18:32Z)
- CaseFacts: A Benchmark for Legal Fact-Checking and Precedent Retrieval [5.305110876082343]
CaseFacts is a benchmark for verifying legal claims against U.S. Supreme Court precedents. The dataset consists of 6,294 claims categorized as Supported, Refuted, or Overruled.
arXiv Detail & Related papers (2026-01-23T23:41:46Z)
- An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software [7.672965856139587]
Large language models (LLMs) show promise for translating natural-language statutes into executable logic. We present an agentic approach for developing legal-critical software using U.S. federal tax preparation as a case study.
arXiv Detail & Related papers (2025-09-16T19:13:26Z)
- Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny [78.1575956773948]
Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes are neither reliable nor scalable. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes.
arXiv Detail & Related papers (2025-07-22T08:13:01Z)
- Using Large Language Models for Legal Decision-Making in Austrian Value-Added Tax Law: An Experimental Study [0.0]
This paper provides an experimental evaluation of the capability of large language models (LLMs) to assist in legal decision-making within the framework of Austrian and European Union value-added tax (VAT) law.
arXiv Detail & Related papers (2025-07-11T10:19:56Z)
- RLJP: Legal Judgment Prediction via First-Order Logic Rule-enhanced with Large Language Models [58.69183479148083]
Legal Judgment Prediction (LJP) is a pivotal task in legal AI. Existing LJP models integrate judicial precedents and legal knowledge for high performance, but they neglect legal reasoning logic, a critical component of legal judgments requiring rigorous logical analysis. This paper proposes a rule-enhanced legal judgment prediction framework based on first-order logic (FOL) formalism and comparative learning (CL).
arXiv Detail & Related papers (2025-05-27T14:50:21Z)
- LEXam: Benchmarking Legal Reasoning on 340 Law Exams [76.3521146499006]
We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities.
arXiv Detail & Related papers (2025-05-19T08:48:12Z)
- Technical Challenges in Maintaining Tax Prep Software with Large Language Models [6.419602857618507]
Our research focuses on identifying, understanding, and tackling technical challenges in leveraging large language models (LLMs) such as ChatGPT and Llama to faithfully extract code differentials from IRS publications.
arXiv Detail & Related papers (2025-04-25T21:00:20Z)
- Taxation Perspectives from Large Language Models: A Case Study on Additional Tax Penalties [5.185522256407782]
We introduce PLAT, a new benchmark designed to assess the ability of LLMs to predict the legitimacy of additional tax penalties. Our experiments with six LLMs reveal that their baseline capabilities are limited, especially when dealing with conflicting issues that demand a comprehensive understanding.
arXiv Detail & Related papers (2025-03-05T12:24:20Z)
- Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration [52.57055162778548]
Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI.
Precedents are the previous legal cases with similar facts, which are the basis for the judgment of the subsequent case in national legal systems.
Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task.
arXiv Detail & Related papers (2023-10-13T16:47:20Z)
- SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system.
Most existing language models have difficulty understanding the long-distance dependencies between different structures.
We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z)