Related papers: TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task

URL: http://arxiv.org/abs/2507.16126v1
Date: Tue, 22 Jul 2025 00:37:59 GMT
Title: TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task
Authors: Michael R. Bock, Kara Molisee, Zachary Ozer, Sumit Shah
Abstract summary: Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text. We propose TaxCalcBench, a benchmark for determining models' abilities to calculate personal income tax returns.
Score: 0.11999555634662631
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Can AI file your taxes? Not yet. Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully compute results. We propose TaxCalcBench, a benchmark for determining models' abilities to calculate personal income tax returns given all of the necessary information. Our experiment shows that state-of-the-art models succeed in calculating less than a third of federal income tax returns even on this simplified sample set. Our analysis concludes that models consistently misuse tax tables, make errors in tax calculation, and incorrectly determine eligibility. Our findings point to the need for additional infrastructure to apply LLMs to the personal income tax calculation task.

Related papers

TaxAgent: How Large Language Model Designs Fiscal Policy [22.859190941594296]
This study introduces TaxAgent, a novel integration of large language models (LLMs) with agent-based modeling (ABM) to design adaptive tax policies. In our macroeconomic simulation, heterogeneous H-Agents (households) simulate real-world taxpayer behaviors while the TaxAgent (government) utilizes LLMs to iteratively optimize tax rates, balancing equity and productivity. Benchmarked against Saez Optimal Taxation, U.S. federal income taxes, and free markets, TaxAgent achieves superior equity-efficiency trade-offs.
arXiv Detail & Related papers (2025-06-03T13:06:19Z)
A Taxation Perspective for Fair Re-ranking [61.946428892727795]
We introduce a new fair re-ranking method named Tax-rank, which levies taxes based on the difference in utility between two items. Our model Tax-rank offers a superior tax policy for fair re-ranking, theoretically demonstrating both continuity and controllability over accuracy loss.
arXiv Detail & Related papers (2024-04-27T08:21:29Z)
Learning Optimal Tax Design in Nonatomic Congestion Games [56.85292809260111]
In multiplayer games, self-interested behavior among the players can harm the social welfare. We take the initial step of learning the optimal tax that can induce social welfare with limited feedback in congestion games.
arXiv Detail & Related papers (2024-02-12T06:32:53Z)
On the Potential and Limitations of Few-Shot In-Context Learning to Generate Metamorphic Specifications for Tax Preparation Software [12.071874385139395]
Nearly 50% of taxpayers filed their individual income taxes using tax software in the U.S. in FY22. This paper formulates the task of generating metamorphic specifications as a translation task between properties extracted from tax documents.
arXiv Detail & Related papers (2023-11-20T18:12:28Z)
OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax? [50.46167465931653]
The authors explain where OpenAI got the tax law example in its livestream demonstration of GPT-4. They also explain how GPT-4 got the wrong answer and how it fails to reliably calculate taxes.
arXiv Detail & Related papers (2023-09-15T20:00:27Z)
Algorithmic Fairness and Vertical Equity: Income Fairness with IRS Tax Audit Models [73.24381010980606]
This study examines issues of algorithmic fairness in the context of systems that inform tax audit selection by the IRS. We show how the use of more flexible machine learning methods for selecting audits may affect vertical equity. Our results have implications for the design of algorithmic tools across the public sector.
arXiv Detail & Related papers (2022-06-20T16:27:06Z)
Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
Tax Knowledge Graph for a Smarter and More Personalized TurboTax [0.0]
We will share our innovative and practical approach to representing complicated U.S. and Canadian income tax compliance logic via a large-scale knowledge graph. We will cover how the Tax Knowledge Graph is constructed and automated, how it is used to calculate tax refunds, reasoned to find missing info, and navigated to explain the calculated results.
arXiv Detail & Related papers (2020-09-13T22:41:01Z)
The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies [119.07163415116686]
We train social planners that discover tax policies that can effectively trade off economic equality and productivity. We present an economic simulation environment that features competitive pressures and market dynamics. We show that AI-driven tax policies improve the trade-off between equality and productivity by 16% over baseline policies.
arXiv Detail & Related papers (2020-04-28T06:57:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.