Benchmarking LLM Agents for Wealth-Management Workflows
- URL: http://arxiv.org/abs/2512.02230v1
- Date: Mon, 01 Dec 2025 21:56:21 GMT
- Title: Benchmarking LLM Agents for Wealth-Management Workflows
- Authors: Rory Milsom
- Abstract summary: This dissertation extends TheAgentCompany with a finance-focused environment. It investigates whether a general-purpose LLM agent can complete representative wealth-management tasks both accurately and economically.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern work relies on an assortment of digital collaboration tools, yet routine processes continue to suffer from human error and delay. To address this gap, this dissertation extends TheAgentCompany with a finance-focused environment and investigates whether a general-purpose LLM agent can complete representative wealth-management tasks both accurately and economically. The study introduces synthetic domain data, enriches colleague simulations, and prototypes an automatic task-generation pipeline, aiming to create and assess an evaluation set that meaningfully measures an agent's fitness for assistant-level wealth-management work. We construct a benchmark of 12 task-pairs for wealth-management assistants spanning retrieval, analysis, and synthesis/communication, with explicit acceptance criteria and deterministic graders. We seeded a set of new finance-specific data and introduced a high- vs. low-autonomy variant of every task. The paper concludes that agents are limited less by mathematical reasoning than by end-to-end workflow reliability, that performance is meaningfully affected by autonomy level, and that incorrect evaluation of models has hindered benchmarking.
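The abstract's core design (task-pairs with explicit acceptance criteria, deterministic graders, and a high- vs. low-autonomy variant of each task) can be made concrete with a small sketch. The dissertation does not publish its schema, so everything below is an assumption: all class names, fields, and the acceptance value are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

# Hypothetical schema for one benchmark task-pair; names are illustrative,
# not taken from the dissertation's actual code.

class Autonomy(Enum):
    HIGH = "high"  # agent receives only the goal and plans end-to-end
    LOW = "low"    # agent receives step-by-step instructions

@dataclass
class TaskVariant:
    autonomy: Autonomy
    prompt: str

@dataclass
class TaskPair:
    task_id: str
    category: str                  # "retrieval", "analysis", or "synthesis/communication"
    variants: tuple[TaskVariant, TaskVariant]
    grader: Callable[[str], bool]  # deterministic: agent output -> pass/fail

def grade_portfolio_total(output: str) -> bool:
    """Deterministic grader: pass iff the expected figure appears verbatim."""
    return "1,234,567.89" in output  # acceptance criterion (made-up value)

task = TaskPair(
    task_id="wm-retrieval-03",
    category="retrieval",
    variants=(
        TaskVariant(Autonomy.HIGH,
                    "Report the total market value of client A's portfolio."),
        TaskVariant(Autonomy.LOW,
                    "Open holdings.csv, sum the market_value column for "
                    "client A, and report the total."),
    ),
    grader=grade_portfolio_total,
)
```

A deterministic check like this keeps pass/fail reproducible across runs, which is relevant to the abstract's closing point that incorrect evaluation of models has hindered benchmarking.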
Related papers
- What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding [50.35012849818872]
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks. We propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. Our experiments reveal that task success is often a poor proxy for environment understanding, and that current memory mechanisms cannot effectively help agents acquire a grounded model of the environment.
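The T2Q summary above suggests a simple shape: run the task, then quiz the agent about the resulting world state and score that separately from task success. The sketch below is a hedged paraphrase of that idea; `quiz_accuracy` and the example questions are inventions, not the paper's code.

```python
# Hypothetical Task2Quiz-style probe: after the episode, ask the agent
# factual questions about the environment and compare with ground truth
# derived from the actual world state. Names are illustrative only.
from typing import Callable

def quiz_accuracy(answer_fn: Callable[[str], str],
                  quiz: list[tuple[str, str]]) -> float:
    """quiz holds (question, ground_truth) pairs; returns fraction correct."""
    correct = sum(answer_fn(q).strip().lower() == a.strip().lower()
                  for q, a in quiz)
    return correct / len(quiz)

# Task success and quiz accuracy are reported as separate numbers, since
# the paper finds the former is a poor proxy for the latter.
quiz = [("How many invoices are in /shared/inbox?", "12"),
        ("Which colleague last edited budget.xlsx?", "alice")]
print(quiz_accuracy(lambda q: "12" if "invoices" in q else "Alice", quiz))  # 1.0
```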
arXiv Detail & Related papers (2026-01-14T14:09:11Z)
- Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs [21.656551146954587]
Large Language Models (LLMs) offer a path to automation. We introduce a novel, structured dataset from 190 corporate reports. Our results reveal a clear performance gap between qualitative and quantitative tasks.
arXiv Detail & Related papers (2025-12-30T15:28:03Z)
- AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress [71.02263260394261]
Large language models (LLMs) still encounter challenges in multi-turn decision-making tasks. We build process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
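Going only by the title's "step-wise promise and progress", one plausible (and entirely assumed) use of such a PRM is to rerank candidate actions at each turn; the scoring functions below are stand-ins, not AgentPRM's trained model.

```python
# Hypothetical step-wise PRM reranking: score each candidate action by
# "promise" (likelihood the trajectory still reaches the goal) combined
# with "progress" (contribution made toward the goal). The two model
# callables are placeholders for learned reward models.

def score_step(history: list[str], action: str,
               promise_model, progress_model,
               alpha: float = 0.5) -> float:
    promise = promise_model(history, action)    # assumed to return [0, 1]
    progress = progress_model(history, action)  # assumed to return [0, 1]
    return alpha * promise + (1 - alpha) * progress

def choose_action(history: list[str], candidates: list[str],
                  promise_model, progress_model) -> str:
    """Greedy step selection: pick the candidate with the highest PRM score."""
    return max(candidates,
               key=lambda a: score_step(history, a, promise_model, progress_model))
```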
arXiv Detail & Related papers (2025-11-11T14:57:54Z)
- How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes [5.848712585343904]
This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand. Our benchmark combines two datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption. Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends.
arXiv Detail & Related papers (2025-10-27T14:08:27Z)
- FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction [92.7392863957204]
FutureX is the largest and most diverse live benchmark for future prediction. It supports real-time daily updates and eliminates data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools.
arXiv Detail & Related papers (2025-08-16T08:54:08Z)
- FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks [52.47895046206854]
FieldWorkArena is a benchmark for agentic AI targeting real-world field work. This paper defines a new action space that agentic AI should possess for real-world work-environment benchmarks.
arXiv Detail & Related papers (2025-05-26T08:21:46Z)
- Sustainability via LLM Right-sizing [21.17523328451591]
Large language models (LLMs) have become increasingly embedded in organizational workflows. This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint.
arXiv Detail & Related papers (2025-04-17T04:00:40Z)
- EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments [0.0699049312989311]
We develop benchmarks for LLM agents that act in, learn from, and strategize in unknown environments. We also propose litmus tests, a new kind of quantitative measure for LLMs and LLM agents.
arXiv Detail & Related papers (2025-03-24T16:06:04Z)
- WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [85.95607119635102]
Large language models (LLMs) can mimic human-like intelligence. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
arXiv Detail & Related papers (2024-07-07T07:15:49Z)
- TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, it assesses task decomposition, tool selection, and parameter prediction.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
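As a small illustration of one of these axes, tool selection can be scored as set overlap against human-verified gold tools; the function below is an assumed metric shape, not TaskBench's published implementation.

```python
# Illustrative TaskBench-style metric: F1 between the tools an LLM selects
# for a task and the human-verified gold tool set. Names are assumptions;
# TaskBench's own evaluation also covers decomposition and parameter prediction.

def tool_selection_f1(predicted: set[str], gold: set[str]) -> float:
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)           # correctly selected tools
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(tool_selection_f1({"search", "calculator"},
                        {"search", "calculator", "plot"}))  # -> 0.8
```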
arXiv Detail & Related papers (2023-11-30T18:02:44Z)