Reasoning Models Ace the CFA Exams
- URL: http://arxiv.org/abs/2512.08270v1
- Date: Tue, 09 Dec 2025 05:57:19 GMT
- Title: Reasoning Models Ace the CFA Exams
- Authors: Jaisal Patel, Yunzhe Chen, Kaiwen He, Keyi Wang, David Li, Kairong Xiao, Xiao-Yang Liu
- Abstract summary: We evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1.
- Score: 6.899142543217881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
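To make the grading procedure concrete, here is a minimal sketch of per-level scoring with a pass/fail cutoff. The 70% threshold and the input format are illustrative assumptions; the paper states only that it reuses the pass/fail criteria of prior studies, which may differ.

```python
# Minimal sketch of per-level exam scoring with a pass/fail threshold.
# The 70% minimum passing score is an assumption for illustration.
from collections import defaultdict

PASS_THRESHOLD = 0.70  # hypothetical minimum passing score

def score_exams(results):
    """results: list of (level, is_correct) pairs, one per question."""
    correct, total = defaultdict(int), defaultdict(int)
    for level, is_correct in results:
        total[level] += 1
        correct[level] += int(is_correct)
    return {
        level: {
            "accuracy": correct[level] / total[level],
            "passed": correct[level] / total[level] >= PASS_THRESHOLD,
        }
        for level in total
    }

# Example: a model answering 3 of 4 Level I questions correctly.
print(score_exams([("I", True), ("I", True), ("I", True), ("I", False)]))
```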
Related papers
- PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
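As a rough illustration of rubric-based LLM-as-a-judge scoring, a minimal sketch follows. The `call_llm` helper, the template wording, and the PASS/FAIL scheme are assumptions for illustration, not the benchmark's actual protocol.

```python
# Minimal sketch of rubric-based LLM-as-a-judge scoring. The rubric
# items and the pass/fail scheme are illustrative, not PanCanBench's
# actual protocol.

def call_llm(prompt: str) -> str:
    # Placeholder: wire this to the model API of your choice.
    raise NotImplementedError

JUDGE_TEMPLATE = """You are grading a model's answer to a patient question.
Question: {question}
Answer: {answer}
Rubric item: {item}
Reply with exactly PASS or FAIL."""

def judge_answer(question: str, answer: str, rubric: list[str]) -> float:
    """Return the fraction of rubric items the answer satisfies."""
    passed = 0
    for item in rubric:
        verdict = call_llm(JUDGE_TEMPLATE.format(
            question=question, answer=answer, item=item))
        passed += verdict.strip().upper().startswith("PASS")
    return passed / len(rubric)
```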
arXiv Detail & Related papers (2026-03-02T00:50:39Z) - Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics (IOAA) [43.53870250026015]
We benchmark five large language models (LLMs) on the International Olympiad on Astronomy and Astrophysics (IOAA) exams. With average scores of 85.6% and 84.2%, Gemini 2.5 Pro and GPT-5 rank in the top two among 200-300 participants in all four IOAA theory exams evaluated. GPT-5 still excels in the exams with an 88.5% average score, ranking top 10 among the participants in the four most recent IOAAs.
arXiv Detail & Related papers (2025-10-06T16:58:47Z) - Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III [0.0]
This paper presents a benchmark evaluating 23 state-of-the-art Large Language Models (LLMs) on the Chartered Financial Analyst (CFA) Level III exam. We assess both multiple-choice questions (MCQs) and essay-style responses using multiple prompting strategies, including Chain-of-Thought and Self-Discover. Our evaluation reveals that leading models demonstrate strong capabilities, with composite scores such as 79.1% (o4-mini) and 77.3% (Gemini 2.5 Flash) on CFA Level III.
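To illustrate the Chain-of-Thought prompting strategy mentioned above, here is a minimal sketch for a CFA-style multiple-choice item. The template wording and answer-extraction convention are assumptions, not the paper's exact prompts.

```python
# Minimal sketch of Chain-of-Thought prompting for a CFA-style MCQ.
# The template wording is illustrative; the paper's exact prompts differ.

COT_TEMPLATE = """{vignette}

Question: {question}
A) {a}
B) {b}
C) {c}

Think step by step about the relevant concepts, then end with a
final line of the form "Answer: <letter>"."""

def build_cot_prompt(vignette, question, choices):
    a, b, c = choices
    return COT_TEMPLATE.format(vignette=vignette, question=question,
                               a=a, b=b, c=c)

def parse_answer(completion: str) -> str | None:
    """Extract the final 'Answer: X' letter, if present."""
    for line in reversed(completion.strip().splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()[:1].upper()
    return None
```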
arXiv Detail & Related papers (2025-06-29T19:54:57Z) - Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis [0.0]
o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked to answer 100 randomly chosen questions from the Royal College of General Practitioners GP SelfTest. The total scores of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were 99.0%, 95.0%, 95.0%, and 95.0%, respectively.
arXiv Detail & Related papers (2025-06-03T15:25:38Z) - R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation [75.33671166231096]
We introduce a graduate-level, multi-disciplinary, English-Chinese benchmark, dubbed Reasoning Bench (R-Bench). R-Bench spans 1,094 questions across 108 subjects for language model evaluation and 665 questions across 83 subjects for multimodal model testing. We evaluate widely used models, including OpenAI o1, GPT-4o, DeepSeek-R1, etc.
arXiv Detail & Related papers (2025-05-04T07:48:36Z) - Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4,428 competition-level problems with rigorous human annotation. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy respectively, highlighting significant challenges in Olympiad-level mathematical reasoning.
arXiv Detail & Related papers (2024-10-10T14:39:33Z) - OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z) - Evaluating the Performance of Large Language Models for Spanish Language in Undergraduate Admissions Exams [4.974500659156055]
This study evaluates the performance of large language models, specifically GPT-3.5 and BARD, in undergraduate admissions exams proposed by the National Polytechnic Institute in Mexico.
Both models demonstrated proficiency, exceeding the minimum acceptance scores of the respective academic programs by up to 75% in some cases.
arXiv Detail & Related papers (2023-12-28T06:23:39Z) - Text Classification via Large Language Models [63.1874290788797]
We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks.
More importantly, we find that CARP performs impressively in low-resource and domain-adaptation setups.
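As a loose sketch of the clue-then-reason idea behind CARP (not the paper's exact prompt), the following shows a two-step template for a binary sentiment task:

```python
# Minimal sketch of a CARP-style two-step prompt for text classification.
# The template wording is a paraphrase of the clue-then-reason idea,
# not the paper's actual prompt.

CARP_TEMPLATE = """Classify the sentiment of the input text as positive or negative.

Input: {text}
Step 1: List clues (keywords, phrases, tone) relevant to the label.
Step 2: Reason step by step from the clues to a diagnosis.
Step 3: Output a final line of the form "Label: <positive|negative>"."""

def build_carp_prompt(text: str) -> str:
    return CARP_TEMPLATE.format(text=text)

print(build_carp_prompt("The plot drags, but the acting is superb."))
```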
arXiv Detail & Related papers (2023-05-15T06:24:45Z) - Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams [4.2706617195518195]
This study aims to explore the capabilities of Language Models (LMs) in tackling high-stakes multiple-choice tests.
The admission exam poses challenging tasks for LMs, since its questions may span multiple fields of knowledge.
The best-performing model, GPT-4 with Chain-of-Thought prompts, achieved an accuracy of 87%, largely surpassing GPT-3.5 by 11 points.
arXiv Detail & Related papers (2023-03-29T20:10:13Z) - GPT Takes the Bar Exam [0.0]
We document our experimental evaluation of the performance of OpenAI's text-davinci-003 model, often referred to as GPT-3.5.
With the best prompt and parameters, GPT-3.5 achieves a headline correct rate of 50.3% on a complete NCBE MBE practice exam.
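The phrase "best prompt and parameters" implies a search over prompting and sampling settings. Below is a minimal sketch of such a grid search; the `ask_model` helper, prompt variants, and temperature grid are hypothetical stand-ins, not the paper's actual configuration.

```python
# Minimal sketch of a prompt/temperature grid search for MCQ accuracy.
# ask_model is a hypothetical helper; the prompt variants and the
# temperature grid are illustrative, not the paper's actual settings.
from itertools import product

def ask_model(prompt: str, temperature: float) -> str:
    # Placeholder: wire this to the completion API of your choice.
    raise NotImplementedError

PROMPTS = ["Answer with a single letter:", "Rank the choices, then answer:"]
TEMPERATURES = [0.0, 0.5, 1.0]

def sweep(questions):
    """questions: list of (question_body, gold_letter) pairs.
    Returns the best (prompt, temperature) setting and its accuracy."""
    best_setting, best_acc = None, -1.0
    for prefix, temp in product(PROMPTS, TEMPERATURES):
        correct = sum(
            ask_model(f"{prefix}\n{body}", temp).strip().upper().startswith(gold)
            for body, gold in questions
        )
        acc = correct / len(questions)
        if acc > best_acc:
            best_setting, best_acc = (prefix, temp), acc
    return best_setting, best_acc
```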
arXiv Detail & Related papers (2022-12-29T18:19:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.