Fugu-MT Paper Translation (Abstract): Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

Paper overview: Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

arxiv url: http://arxiv.org/abs/2604.04386v1
Date: Mon, 06 Apr 2026 03:27:48 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.078384
Title: Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis
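Title (reference translation): Automatic Generation of Problems via Hypothesis-Driven Error Analysis
Authors: Jiayu Fu, Mourad Heddaya, Chenhao Tan
Abstract summary: We propose a new benchmark generation pipeline that uses AI-generated hypotheses to identify the specific math concepts and skills that LLMs struggle with. The pipeline is highly adaptable and can be applied beyond math to explore a wide range of LLM capabilities.
Reference score (independently calculated attention score): 16.008582390875656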
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Numerous math benchmarks exist to evaluate LLMs' mathematical capabilities. However, most involve extensive manual effort and are difficult to scale. Consequently, they cannot keep pace with LLM development or easily provide new instances to mitigate overfitting. Some researchers have proposed automatic benchmark generation methods, but few focus on identifying the specific math concepts and skills on which LLMs are error-prone, and most can only generate category-specific benchmarks. To address these limitations, we propose a new math benchmark generation pipeline that uses AI-generated hypotheses to identify the specific math concepts and skills that LLMs struggle with, and then generates new benchmark problems targeting these weaknesses. Experiments show that hypothesis accuracy positively correlates with the difficulty of the generated problems: problems generated from the most accurate hypotheses reduce Llama-3.3-70B-Instruct's accuracy to as low as 45%, compared to 77% on the original MATH benchmark. Furthermore, our pipeline is highly adaptable and can be applied beyond math to explore a wide range of LLM capabilities, making it a valuable tool for investigating how LLMs perform across different domains.
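Abstract (reference translation): Many math benchmarks exist for evaluating the mathematical capabilities of LLMs. However, most require extensive manual work and are hard to scale. As a result, they cannot keep pace with LLM development or readily supply new instances to mitigate overfitting. Some researchers have proposed automatic benchmark generation methods, but few focus on identifying the specific math concepts and skills on which LLMs are error-prone, and most can generate only category-specific benchmarks. To address these limitations, we propose a new math benchmark generation pipeline that uses AI-generated hypotheses to identify the specific math concepts and skills that LLMs struggle with and then generates new benchmark problems targeting those weaknesses. Problems generated from the most accurate hypotheses reduce Llama-3.3-70B-Instruct's accuracy to as low as 45%, compared with 77% on the original MATH benchmark. Furthermore, the pipeline is highly adaptable and can be applied beyond math to explore a wide range of LLM capabilities, making it a valuable tool for investigating how LLMs perform across different domains.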

Related papers