Fugu-MT Paper Translation (Summary): CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

Paper summary: CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

arxiv url: http://arxiv.org/abs/2606.06526v1
Date: Tue, 02 Jun 2026 20:38:39 GMT
Status: Translation complete
System update date: 2026-06-08 14:33:29.343541
Title: CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions
Title (reference translation): CrowdMath: A dataset of discussions on crowdsourced mathematical research
Authors: Sherin Muckatira, Jesse Geneson, Slava Gerovitch, Pavel Etingof, Mikhail Gronas, Anna Rumshisky
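Abstract summary: We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES--Art of Problem Solving (AoPS) CrowdMath program. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion.
Reference score (independently computed attention score): 7.449578020792231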
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES--Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.
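Abstract (reference translation): Large language models have made considerable progress in mathematical reasoning, but existing benchmarks generally evaluate well-specified problems that come with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving, a setting in which participants propose partial arguments, identify gaps or errors in earlier steps, repair flawed reasoning, and gradually combine incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES--Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional role in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions: the best model achieves only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.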
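The abstract reports post-role classification as macro-F1 rather than accuracy. As a minimal sketch of how that metric is commonly computed, the Python below averages per-label F1 scores without weighting by label frequency. The role identifiers are adapted from the four roles named in the abstract (the paper's full label set may differ), the `macro_f1` helper and the toy `gold`/`pred` lists are hypothetical illustrations and not data from the paper, and scoring a label with no true or predicted instances as 0 is one convention among several.

```python
from collections import Counter

# Role names adapted from the abstract; the paper's actual label set may differ.
ROLES = [
    "partial_progress",
    "proof_completion",
    "erroneous_reasoning",
    "error_identification",
]


def macro_f1(gold, pred, labels):
    """Return the unweighted mean of per-label F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[g] += 1  # g was the true label but missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Convention assumed here: F1 is 0 when a label never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy, made-up labels: a classifier that always predicts the majority role.
gold = ["partial_progress"] * 7 + [
    "proof_completion",
    "erroneous_reasoning",
    "error_identification",
]
pred = ["partial_progress"] * 10

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1(gold, pred, ROLES):.2f}")
```

Because every role contributes equally to the average, a classifier that only recognizes the most frequent role scores well on accuracy but poorly on macro-F1, which is one plausible reason to prefer the metric when rare roles such as error identification matter.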
