Fugu-MT Paper Translation (Abstract): WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning

Paper overview: WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning

arxiv url: http://arxiv.org/abs/2509.23219v1
Date: Sat, 27 Sep 2025 09:58:03 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.108076
Title: WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning
Title (reference translation): WirelessMathLM: Mathematical Reasoning for LLMs in Wireless Communications via Reinforcement Learning
Authors: Xin Li, Mengbing Liu, Yiyang Zhu, Wenhe Zhang, Li Wei, Jiancheng An, Chau Yuen
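Abstract summary: Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, problems require precise manipulation of information-theoretic bounds, yet even state-of-the-art models struggle to achieve competent performance. This paper shows that compact models (0.5B-7B parameters) can match or exceed larger models through domain-specific reinforcement learning.
Reference score (independently computed attention metric): 51.13280433665446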
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems possess a unique property--verifiable correctness--that enables effective reinforcement learning without human feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without supervised warm-start. Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training nearly doubles performance across all model scales (0.5B +11%, 3B +103%, 7B +81%), with positive transfer to general mathematics benchmarks--our models gain +8.4 points on average across MATH, Minerva-Math, OlympiadBench, AMC, and AIME without any training on these tasks.
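Abstract (reference translation): Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance. This paper shows that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems possess a unique property, verifiable correctness, that enables effective reinforcement learning without human feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without a supervised warm-start. Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times fewer parameters than DeepSeek-R1 (671B, 57.4%). GRPO training nearly doubles performance across all model scales (0.5B +11%, 3B +103%, 7B +81%), with positive transfer to general mathematics benchmarks.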
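The abstract names GRPO with binary verification rewards but gives no implementation detail. The sketch below shows only the commonly described group-relative advantage step: several answers are sampled for one problem, each is scored 0 or 1 by a verifier, and each reward is normalized against its own group's mean and standard deviation. The verifier (`binary_reward`, a whitespace- and case-insensitive string match), the sample answers, and the `eps` constant are illustrative assumptions, not the authors' code.

```python
from statistics import mean, pstdev


def binary_reward(predicted: str, reference: str) -> float:
    """Hypothetical verifier: 1.0 on an exact match after normalization, else 0.0."""
    def normalize(text: str) -> str:
        # Drop all whitespace and ignore case (an assumption for this sketch).
        return "".join(text.split()).lower()

    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one (made-up) channel-capacity question.
reference = "C = B log2(1 + SNR)"
samples = [
    "C = B log2(1 + SNR)",
    "C = B log2(SNR)",
    "c = b log2(1+snr)",
    "C = log2(1 + SNR)",
]

rewards = [binary_reward(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)

for sample, reward, advantage in zip(samples, rewards, advantages):
    print(f"{sample!r}: reward={reward}, advantage={advantage:+.3f}")
```

Because the baseline is the group mean, correct samples receive positive advantages and incorrect ones negative, with no learned value model; when all samples in a group agree, every advantage is zero and that problem contributes no gradient signal.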

Related papers