Fugu-MT Paper Translation (Abstract): GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

Paper abstract: GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

arxiv url: http://arxiv.org/abs/2605.07053v1
Date: Fri, 08 May 2026 00:02:39 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.684381
Title: GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
Title (reference translation): GSM-SEM: A Benchmark and Framework for Generating Semantically Variant Augmentations
Authors: Jyotika Singh, Fang Tu, Aziza Mirzadova, Amit Agarwal, Hitesh Laxmichand Patel, Sandip Ghoshal, Miguel Ballesteros, Yassine Benajiba, Weiyi Sun, Graham Horwood, Sujith Ravi, Dan Roth
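Abstract summary: GSM-SEM is a reusable framework for generating semantically diverse benchmark variants. Applied to GSM8K and two existing variation suites, it produces GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. The three SEM variants are released as fully human-validated datasets.
Reference score (independently computed attention score): 36.78194119255125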
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.
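Abstract (reference translation): Benchmarks such as GSM8K are widely used measures of mathematical reasoning, but leaderboard gains can overstate true capability because fixed test sets get memorized. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. GSM-SEM is a reusable, stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. It perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering the underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering memorization bias. Applying GSM-SEM to GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus) produces GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. An evaluation of 14 SOTA LLMs shows consistent performance drops, with larger declines when semantic perturbations are combined with symbolic/plus variations (average drop rate of 28% in the maximum-strictness configuration of GSM-SEM). The three SEM variants are publicly released as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, GSM-SEM is applied to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.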
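The core idea in the abstract (change the scenario's facts on each run while keeping the calculation and answer fixed) can be illustrated with a toy sketch. This is not the authors' pipeline: the templates, the `Problem` class, and `make_variant` are hypothetical names invented for illustration, and a hand-written template pool stands in for whatever generation and validation steps the paper actually uses.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    text: str
    answer: int


# Hypothetical scenarios (not from the paper). Each one changes the entities
# and relationships, but all share the same calculation: a * b * c.
TEMPLATES = [
    "A baker sells {a} loaves a day for {b} days at ${c} per loaf. "
    "How many dollars does the baker earn?",
    "A library buys {b} crates holding {a} books each, and each book costs ${c}. "
    "How many dollars does the library spend?",
    "A cyclist rides {a} km on each of {b} days and raises ${c} per km. "
    "How many dollars does the cyclist raise?",
]


def make_variant(rng: random.Random, a: int = 12, b: int = 5, c: int = 3) -> Problem:
    """Draw a random scenario; the numbers and the expected answer stay fixed."""
    template = rng.choice(TEMPLATES)
    return Problem(text=template.format(a=a, b=b, c=c), answer=a * b * c)


if __name__ == "__main__":
    rng = random.Random()  # unseeded: each run may produce a different variant
    variant = make_variant(rng)
    print(variant.text)
    print("expected answer:", variant.answer)
```

Passing a seeded `random.Random` makes a run reproducible, while an unseeded one gives a fresh variant each time; because the answer is computed from the same numbers in every scenario, no re-annotation is needed in this toy setting.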
