Fugu-MT Paper Translation (Summary): Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models

Paper summary: Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models

arxiv url: http://arxiv.org/abs/2506.13681v2
Date: Thu, 19 Jun 2025 04:41:52 GMT
Status: Translation complete
Last updated in system: 2025-06-23 12:57:34.502085
Title: Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models
Title (reference translation): Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models
Authors: Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch
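Abstract summary: This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper's four lines of evidence. It concludes that the evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity.
Reference score (independently calculated attention score): 4.76181732448268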
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sampling from language models impacts the quality and diversity of outputs, affecting both research and real-world applications. Recently, Nguyen et al. 2024's "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" introduced a new sampler called min-p, claiming it achieves superior quality and diversity over established samplers such as basic, top-k, and top-p sampling. The significance of these claims was underscored by the paper's recognition as the 18th highest-scoring submission to ICLR 2025 and selection for an Oral presentation. This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper's four lines of evidence. First, the original paper's human evaluations omitted data, conducted statistical tests incorrectly, and described qualitative feedback inaccurately; our reanalysis demonstrates min-p did not outperform baselines in quality, diversity, or a trade-off between quality and diversity; in response to our findings, the authors of the original paper conducted a new human evaluation using a different implementation, task, and rubric that nevertheless provides further evidence min-p does not improve over baselines. Second, comprehensively sweeping the original paper's NLP benchmarks reveals min-p does not surpass baselines when controlling for the number of hyperparameters. Third, the original paper's LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported. Fourth, community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be unsubstantiated, leading to their removal; the revised adoption claim remains misleading. We conclude that evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity.
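Abstract (reference translation): Sampling from language models affects the quality and diversity of outputs, with consequences for both research and real-world applications. Nguyen et al. 2024's "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" introduced a new sampler called min-p. The significance of its claims was underscored by the paper's recognition as the 18th highest-scoring submission to ICLR 2025 and its selection for an Oral presentation. This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper's four lines of evidence. First, the original paper's human evaluations omitted data, conducted statistical tests incorrectly, and described qualitative feedback inaccurately; our reanalysis shows that min-p did not outperform baselines in quality, diversity, or a trade-off between quality and diversity. Second, comprehensively sweeping the original paper's NLP benchmarks reveals that min-p does not surpass baselines when controlling for the number of hyperparameters. Third, the original paper's LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported. Fourth, the community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be unsubstantiated and were removed. We conclude that the evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity.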

Related papers

Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark [27.134554623769898]
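The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). We identify critical benchmark-quality issues that hinder fair and consistent quantitative evaluation.
Paper reference translation (metadata) (2025-07-17T17:33:11Z)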
Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs [3.631341123338476]
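Large language models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. This paper proposes min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model's confidence by using the top token's probability as a scaling factor.
Paper reference translation (metadata) (2024-07-01T08:37:25Z)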
Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
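We propose a new evaluation framework for large language models (LLMs) such as Llama-2 and Mistral. This approach allows a nuanced assessment of the quality and diversity of generated text without requiring aligned corpora.
Paper reference translation (metadata) (2024-02-16T13:53:26Z)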
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
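Eval-Instruct can obtain pointwise critiques with pseudo-references and revise these critiques through multi-path prompting. CritiqueLLM is empirically shown to outperform ChatGPT and all open-source baselines.
Paper reference translation (metadata) (2023-11-30T16:52:42Z)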
With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector [4.636982694364995]
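This work reproduces the results of the human evaluation experiment presented in the paper by Vamvas and Sennrich (2022), which evaluated an automatic system for detecting overtranslation and undertranslation. Although the documentation and code provided by the authors are of high quality, we discuss several problems we found in reproducing the exact experimental setup and offer recommendations for improvement.
Paper reference translation (metadata) (2023-08-12T11:00:59Z)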
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
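13% of papers had (i) sufficiently low barriers to reproduction and (ii) enough obtainable information to be considered for reproduction. As a result, we had to change our coordinated study design from a reproduce approach to a standardize-then-reproduce approach.
Paper reference translation (metadata) (2023-05-02T17:46:12Z)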
Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
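We therefore propose CLIPBERTScore. We show that a simple zero-shot combination of its two component metrics achieves higher correlation than existing factuality metrics for document summarization. Our analysis demonstrates the reliability and high correlation of CLIPBERTScore and its components.
Paper reference translation (metadata) (2022-11-04T16:50:40Z)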
Inconsistency in Conference Peer Review: Revisiting the 2014 NeurIPS Experiment [26.30237757653724]
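We revisit the 2014 NeurIPS experiment that examined inconsistency in conference peer review. For accepted papers, we find no correlation between quality scores and the impact of the paper.
Paper reference translation (metadata) (2021-09-20T18:06:22Z)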
On the Relation between Quality-Diversity Evaluation and Distribution-Fitting Goal in Text Generation [86.11292297348622]
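This work shows that a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution. We propose CR/NRR as a substitute for quality/diversity metric pairs.
Paper reference translation (metadata) (2020-07-03T04:06:59Z)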
SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
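This work is a proof-of-concept study of a weakly supervised approach to summary evaluation that does not require reference summaries. The large amounts of data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries.
Paper reference translation (metadata) (2020-05-13T15:40:13Z)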
Self-Adversarial Learning with Comparative Discrimination for Text Generation [111.18614166615968]
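This paper proposes a novel self-adversarial learning (SAL) paradigm for improving the performance of GANs in text generation. During training, SAL rewards the generator when its currently generated sentence is judged to be better than its previously generated samples. Experiments on text generation benchmark datasets show that the proposed approach substantially improves both quality and diversity.
Paper reference translation (metadata) (2020-01-31T07:50:25Z)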
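
The "Turning Up the Heat" entry in the list above describes min-p as a dynamic truncation rule that scales its threshold by the top token's probability. The following is a minimal Python sketch of that idea, not the original paper's implementation. It assumes the threshold is a base value multiplied by the largest probability, and that the surviving probabilities are renormalized before sampling. The names `min_p_filter`, `sample_min_p`, and `p_base` are illustrative only.

```python
import random


def min_p_filter(probs, p_base=0.1):
    """Zero out tokens below a confidence-scaled threshold, then renormalize.

    Assumption: threshold = p_base * (probability of the most likely token).
    """
    threshold = p_base * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 because the top token survives
    return [p / total for p in kept]


def sample_min_p(probs, p_base=0.1, rng=random):
    """Draw one token index from the truncated, renormalized distribution."""
    filtered = min_p_filter(probs, p_base)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


if __name__ == "__main__":
    peaked = [0.6, 0.25, 0.1, 0.04, 0.01]
    flat = [0.25, 0.25, 0.25, 0.25]
    print(min_p_filter(peaked))  # the two least likely tokens are removed
    print(min_p_filter(flat))    # every token survives
    print(sample_min_p(peaked, rng=random.Random(0)))
```

With the peaked distribution and `p_base=0.1`, the threshold is 0.06, so the tokens with probability 0.04 and 0.01 are dropped. With the flat distribution the threshold falls to 0.025 and nothing is removed, which is the sense in which the cutoff tracks the model's confidence.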

The related papers list is generated automatically from the titles and abstracts of papers on this site.

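This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).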