Fugu-MT Paper Translation (Abstract): UQ: Assessing Language Models on Unsolved Questions

Paper abstract: UQ: Assessing Language Models on Unsolved Questions

arxiv url: http://arxiv.org/abs/2508.17580v1
Date: Mon, 25 Aug 2025 01:07:59 GMT
Status: translation complete
System update date: 2025-08-26 18:43:45.595097
Title: UQ: Assessing Language Models on Unsolved Questions
Title (reference translation): UQ: Evaluating Language Models on Unsolved Questions
Authors: Fan Nie, Ken Ziyu Liu, Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, Huaxiu Yao, Linjun Zhang, Andrew Y. Ng, James Zou, Sanmi Koyejo, Yejin Choi, Percy Liang, Niklas Muennighoff
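Abstract summary: We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange. Unsolved questions are often hard and arise naturally when humans seek answers. The top model passes UQ-validation on 15% of questions, and preliminary human verification has already identified correct answers.
Reference score (independently computed attention score): 149.46593270027697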
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at https://uq.stanford.edu.
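Abstract (reference translation): Benchmarks shape progress in AI research. Questions should challenge frontier models while also reflecting real-world usage. Exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark, we curate unsolved questions and evaluate models asynchronously through validator-assisted screening and community verification. UQ spans topics from CS theory and math to sci-fi and history, and probes capabilities including reasoning, factuality, and browsing. Unsolved questions are often hard and arise naturally when humans seek answers. The contributions are UQ-Dataset and its collection pipeline, which combines rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions; and UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. UQ is released at https://uq.stanford.edu.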


Related papers