Fugu-MT Paper Translation (Abstract): Benchmarking World-Model Learning

Paper overview: Benchmarking World-Model Learning

arxiv url: http://arxiv.org/abs/2510.19788v1
Date: Wed, 22 Oct 2025 17:23:18 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:16.226047
Title: Benchmarking World-Model Learning
Title (Japanese reference translation): ワールドモデル学習のベンチマーク
Authors: Archana Warrier, Dat Nyugen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cambridge Yang, Joshua B. Tenenbaum, Sebastian Vollmer, Kevin Ellis, Zenna Tavares
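Abstract summary: We propose WorldTest, a protocol for evaluating model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. On AutumnBench, we compared 517 human participants and three frontier models. Humans outperform the models, and scaling compute improves performance in some environments but not in others.
Reference score (attention measure computed by this site): 47.19639484216151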
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended (models should support many different tasks unknown ahead of time) and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and scaling compute improves performance only in some environments but not others. WorldTest provides a novel template (reward-free exploration, derived tests, and behavior-based scoring) to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.
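Abstract (reference translation): Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. In current methods, training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol for evaluating model-learning agents that separates reward-free interaction from a scored test phase. WorldTest is open-ended (models should support many different tasks unknown ahead of time) and agnostic to model representation. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families. We compared 517 human participants and three frontier models on AutumnBench. Humans outperform the models, and scaling compute improves performance only in some environments and not in others. WorldTest provides a novel template (reward-free exploration, derived tests, and behavior-based scoring) to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.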
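
The two-phase structure described in the abstract (reward-free interaction, then a scored test in a different but related environment) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the `Corridor` environment, the `TabularAgent`, and the `world_test` function are hypothetical names and are not AutumnBench's environments, tasks, or API. The example uses a single planning-style task and scores only the agent's behavior, not the form of its internal model.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Corridor:
    """Hypothetical toy environment: a 1-D corridor with hidden action effects."""
    length: int = 5
    effects: tuple = (+1, -1)  # hidden dynamics: action index -> displacement
    pos: int = 0

    def reset(self, pos=0):
        self.pos = pos
        return self.pos

    def step(self, action):
        # No reward is returned: interaction is reward-free.
        self.pos = min(max(self.pos + self.effects[action], 0), self.length - 1)
        return self.pos


@dataclass
class TabularAgent:
    """Learns an action -> displacement table purely from observed transitions."""
    model: dict = field(default_factory=dict)

    def explore(self, env, steps, rng):
        state = env.reset()
        for _ in range(steps):
            action = rng.randrange(2)
            nxt = env.step(action)
            if nxt != state:  # skip moves blocked by a wall
                self.model[action] = nxt - state
            state = nxt

    def plan(self, start, goal):
        # Repeat the action whose learned displacement points toward the goal.
        direction = 1 if goal > start else -1
        action = next(a for a, d in self.model.items() if d == direction)
        return [action] * abs(goal - start)


def world_test(agent, base_env, derived_env, start, goal, rng):
    # Phase 1: reward-free interaction in the base environment.
    agent.explore(base_env, steps=50, rng=rng)
    # Phase 2: scored test in a related environment (same dynamics, longer
    # corridor, and a goal the agent never saw during exploration).
    derived_env.reset(start)
    for action in agent.plan(start, goal):
        derived_env.step(action)
    # Behavior-based scoring: only the outcome is judged.
    return 1.0 if derived_env.pos == goal else 0.0


hidden = (-1, +1)  # assumed dynamics: action 0 moves left, action 1 moves right
score = world_test(TabularAgent(), Corridor(5, hidden), Corridor(8, hidden),
                   start=1, goal=6, rng=random.Random(0))
print(score)
```

Because the score depends only on whether the goal is reached, any agent exposing `explore` and `plan` could be swapped in, which loosely mirrors the representation-agnostic comparison the abstract describes.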

Related papers

MMEarth-Bench: Global Model Adaptation via Multimodal Test-Time Training [15.675086189757769]
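MMEarth-Bench is a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models and show that (multimodal) pretraining tends to improve model robustness in limited-data settings, but that geographic generalization remains poor. We propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR), which uses all modalities available at test time as auxiliary tasks.
Paper reference translation (metadata) (2026-02-06T00:48:19Z)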
Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback [51.22403664895878]
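Agent2World is a tool-augmented multi-agent framework that achieves strong inference-time world-model generation. By grounding generation in multi-agent feedback, it also serves as a data engine for supervised fine-tuning.
Paper reference translation (metadata) (2025-12-26T18:54:14Z)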
Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models [37.774994737939394]
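We use a dynamics model to bootstrap a world model through synthetic data and inference-time verification. According to GPT4o-as-judge, our best model performs competitively with state-of-the-art image editing models and improves on them by a margin of 15% on real-world subsets.
Paper reference translation (metadata) (2025-06-06T11:50:18Z)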
WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning [52.36434784963598]
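We introduce WorldPrediction, a video-based benchmark for evaluating the world modeling and procedural planning capabilities of different AI models. Current frontier models achieve only 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP, whereas humans can solve both tasks perfectly.
Paper reference translation (metadata) (2025-06-04T18:22:40Z)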
Exploration-Driven Generative Interactive Environments [53.05314852577144]
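We focus on using many virtual environments to obtain low-cost, automatically collected interaction data. We propose a training framework that uses only random agents in virtual environments. Because our agent does not depend on environment-specific rewards, it can easily adapt to new environments.
Paper reference translation (metadata) (2025-04-03T12:01:41Z)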
Can foundation models actively gather information in interactive environments to test hypotheses? [43.42688356541211]
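Foundation models excel at single-turn reasoning but struggle with multi-turn exploration in dynamic environments. We evaluated these models on their ability to learn from experience, adapt, and gather information.
Paper reference translation (metadata) (2024-12-09T12:27:21Z)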
STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning [82.03481509373037]
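Recently, model-based reinforcement learning algorithms have demonstrated remarkable efficacy in visual input environments. We introduce the Stochastic Transformer-based wORld Model (STORM), an efficient world model architecture that combines strong modeling and generation capabilities. STORM achieves a mean human performance of 126.7% on the Atari 100k benchmark, setting a new record among state-of-the-art methods.
Paper reference translation (metadata) (2023-10-14T16:42:02Z)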
A Control-Centric Benchmark for Video Prediction [69.22614362800692]
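We propose a benchmark for action-conditioned video prediction in the form of a control benchmark. Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions. We then use the benchmark to study the effects of scaling model size, the amount of training data, and model ensembling.
Paper reference translation (metadata) (2023-04-26T17:59:45Z)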

The related papers list is generated automatically from the titles and abstracts of papers on this site.

The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from its use.