Fugu-MT Paper Translation (Abstract): LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning

Paper overview: LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning

arxiv url: http://arxiv.org/abs/2506.13841v1
Date: Mon, 16 Jun 2025 16:23:56 GMT
Status: Translation complete
Last updated in system: 2025-06-18 17:34:59.186441
Title: LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning
Title (reference translation): LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning
Authors: Miho Koda, Yu Zheng, Ruixian Ma, Mingyang Sun, Devesh Pansare, Fabio Duarte, Paolo Santi
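Abstract summary: We introduce LocationReasoner, a benchmark designed to evaluate the reasoning abilities of large language models in the context of real-world site selection. The benchmark comprises over 300 carefully crafted queries of varying difficulty, supported by in-house tools for constraint-based location search. Extensive evaluations show that state-of-the-art reasoning models offer limited improvement over their non-reasoning predecessors in real-world contexts.
Reference score (attention score computed by this site): 12.265350534588817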
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs), particularly those enhanced through reinforced post-training, have demonstrated impressive reasoning capabilities, as exemplified by models such as OpenAI o1 and DeepSeek-R1. However, these capabilities are predominantly benchmarked on domains like mathematical problem solving and code generation -- leaving open the question of whether such reasoning skills generalize to complex, real-world scenarios. In this paper, we introduce LocationReasoner, a benchmark designed to evaluate LLMs' reasoning abilities in the context of real-world site selection, where models must identify feasible locations by reasoning over diverse and complicated spatial, environmental, and logistical constraints. The benchmark comprises over 300 carefully crafted queries of varying difficulty levels, supported by a sandbox environment with in-house tools for constraint-based location search. Extensive evaluations reveal that state-of-the-art reasoning models offer limited improvement over their non-reasoning predecessors in real-world contexts, with even the latest OpenAI o4 model failing on 30% of site selection tasks. Moreover, agentic strategies such as ReAct and Reflexion often suffer from over-reasoning, leading to worse outcomes than direct code-generation prompting. With key limitations of LLMs in holistic and non-linear reasoning highlighted, we release LocationReasoner to foster the development of LLMs and agents capable of robust, grounded reasoning in real-world decision-making tasks. Codes and data for our benchmark are available at https://github.com/miho-koda/LocationReasoner.
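Abstract (reference translation): Recent advances in large language models (LLMs), particularly those enhanced through reinforced post-training, have shown impressive reasoning capabilities, as exemplified by models such as OpenAI o1 and DeepSeek-R1. However, these capabilities are benchmarked mainly on domains such as mathematical problem solving and code generation, which leaves open the question of whether such reasoning skills generalize to complex real-world scenarios. In this paper, we introduce LocationReasoner, a benchmark for evaluating LLMs' reasoning abilities in the context of real-world site selection. The benchmark comprises over 300 carefully crafted queries of varying difficulty levels, supported by a sandbox environment with in-house tools for constraint-based location search. Even the latest OpenAI o4 model fails on 30% of site selection tasks. Moreover, agentic strategies such as ReAct and Reflexion often suffer from over-reasoning, leading to worse outcomes than direct code-generation prompting. With key limitations of LLMs in holistic and non-linear reasoning highlighted, we release LocationReasoner to foster the development of LLMs and agents capable of robust, grounded reasoning in real-world decision-making tasks. Code and data for the benchmark are available at https://github.com/miho-koda/LocationReasoner.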
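To make "constraint-based location search" concrete, the following is a minimal Python sketch of filtering candidate sites against a conjunction of constraints. It is an illustration only and is not taken from the LocationReasoner repository: the `Site` fields, the `feasible_sites` helper, the candidate data, and the thresholds are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Site:
    """Hypothetical candidate location; the fields are illustrative assumptions."""
    name: str
    rent_usd: float
    km_to_transit: float
    competitors_within_1km: int


# A constraint is any predicate over a site.
Constraint = Callable[[Site], bool]


def feasible_sites(sites: Iterable[Site], constraints: Iterable[Constraint]) -> list[Site]:
    """Return the sites that satisfy every constraint (logical AND)."""
    checks = list(constraints)  # materialize so the checks can be reused per site
    return [site for site in sites if all(check(site) for check in checks)]


# Made-up candidates, not benchmark data.
CANDIDATES = [
    Site("A", rent_usd=4200.0, km_to_transit=0.3, competitors_within_1km=5),
    Site("B", rent_usd=2800.0, km_to_transit=0.4, competitors_within_1km=1),
    Site("C", rent_usd=2500.0, km_to_transit=2.1, competitors_within_1km=0),
]

# Made-up thresholds: budget, transit access, and local competition.
CONSTRAINTS = [
    lambda s: s.rent_usd <= 3000.0,
    lambda s: s.km_to_transit <= 0.5,
    lambda s: s.competitors_within_1km <= 2,
]

if __name__ == "__main__":
    print([site.name for site in feasible_sites(CANDIDATES, CONSTRAINTS)])  # ['B']
```

In this toy setting only site B passes all three checks: A exceeds the rent limit and C is too far from transit. The benchmark's queries combine spatial, environmental, and logistical constraints that are considerably richer than these simple thresholds.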

Related papers

Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [75.26829371493189]
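Large language models (LLMs) have demonstrated impressive reasoning capabilities that mirror human-like thinking. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
Paper reference translation (metadata) (2025-06-03T09:01:08Z)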
General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
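Reinforcement learning (RL) has recently shown strong potential for improving the reasoning capabilities of large language models (LLMs). This paper proposes General-Reasoner, a new training paradigm aimed at improving LLM reasoning across many domains. We train a series of models and evaluate them on a wide range of datasets covering fields such as physics, chemistry, finance, and electronics.
Paper reference translation (metadata) (2025-05-20T17:41:33Z)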
Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges [4.668749313973097]
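This paper systematically evaluates Large Language Models (LLMs) and Large Reasoning Models (LRMs) across three levels of reasoning complexity. We curate 26 challenges that models answer either directly or via a Python code interpreter. LRMs show robust performance across tasks of varying difficulty, often rivaling traditional first-principles-based methods.
Paper reference translation (metadata) (2025-05-16T18:32:35Z)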
Evaluating Large Language Models for Real-World Engineering Tasks [75.97299249823972]
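This paper presents a curated database of over 100 questions drawn from production-oriented engineering scenarios. Using this dataset, we evaluate four state-of-the-art large language models (LLMs). The results show that LLMs are strong at temporal and structural reasoning but struggle considerably with abstract reasoning, formal modeling, and context-sensitive engineering logic.
Paper reference translation (metadata) (2025-05-12T14:05:23Z)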
LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems [7.379503137362718]
LR^2Bench is a new benchmark designed to evaluate long-chain reflective reasoning capabilities. The evaluation shows that even advanced LRMs such as DeepSeek-R1 and OpenAI o1-preview struggle with the tasks in LR^2Bench.
Paper reference translation (metadata) (2025-02-25T04:51:17Z)
Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
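We study extractive question answering (EQA) with large language models (LLMs) under domain drift. We devise a series of experiments to explain the performance gap empirically.
Paper reference translation (metadata) (2024-09-27T05:06:43Z)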
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
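Large language models (LLMs) have shown growing capability in problem solving and decision making. This paper proposes MR-Ben, a process-based benchmark that requires meta-reasoning skills. The meta-reasoning paradigm is particularly well suited to System-2 slow thinking.
Paper reference translation (metadata) (2024-06-20T03:50:23Z)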
DetermLR: Augmenting LLM-based Logical Reasoning from Indeterminacy to Determinacy [76.58614128865652]
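We propose DetermLR, a new perspective that rethinks the reasoning process as an evolution from indeterminacy to determinacy. First, we categorize known conditions into two types: determinate and indeterminate premises. This provides an overall direction for the reasoning process and guides LLMs in converting indeterminate data into progressively determinate insights. We also automate the storage and extraction of available premises and reasoning paths with a reasoning memory, preserving historical reasoning details for subsequent reasoning steps.
Paper reference translation (metadata) (2023-10-28T10:05:51Z)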

The related papers list is generated automatically from the titles and abstracts of papers on this site.

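This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).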