Fugu-MT Paper Translation (Summary): Go Beyond The Obvious: Probing the gap of INFORMAL reasoning ability between Humanity and LLMs by Detective Reasoning Puzzle Benchmark

Paper summary: Go Beyond The Obvious: Probing the gap of INFORMAL reasoning ability between Humanity and LLMs by Detective Reasoning Puzzle Benchmark

arxiv url: http://arxiv.org/abs/2307.05113v2
Date: Wed, 9 Aug 2023 12:08:46 GMT
Status: Translation complete
Last updated in system: 2023-08-10 17:10:00.268315
Title: Go Beyond The Obvious: Probing the gap of INFORMAL reasoning ability between Humanity and LLMs by Detective Reasoning Puzzle Benchmark
Title (reference translation): Go Beyond The Obvious: Probing the gap in informal reasoning ability between humans and LLMs with the Detective Reasoning Puzzle Benchmark
Authors: Zhouhong Gu, Zihan Li, Lin Zhang, Zhuozhi Xiong, Haoning Ye, Yikai Zhang, Wenhao Huang, Xiaoxuan Zhu, Qianyu He, Rui Xu, Sihang Jiang, Shusen Wang, Zili Wang, Hongwei Feng, Zhixu Li, Yanghua Xiao
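Abstract summary: This paper constructs the Detective Reasoning Benchmark, a collection of 1,200 questions gathered from accessible online resources. Because improvement of models' informal reasoning ability has been restricted by the lack of benchmarks, the authors propose a Self-Question Prompt Framework that mimics human thinking. Experimental results show that human performance greatly outperforms SoTA language models on the Detective Reasoning Benchmark.
Reference score (independently calculated attention score): 32.52910329977459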
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Informal reasoning ability is the ability to reason based on common sense, experience, and intuition. Humans use informal reasoning every day to extract the most influential elements for their decision-making from a large amount of life-like information. With the rapid development of language models, the realization of general artificial intelligence has emerged with hope. Given the outstanding informal reasoning ability of humans, how much informal reasoning ability language models have has not been well studied by scholars. In order to explore the gap between humans and language models in informal reasoning ability, this paper constructs a Detective Reasoning Benchmark, which is an assembly of 1,200 questions gathered from accessible online resources, aims at evaluating the model's informal reasoning ability in real-life context. Considering the improvement of the model's informal reasoning ability restricted by the lack of benchmark, we further propose a Self-Question Prompt Framework that mimics human thinking to enhance the model's informal reasoning ability. The goals of self-question are to find key elements, deeply investigate the connections between these elements, encourage the relationship between each element and the problem, and finally, require the model to reasonably answer the problem. The experimental results show that human performance greatly outperforms the SoTA Language Models in Detective Reasoning Benchmark. Besides, Self-Question is proven to be the most effective prompt engineering in improving GPT-4's informal reasoning ability, but it still does not even surpass the lowest score made by human participants. Upon acceptance of the paper, the source code for the benchmark will be made publicly accessible.
Abstract (reference translation): Informal reasoning ability is the ability to reason from common sense, experience, and intuition. Humans use informal reasoning every day to extract the elements most influential to their decisions from a large amount of life-like information. With the rapid development of language models, the realization of general artificial intelligence is anticipated.
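The four self-question goals listed in the abstract (find key elements, examine the connections between them, relate each element to the problem, then answer) can be sketched as a simple prompt builder. This is only an illustrative sketch: the step wording, the `SELF_QUESTION_STEPS` list, and the `build_self_question_prompt` function are assumptions made here, not the authors' actual prompt or code.

```python
from typing import List

# Hypothetical step wording: paraphrases the four goals named in the abstract.
# It is NOT the authors' actual prompt text.
SELF_QUESTION_STEPS: List[str] = [
    "What are the key elements in the story?",
    "How are these key elements connected to one another?",
    "How does each key element relate to the question?",
    "Based on the analysis above, what is the most reasonable answer?",
]


def build_self_question_prompt(context: str, question: str) -> str:
    """Assemble one prompt that walks a model through the four steps in order."""
    lines = [
        f"Context: {context.strip()}",
        f"Question: {question.strip()}",
        "",
        "Answer the following sub-questions in order:",
    ]
    # Number the steps so the model's answers can be matched back to them.
    for i, step in enumerate(SELF_QUESTION_STEPS, start=1):
        lines.append(f"{i}. {step}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_self_question_prompt(
        "The window was broken, but the glass lay outside the house.",
        "Was the break-in staged?",
    ))
```

The resulting string would be sent to a language model as a single prompt; splitting the steps into separate model calls is an equally plausible design that the abstract does not rule out.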

Related papers

Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis [3.711555701154055]
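The emergence of reasoning models and their integration into practical AI chatbots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems. However, a complete understanding of why these models hallucinate more than general-purpose language models is still missing. This study systematically investigates the reasoning failures of contemporary language models on multi-hop question answering tasks.
Paper reference translation (metadata) (2025-08-06T17:58:36Z)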
Chain of Questions: Guiding Multimodal Curiosity in Language Models [2.0180882714261568]
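Chain of Questions (CoQ) is a curiosity-driven reasoning approach in which a multimodal language model generates targeted questions about its surroundings. We evaluate our framework on a new multimodal benchmark dataset that combines the WebGPT, ScienceQA, AVSD, and ScanQA datasets.
Paper reference translation (metadata) (2025-08-06T11:42:54Z)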
UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations [71.79210031338464]
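We show how to unify dense retrieval and response generation for large language models in conversation. We perform joint fine-tuning with different objectives and design two mechanisms to reduce the risk of inconsistency. Evaluation on five conversational search datasets shows that our unified model improves both tasks mutually and outperforms existing baselines.
Paper reference translation (metadata) (2025-07-09T17:02:40Z)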
Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
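We show that the pointwise mutual information between a context and a question is an effective gauge of language model performance. We propose two methods that use the pointwise mutual information between documents and a question.
Paper reference translation (metadata) (2024-11-12T13:14:09Z)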
Towards Interpreting Language Models: A Case Study in Multi-Hop Reasoning [0.0]
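Language models (LMs) struggle to perform multi-hop reasoning consistently. We propose an approach that pinpoints and corrects multi-hop reasoning failures through targeted memory injections on LM attention heads.
Paper reference translation (metadata) (2024-11-06T16:30:26Z)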
Claim Detection for Automated Fact-checking: A Survey on Monolingual, Multilingual and Cross-Lingual Research [7.242609314791262]
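This paper categorizes current multilingual claim detection research by three key factors of the problem: verifiability, priority, and similarity. It gives an overview of existing multilingual datasets and their challenges, and suggests possible future directions.
Paper reference translation (metadata) (2024-01-22T14:17:03Z)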
Teaching Smaller Language Models To Generalise To Unseen Compositional Questions [6.9076450524134145]
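To instill diverse reasoning abilities, we propose a combination of multitask pretraining on up to 93 tasks. We show that adding retrieval-augmented training datasets improves performance substantially.
Paper reference translation (metadata) (2023-08-02T05:00:12Z)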
Out-of-Domain Intent Detection Considering Multi-Turn Dialogue Contexts [91.43701971416213]
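We propose a context-aware OOD intent detection (Caro) framework to model multi-turn contexts in OOD intent detection tasks. Caro establishes state-of-the-art performance on multi-turn OOD detection tasks by improving the F1-OOD score by more than 29%.
Paper reference translation (metadata) (2023-05-05T01:39:21Z)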
Probing via Prompting [71.7904179689271]
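We propose a novel model-free probing approach by formulating probing as a prompting task. We conduct experiments on five probing tasks and show that our approach is better at extracting information than diagnostic probes. We then examine the usefulness of specific linguistic properties for pretraining by removing the heads essential to a property and evaluating the resulting model's language modeling performance.
Paper reference translation (metadata) (2022-07-04T22:14:40Z)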
Reinforcement Guided Multi-Task Learning Framework for Low-Resource Stereotype Detection [3.7223111129285096]
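"Stereotype detection" datasets mainly adopt a diagnostic approach toward large pretrained language models. Annotating a reliable dataset requires a precise understanding of the subtle nuances of how stereotypes manifest in text. We propose a multi-task model that leverages a plethora of data-rich neighboring tasks to improve empirical performance on "stereotype detection".
Paper reference translation (metadata) (2022-03-27T17:16:11Z)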
Fact-driven Logical Reasoning for Machine Reading Comprehension [82.58857437343974]
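We are motivated to cover both commonsense and temporary knowledge clues hierarchically. Specifically, we propose a general formalism of knowledge units by extracting the backbone constituents of sentences. We then construct a supergraph on top of the fact units, gaining the benefit of both sentence-level (relations among fact groups) and entity-level interactions.
Paper reference translation (metadata) (2021-05-21T13:11:13Z)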
Probing Task-Oriented Dialogue Representation from Language Models [106.02947285212132]
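We investigate pretrained language models to find out which model intrinsically carries the most informative representations for task-oriented dialogue tasks. We fine-tune a feed-forward layer as a classifier probe on top of a fixed pretrained language model, using annotated labels in a supervised way.
Paper reference translation (metadata) (2020-10-26T21:34:39Z)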
Knowledgeable Dialogue Reading Comprehension on Key Turns [84.1784903043884]
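Multi-choice machine reading comprehension (MRC) requires a model to choose the correct answer from candidate options given a passage and a question. This work focuses on dialogue-based MRC, where the passages are multi-turn dialogues. It faces two challenges: answer selection is made without the support of useful commonsense knowledge, and the multi-turn context may hide considerable irrelevant information.
Paper reference translation (metadata) (2020-04-29T07:04:43Z)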
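
The pointwise mutual information named in the retrieval-augmented generation entry above has a standard definition: PMI(c; q) = log p(q | c) - log p(q), which is positive when the context c makes the question q more likely. The sketch below only evaluates that textbook formula from two log-probabilities. The function name and the example probabilities are assumptions for illustration, and it does not reproduce the two methods proposed in that paper.

```python
import math


def pmi_from_log_probs(log_p_q_given_c: float, log_p_q: float) -> float:
    """Standard PMI between a context c and a question q, in nats.

    log_p_q_given_c: log p(q | c), e.g. a language model's log-likelihood
                     of the question when the context is prepended.
    log_p_q:         log p(q), the log-likelihood of the question alone.
    """
    return log_p_q_given_c - log_p_q


if __name__ == "__main__":
    # Made-up probabilities for illustration only: the context raises the
    # question's probability from 0.125 to 0.5, so the PMI equals log(4).
    score = pmi_from_log_probs(math.log(0.5), math.log(0.125))
    print(f"PMI = {score:.4f} nats")
```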

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

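This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).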