Piecing Together Clues: A Benchmark for Evaluating the Detective Skills of Large Language Models
- URL: http://arxiv.org/abs/2307.05113v3
- Date: Wed, 20 Mar 2024 11:56:52 GMT
- Title: Piecing Together Clues: A Benchmark for Evaluating the Detective Skills of Large Language Models
- Authors: Zhouhong Gu, Lin Zhang, Jiangjie Chen, Haoning Ye, Xiaoxuan Zhu, Zihan Li, Zheyu Ye, Yan Gao, Yao Hu, Yanghua Xiao, Hongwei Feng
- Abstract summary: Detectives frequently engage in information detection and reasoning simultaneously when making decisions across various cases.
We introduce DetectBench, a reading comprehension dataset designed to assess a model's joint ability in key information detection and multi-hop reasoning.
To enhance models' detective skills, we propose the Detective Thinking Framework, which encourages models to identify all possible clues within the context before reasoning.
- Score: 44.42887452269389
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detectives frequently engage in information detection and reasoning simultaneously when making decisions across various cases, especially when confronted with a vast amount of information. With the rapid development of large language models (LLMs), evaluating how these models identify key information and reason to solve questions becomes increasingly relevant. We introduce DetectBench, a reading comprehension dataset designed to assess a model's joint ability in key information detection and multi-hop reasoning when facing complex and implicit information. DetectBench comprises 3,928 questions, each paired with a paragraph averaging 190 tokens in length. To enhance models' detective skills, we propose the Detective Thinking Framework, which encourages models to identify all possible clues within the context before reasoning. Our experiments reveal that existing models perform poorly in both information detection and multi-hop reasoning. However, the Detective Thinking Framework alleviates this issue.
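The clue-first idea described in the abstract can be illustrated with a small prompt builder. This is only a minimal sketch under stated assumptions: the function name `build_clue_first_prompt`, the three-step wording, and the example passage are all hypothetical and are not taken from the paper, whose actual prompts may differ.

```python
def build_clue_first_prompt(context: str, question: str) -> str:
    """Assemble a prompt that asks for clues before any reasoning.

    Hypothetical illustration only: the step wording below is an
    assumption, not the Detective Thinking Framework's real prompt.
    """
    return (
        "Context:\n" + context.strip() + "\n\n"
        "Question: " + question.strip() + "\n\n"
        "Step 1: List every detail in the context that could be a clue.\n"
        "Step 2: Connect the clues step by step to reason toward an answer.\n"
        "Step 3: State the final answer."
    )


# Made-up example passage and question, for demonstration only.
prompt = build_clue_first_prompt(
    "The window was open and the floor was wet.",
    "How did the intruder enter the room?",
)
print(prompt)
```

The resulting string can be passed to any language model; the only design point is that clue listing is requested before the reasoning step.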
Related papers
- Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis [3.711555701154055]
Reasoning models and their integration into practical AI chat bots have led to breakthroughs in solving advanced math, deep search, and extractive question answering problems. Yet, a complete understanding of why these models hallucinate more than general purpose language models is missing. In this study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks.
arXiv Detail & Related papers (2025-08-06T17:58:36Z) - Chain of Questions: Guiding Multimodal Curiosity in Language Models [2.0180882714261568]
Chain of Questions (CoQ) is a curiosity-driven reasoning approach that encourages multimodal language models to generate targeted questions regarding their surroundings. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets.
arXiv Detail & Related papers (2025-08-06T11:42:54Z) - UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations [71.79210031338464]
We show how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.
arXiv Detail & Related papers (2025-07-09T17:02:40Z) - Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that the pointwise mutual information between a context and a question is an effective gauge for language model performance.
We propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z) - Towards Interpreting Language Models: A Case Study in Multi-Hop Reasoning [0.0]
Language models (LMs) struggle to perform multi-hop reasoning consistently.
We propose an approach to pinpoint and rectify multi-hop reasoning failures through targeted memory injections on LM attention heads.
arXiv Detail & Related papers (2024-11-06T16:30:26Z) - Claim Detection for Automated Fact-checking: A Survey on Monolingual, Multilingual and Cross-Lingual Research [7.242609314791262]
We present state-of-the-art multilingual claim detection research categorized into three key factors of the problem: verifiability, priority, and similarity.
We present a detailed overview of the existing multilingual datasets along with the challenges and suggest possible future advancements.
arXiv Detail & Related papers (2024-01-22T14:17:03Z) - Teaching Smaller Language Models To Generalise To Unseen Compositional Questions [6.9076450524134145]
We propose a combination of multitask pretraining on up to 93 tasks designed to instill diverse reasoning abilities.
We show that performance can be significantly improved by adding retrieval-augmented training datasets.
arXiv Detail & Related papers (2023-08-02T05:00:12Z) - Out-of-Domain Intent Detection Considering Multi-Turn Dialogue Contexts [91.43701971416213]
We introduce a context-aware OOD intent detection (Caro) framework to model multi-turn contexts in OOD intent detection tasks.
Caro establishes state-of-the-art performance on multi-turn OOD detection tasks, improving the F1-OOD score by over 29% compared to the previous best method.
arXiv Detail & Related papers (2023-05-05T01:39:21Z) - Probing via Prompting [71.7904179689271]
This paper introduces a novel model-free approach to probing, by formulating probing as a prompting task.
We conduct experiments on five probing tasks and show that our approach is comparable or better at extracting information than diagnostic probes.
We then examine the usefulness of a specific linguistic property for pre-training by removing the heads that are essential to that property and evaluating the resulting model's performance on language modeling.
arXiv Detail & Related papers (2022-07-04T22:14:40Z) - Reinforcement Guided Multi-Task Learning Framework for Low-Resource Stereotype Detection [3.7223111129285096]
"Stereotype Detection" datasets mainly adopt a diagnostic approach toward large Pre-trained Language Models.
Annotating a reliable dataset requires a precise understanding of the subtle nuances of how stereotypes manifest in text.
We present a multi-task model that leverages the abundance of data-rich neighboring tasks to improve the empirical performance on "Stereotype Detection".
arXiv Detail & Related papers (2022-03-27T17:16:11Z) - Fact-driven Logical Reasoning for Machine Reading Comprehension [82.58857437343974]
We are motivated to cover both commonsense and temporary knowledge clues hierarchically.
Specifically, we propose a general formalism of knowledge units by extracting backbone constituents of the sentence.
We then construct a supergraph on top of the fact units, allowing for the benefit of sentence-level (relations among fact groups) and entity-level interactions.
arXiv Detail & Related papers (2021-05-21T13:11:13Z) - Probing Task-Oriented Dialogue Representation from Language Models [106.02947285212132]
This paper investigates pre-trained language models to find out which model intrinsically carries the most informative representation for task-oriented dialogue tasks.
We fine-tune a feed-forward layer as the classifier probe on top of a fixed pre-trained language model with annotated labels in a supervised way.
arXiv Detail & Related papers (2020-10-26T21:34:39Z) - Knowledgeable Dialogue Reading Comprehension on Key Turns [84.1784903043884]
Multi-choice machine reading comprehension (MRC) requires models to choose the correct answer from candidate options given a passage and a question.
Our research focuses on dialogue-based MRC, where the passages are multi-turn dialogues.
It suffers from two challenges: the answer selection decision is made without support of latently helpful commonsense, and the multi-turn context may hide considerable irrelevant information.
arXiv Detail & Related papers (2020-04-29T07:04:43Z)