Related papers: WHODUNIT: Evaluation benchmark for culprit detection in mystery stories

WHODUNIT: Evaluation benchmark for culprit detection in mystery stories

URL: http://arxiv.org/abs/2502.07747v1
Date: Tue, 11 Feb 2025 18:14:44 GMT
Title: WHODUNIT: Evaluation benchmark for culprit detection in mystery stories
Authors: Kshitij Gupta,
Abstract summary: We present a novel data set, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLM)<n>The dataset challenges LLMs to identify the perpetrator after reading and comprehending the story.<n>We apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse.
Score: 3.42528786340268
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a novel data set, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLM) within narrative contexts. Constructed from open domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, evaluated through multiple trials with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. This dataset is publicly available here.

Related papers

Teaching Language Models To Gather Information Proactively [53.85419549904644]
Large language models (LLMs) are increasingly expected to function as collaborative partners.<n>In this work, we introduce a new task paradigm: proactive information gathering.<n>We design a scalable framework that generates partially specified, real-world tasks, masking key information.<n>Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
arXiv Detail & Related papers (2025-07-28T23:50:09Z)
Help Me Write a Story: Evaluating LLMs' Ability to Generate Writing Feedback [57.200668979963694]
We present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues.<n>We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics.
arXiv Detail & Related papers (2025-07-21T18:56:50Z)
Can Large Language Models Really Recognize Your Name? [23.4365383606717]
We show that modern large language models (LLMs) regularly overlook human names even in short text snippets due to ambiguous contexts.<n>We propose AMBENCH, a benchmark dataset of seemingly ambiguous human names.
arXiv Detail & Related papers (2025-05-20T16:05:05Z)
Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models [52.439289085318634]
We show how to identify training data known to proprietary large language models (LLMs) by using information-guided probes. Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes.
arXiv Detail & Related papers (2025-03-15T10:19:15Z)
Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information [0.0]
The purpose of this study is to assess how large language models (LLMs) can be used for fact-checking. We use AI auditing methodology that systematically evaluates performance of five LLMs. The results indicate that models are better at identifying false statements, especially on sensitive topics.
arXiv Detail & Related papers (2025-03-11T13:06:40Z)
Prompting and Fine-Tuning of Small LLMs for Length-Controllable Telephone Call Summarization [33.67670065326008]
This paper explores the rapid development of a telephone call summarization system utilizing large language models (LLMs) Our results show that fine-tuned Llama-2-7B-based summarization model performs on-par with GPT-4 in terms of factual accuracy, completeness and conciseness.
arXiv Detail & Related papers (2024-10-24T10:32:10Z)
Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions. Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review [8.606381080620789]
We investigate the ability of existing AI text detection algorithms to distinguish between peer reviews written by humans and different state-of-the-art LLMs.<n>Our analysis shows that existing approaches fail to identify many GPT-4o written reviews without also producing a high number of false positive classifications.<n>We propose a new detection approach which surpasses existing methods in the identification of GPT-4o written peer reviews at low levels of false positive classifications.
arXiv Detail & Related papers (2024-10-03T22:05:06Z)
Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Anonymizing text that contains sensitive information is crucial for a wide range of applications.<n>Existing techniques face the emerging challenges of the re-identification ability of large language models.<n>We propose a framework composed of three key components: a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
Benchmarking LLMs on the Semantic Overlap Summarization Task [9.656095701778975]
This paper comprehensively evaluates Large Language Models (LLMs) on the Semantic Overlap Summarization (SOS) task. We report well-established metrics like ROUGE, BERTscore, and SEM-F1$ on two different datasets of alternative narratives.
arXiv Detail & Related papers (2024-02-26T20:33:50Z)
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting. CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
RELIC: Investigating Large Language Model Responses using Self-Consistency [58.63436505595177]
Large Language Models (LLMs) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations. We propose an interactive system that helps users gain insight into the reliability of the generated text.
arXiv Detail & Related papers (2023-11-28T14:55:52Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs. FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation. We introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm. We collect responses generated from large language models and annotate factuality labels in a fine-grained manner. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z)
Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous of the claim that is supported by the context. Using a new human-annotated dataset, we finetune a model to generate Longest Supported Subsequence (LSS) Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [53.78782375511531]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks. This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR) To address concerns about data contamination of LLMs, we collect a new test set called NovelEval. To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics. We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores. Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.