CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge
- URL: http://arxiv.org/abs/2504.14462v1
- Date: Sun, 20 Apr 2025 02:47:18 GMT
- Title: CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge
- Authors: Armin Toroghi, Willis Guo, Scott Sanner,
- Abstract summary: We present a new dataset for Commonsense reasoning over Long-Tail entities (CoLoTa)<n>CoLoTa consists of 3,300 queries from question answering and claim verification tasks.<n>We propose CoLoTa as a novel benchmark for assessing both (i) LLM commonsense reasoning capabilities and their robustness to hallucinations on long-tail entities.
- Score: 19.34131843380852
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The rise of Large Language Models (LLMs) has redefined the AI landscape, particularly due to their ability to encode factual and commonsense knowledge, and their outstanding performance in tasks requiring reasoning. Despite these advances, hallucinations and reasoning errors remain a significant barrier to their deployment in high-stakes settings. In this work, we observe that even the most prominent LLMs, such as OpenAI-o1, suffer from high rates of reasoning errors and hallucinations on tasks requiring commonsense reasoning over obscure, long-tail entities. To investigate this limitation, we present a new dataset for Commonsense reasoning over Long-Tail entities (CoLoTa), that consists of 3,300 queries from question answering and claim verification tasks and covers a diverse range of commonsense reasoning skills. We remark that CoLoTa can also serve as a Knowledge Graph Question Answering (KGQA) dataset since the support of knowledge required to answer its queries is present in the Wikidata knowledge graph. However, as opposed to existing KGQA benchmarks that merely focus on factoid questions, our CoLoTa queries also require commonsense reasoning. Our experiments with strong LLM-based KGQA methodologies indicate their severe inability to answer queries involving commonsense reasoning. Hence, we propose CoLoTa as a novel benchmark for assessing both (i) LLM commonsense reasoning capabilities and their robustness to hallucinations on long-tail entities and (ii) the commonsense reasoning capabilities of KGQA methods.
Related papers
- Prompting Large Language Models with Knowledge Graphs for Question Answering Involving Long-tail Facts [50.06633829833144]
Large Language Models (LLMs) are effective in performing various NLP tasks, but struggle to handle tasks that require extensive, real-world knowledge.
We propose a benchmark that requires knowledge of long-tail facts for answering the involved questions.
Our experiments show that LLMs alone struggle with answering these questions, especially when the long-tail level is high or rich knowledge is required.
arXiv Detail & Related papers (2024-05-10T15:10:20Z) - Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present textscPuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales.
A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing fewer supersized data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z) - CR-LT-KGQA: A Knowledge Graph Question Answering Dataset Requiring
Commonsense Reasoning and Long-Tail Knowledge [21.73770363188049]
We create a novel Commonsense Reasoning (CR) and Long-Tail (LT) KGQA dataset with two subtasks -- question answering and claim verification.
While existing KGQA methods are not applicable due to their lack of commonsense inference support, baseline evaluation of LLMs on CR-LT KGQA demonstrate a high rate of hallucination.
arXiv Detail & Related papers (2024-03-03T04:47:01Z) - Right for Right Reasons: Large Language Models for Verifiable Commonsense Knowledge Graph Question Answering [18.48602809114524]
Knowledge Graph Question Answering (KGQA) methods seek to answer Natural Language questions using the relational information stored in Knowledge Graphs (KGs)<n>With the recent advancements of Large Language Models (LLMs) and their remarkable reasoning abilities, there is a growing trend to leverage them for KGQA.<n>We propose Right for Right Reasons (R3), a commonsense KGQA methodology that allows for a verifiable reasoning procedure.
arXiv Detail & Related papers (2024-03-03T04:22:13Z) - Automatic Question-Answer Generation for Long-Tail Knowledge [65.11554185687258]
We propose an automatic approach to generate specialized QA datasets for tail entities.
We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets.
arXiv Detail & Related papers (2024-03-03T03:06:31Z) - CLadder: Assessing Causal Reasoning in Language Models [82.8719238178569]
We investigate whether large language models (LLMs) can coherently reason about causality.
We propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al.
arXiv Detail & Related papers (2023-12-07T15:12:12Z) - Reasoning on Graphs: Faithful and Interpretable Large Language Model
Reasoning [104.92384929827776]
Large language models (LLMs) have demonstrated impressive reasoning abilities in complex tasks.
They lack up-to-date knowledge and experience hallucinations during reasoning.
Knowledge graphs (KGs) offer a reliable source of knowledge for reasoning.
arXiv Detail & Related papers (2023-10-02T10:14:43Z) - Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for
Knowledge-intensive Question Answering [17.672572064705445]
Large language models (LLMs) equipped with Chain-of-Thought (CoT) have shown impressive reasoning ability in various downstream tasks.
We propose a framework called Knowledge-Driven Chain-of-Thought (KD-CoT) to verify and modify reasoning traces in CoT via interaction with external knowledge.
arXiv Detail & Related papers (2023-08-25T09:23:55Z) - A Benchmark for Generalizable and Interpretable Temporal Question
Answering over Knowledge Bases [67.33560134350427]
TempQA-WD is a benchmark dataset for temporal reasoning.
It is based on Wikidata, which is the most frequently curated, openly available knowledge base.
arXiv Detail & Related papers (2022-01-15T08:49:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.