Related papers: From Data to Commonsense Reasoning: The Use of Large Language Models for Explainable AI

URL: http://arxiv.org/abs/2407.03778v1
Date: Thu, 4 Jul 2024 09:38:49 GMT
Title: From Data to Commonsense Reasoning: The Use of Large Language Models for Explainable AI
Authors: Stefanie Krause, Frieder Stolzenburg
Abstract summary: We study the effectiveness of large language models (LLMs) on different question answering tasks. We demonstrate the ability of LLMs to reason with commonsense as the models outperform humans on different datasets. Our questionnaire revealed that 66% of participants rated GPT-3.5's explanations as either "good" or "excellent".
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Commonsense reasoning is a difficult task for a computer, but a critical skill for an artificial intelligence (AI). It can enhance the explainability of AI models by enabling them to provide intuitive and human-like explanations for their decisions. This is necessary in many areas, especially in question answering (QA), which is one of the most important tasks of natural language processing (NLP). Over time, a multitude of methods have emerged for solving commonsense reasoning problems, such as knowledge-based approaches using formal logic or linguistic analysis. In this paper, we investigate the effectiveness of large language models (LLMs) on different QA tasks with a focus on their abilities in reasoning and explainability. We study three LLMs: GPT-3.5, Gemma and Llama 3. We further evaluate the LLM results by means of a questionnaire. We demonstrate the ability of LLMs to reason with commonsense as the models outperform humans on different datasets. While GPT-3.5's accuracy ranges from 56% to 93% on various QA benchmarks, Llama 3 achieved a mean accuracy of 90% on all eleven datasets. Llama 3 thereby outperforms humans on all datasets, with an average of 21% higher accuracy over ten datasets. Furthermore, we find that, in the sense of explainable artificial intelligence (XAI), GPT-3.5 provides good explanations for its decisions. Our questionnaire revealed that 66% of participants rated GPT-3.5's explanations as either "good" or "excellent". Taken together, these findings enrich our understanding of current LLMs and pave the way for future investigations of reasoning and explainability.

Related papers

LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning [49.72868038180909]
We present WorldQA, a video dataset designed to push the boundaries of multimodal world models. We identify five essential types of world knowledge for question formulation. We introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain.
arXiv Detail & Related papers (2024-05-06T08:42:34Z)
CAUS: A Dataset for Question Generation based on Human Cognition Leveraging Large Language Models [4.962252439662465]
We introduce the Curious About Uncertain Scene dataset to enable Large Language Models to emulate human cognitive processes for resolving uncertainties. Our approach involves providing scene descriptions embedded with uncertainties to stimulate the generation of reasoning and queries. Our results demonstrate that GPT-4 can effectively generate pertinent questions and grasp their nuances, particularly when given appropriate context and instructions.
arXiv Detail & Related papers (2024-04-18T01:31:19Z)
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data [89.2410799619405]
We introduce the Quantitative Reasoning with Data benchmark to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data. The benchmark comprises a dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers. To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText.
arXiv Detail & Related papers (2024-02-27T16:15:03Z)
XplainLLM: A QA Explanation Dataset for Understanding LLM Decision-Making [13.928951741632815]
Large Language Models (LLMs) have recently made impressive strides in natural language understanding tasks. In this paper, we look into bringing some transparency to this process by introducing a new explanation dataset. Our dataset includes 12,102 question-answer-explanation (QAE) triples.
arXiv Detail & Related papers (2023-11-15T00:34:28Z)
AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models [3.518832148294879]
Evaluating the question-solving abilities of large language models like ChatGPT is a widely discussed but challenging problem. We propose AGIBench -- a multi-granularity, multimodal, human-referenced, and auto-scoring benchmarking methodology for LLMs.
arXiv Detail & Related papers (2023-09-05T13:43:37Z)
Explanations from Large Language Models Make Small Reasoners Better [61.991772773700006]
We show that our method can consistently and significantly outperform finetuning baselines across different settings. As a side benefit, human evaluation shows that our method can generate high-quality explanations to justify its predictions.
arXiv Detail & Related papers (2022-10-13T04:50:02Z)
On Explainability in AI-Solutions: A Cross-Domain Survey [4.394025678691688]
In automatically deriving a system model, AI algorithms learn relations in data that are not detectable by humans. The more complex a model, the more difficult it is for a human to understand the reasoning behind its decisions. This work provides an extensive survey of literature on this topic, which, in large part, consists of other surveys.
arXiv Detail & Related papers (2022-10-11T06:21:47Z)
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We show that SQA improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks [37.730939229638224]
We propose NumGLUE, a benchmark that evaluates the performance of AI systems on eight different tasks. We show that this benchmark is far from being solved with neural models including state-of-the-art large-scale language models. We hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language.
arXiv Detail & Related papers (2022-04-12T09:36:10Z)
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering [55.05667583529711]
This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions. Our method leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model.
arXiv Detail & Related papers (2020-04-21T17:03:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.