Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
- URL: http://arxiv.org/abs/2503.23899v1
- Date: Mon, 31 Mar 2025 09:48:59 GMT
- Title: Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
- Authors: Diana Galvan-Sosa, Gabrielle Gaudeau, Pride Kavumba, Yunmeng Li, Hongyi Gu, Zheng Yuan, Keisuke Sakaguchi, Paula Buttery
- Abstract summary: We present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than cohesion and word choice.
- Score: 14.64908019263248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance and usability of Large-Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than cohesion and word choice. The full dataset, rubric, and code will be made available upon acceptance.
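The abstract names three rubric dimensions outright (conciseness, cohesion, word choice). As a rough illustration of how such rubric-based quality annotation could be wired up, here is a minimal Python sketch; the dimension names, the 1-5 scale, and the injected `judge` callable are assumptions, not the paper's released rubric or code.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical rubric dimensions inferred from the abstract; the released
# rubric may define more (or differently named) criteria.
DIMENSIONS = ("conciseness", "cohesion", "word_choice")

@dataclass
class RubricScore:
    conciseness: int  # 1 (poor) .. 5 (excellent)
    cohesion: int
    word_choice: int

def score_explanation(explanation: str, judge: Callable[[str], str]) -> RubricScore:
    """Ask an LLM judge to rate one explanation on each rubric dimension."""
    prompt = (
        "Rate the following explanation from 1 (poor) to 5 (excellent) on "
        f"{', '.join(DIMENSIONS)}. Answer as comma-separated integers.\n\n"
        f"Explanation: {explanation}"
    )
    raw = judge(prompt)
    scores = [int(s) for s in raw.split(",")]
    return RubricScore(*scores)

# Stub judge so the sketch runs without an API key.
print(score_explanation("The sum is even because both addends are even.",
                        judge=lambda p: "5, 4, 5"))
```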
Related papers
- END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks.
However, they are often distracted by irrelevant or noisy context in input sequences, which degrades output quality.
We introduce Early Noise Dropping (END), a novel approach to mitigate this issue without fine-tuning the LLMs.
arXiv Detail & Related papers (2025-02-26T08:07:17Z)
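The END blurb above gives the goal (drop noisy context, no fine-tuning) but not the mechanism. A minimal sketch of the general early-filtering idea, using lexical overlap as a stand-in relevance score:

```python
def drop_noisy_chunks(query: str, chunks: list[str], keep_ratio: float = 0.5) -> list[str]:
    """Generic early-filtering sketch: score each context chunk against the
    query (here: simple token overlap) and keep only the top fraction.
    END's actual noise criterion is not described in this summary, so this
    lexical proxy is only a stand-in."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    k = max(1, int(len(chunks) * keep_ratio))
    return scored[:k]

context = ["Paris is the capital of France.",
           "Bananas are rich in potassium.",
           "France borders Spain and Italy."]
print(drop_noisy_chunks("What is the capital of France?", context))
```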
- Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings. Our method has achieved state-of-the-art performance on two common datasets.
arXiv Detail & Related papers (2024-12-24T16:38:04Z)
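The Amar summary above describes retrieving entities, relations, and subgraphs and converting each piece of retrieved text into prompt embeddings. A hedged sketch of that conversion step, with a stub encoder standing in for whatever text encoder the paper actually uses:

```python
from typing import Callable

Vector = list[float]

def build_soft_prompt(retrieved: dict[str, list[str]],
                      encode: Callable[[str], Vector]) -> list[Vector]:
    """Convert each retrieved KG piece (entity / relation / subgraph text)
    into one embedding and concatenate them into a soft-prompt prefix.
    The aspect names and `encode` are assumptions for illustration."""
    prefix = []
    for aspect in ("entities", "relations", "subgraphs"):  # multi-aspect retrieval
        for text in retrieved.get(aspect, []):
            prefix.append(encode(text))
    return prefix

# Stub encoder: tiny 4-dim vector, just to make the sketch runnable.
stub = lambda t: [float(ord(c) % 7) for c in (t + "    ")[:4]]
print(build_soft_prompt({"entities": ["Paris"], "relations": ["capital_of"]}, stub))
```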
- Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding [28.191029786204624]
We introduce the Long Question Coreference Adaptation (LQCA) method to enhance the performance of large language models (LLMs). This framework focuses on coreference resolution tailored to long contexts, allowing the model to identify and manage references effectively. Our code is public at https://github.com/OceannTwT/LQCA.
arXiv Detail & Related papers (2024-10-02T15:39:55Z)
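LQCA's summary above centers on coreference resolution tailored to long contexts. A toy sketch of the general shape (resolve references window by window, then query), with stub components; the paper's actual partitioning and mention-linking are more involved:

```python
from typing import Callable

def coref_aware_answer(document: str, question: str,
                       resolve_corefs: Callable[[str], str],
                       llm: Callable[[str], str],
                       window: int = 400) -> str:
    """Resolve pronouns inside manageable windows, then query the LLM over
    the rewritten text. Both callables here are stand-ins, not LQCA code."""
    spans = [document[i:i + window] for i in range(0, len(document), window)]
    rewritten = " ".join(resolve_corefs(s) for s in spans)
    return llm(f"Context: {rewritten}\n\nQuestion: {question}")

# Stub components keep the sketch self-contained.
naive_resolver = lambda s: s.replace("It", "The bridge")
echo_llm = lambda p: p.splitlines()[0][:60] + "..."
print(coref_aware_answer("The bridge opened in 1932. It spans the bay.",
                         "When did it open?", naive_resolver, echo_llm))
```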
- DeTriever: Decoder-representation-based Retriever for Improving NL2SQL In-Context Learning [19.93800175353809]
DeTriever is a novel demonstration retrieval framework that learns a weighted combination of hidden states.
Our method significantly outperforms the state-of-the-art baselines on one-shot NL2SQL tasks.
arXiv Detail & Related papers (2024-06-12T06:33:54Z)
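DeTriever's one-line description above, learning a weighted combination of hidden states for demonstration retrieval, translates naturally into weighted pooling plus nearest-neighbor lookup. The shapes, softmax weighting, and cosine retrieval below are assumptions layered on that one line:

```python
import numpy as np

def combine_hidden_states(layer_states: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Collapse per-layer decoder states (L x D) into one query embedding via
    a softmax-weighted sum over layers."""
    w = np.exp(weights) / np.exp(weights).sum()   # softmax over layers
    return w @ layer_states                       # shape (D,)

def retrieve_demo(query_emb: np.ndarray, demo_embs: np.ndarray) -> int:
    """Return the index of the most similar stored demonstration (cosine)."""
    sims = demo_embs @ query_emb / (
        np.linalg.norm(demo_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    return int(sims.argmax())

rng = np.random.default_rng(0)
states = rng.normal(size=(12, 64))            # 12 layers, 64-dim states
q = combine_hidden_states(states, rng.normal(size=12))
print(retrieve_demo(q, rng.normal(size=(5, 64))))
```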
- CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs) using a well-organized taxonomy.
Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
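A benchmark like CLAMBER ultimately needs a check of whether a model asks for clarification when a query is labeled ambiguous. The crude proxy below (a response counts as clarifying if it poses a question) is illustrative only; CLAMBER's taxonomy-driven evaluation is richer:

```python
from typing import Callable

def asks_clarification(response: str) -> bool:
    """Crude proxy: treat a response as 'clarifying' if it poses a question."""
    return "?" in response

def evaluate(benchmark: list[dict], llm: Callable[[str], str]) -> float:
    """Fraction of items where the model behaves as labeled: asks for
    clarification on ambiguous queries, answers directly otherwise."""
    hits = sum(asks_clarification(llm(x["query"])) == x["ambiguous"] for x in benchmark)
    return hits / len(benchmark)

data = [{"query": "Book a table for us", "ambiguous": True},
        {"query": "What is 2 + 2?", "ambiguous": False}]
print(evaluate(data, llm=lambda q: "For how many people, and when?" if "table" in q else "4"))
```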
- Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales.
A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing less supervised data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z)
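One common way to exploit the 10,000 unannotated PuzzleBen questions mentioned above is self-consistency pseudo-labeling; the sketch below shows that generic recipe, which may differ from the paper's actual weak-supervision method:

```python
import random
from collections import Counter
from typing import Callable

def pseudo_label(question: str, sample: Callable[[str], str],
                 n: int = 5, min_agree: float = 0.6):
    """Sample several answers to an unannotated question and keep the
    majority answer only if agreement is high; otherwise discard it."""
    answers = Counter(sample(question) for _ in range(n))
    best, count = answers.most_common(1)[0]
    return best if count / n >= min_agree else None

random.seed(0)
noisy = lambda q: random.choice(["12", "12", "12", "13"])  # stub sampler
print(pseudo_label("What is 5 + 7?", noisy))  # '12' when agreement >= 60%
```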
- XplainLLM: A QA Explanation Dataset for Understanding LLM Decision-Making [13.928951741632815]
Large Language Models (LLMs) have recently made impressive strides in natural language understanding tasks.
In this paper, we look into bringing some transparency to this process by introducing a new explanation dataset.
Our dataset includes 12,102 question-answer-explanation (QAE) triples.
arXiv Detail & Related papers (2023-11-15T00:34:28Z)
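The unit of XplainLLM is the question-answer-explanation (QAE) triple. A minimal record type for such data; the field names are assumptions, not the dataset's published schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class QAETriple:
    """One question-answer-explanation record, the unit XplainLLM's 12,102
    instances are built from (hypothetical field names)."""
    question: str
    answer: str
    explanation: str

record = QAETriple(
    question="Why does ice float on water?",
    answer="Ice is less dense than liquid water.",
    explanation="Freezing arranges water molecules into an open lattice, "
                "lowering density, so ice displaces its weight and floats.")
print(json.dumps(asdict(record), indent=2))
```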
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Towards LLM-guided Causal Explainability for Black-box Text Classifiers [16.36602400590088]
We aim to leverage the instruction-following and textual understanding capabilities of recent Large Language Models to facilitate causal explainability.
We propose a three-step pipeline in which we use an off-the-shelf LLM to identify the latent or unobserved features in the input text.
We experiment with our pipeline on multiple NLP text classification datasets, and present interesting and promising findings.
arXiv Detail & Related papers (2023-09-23T11:22:28Z)
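The three-step pipeline sketched in the summary above (identify latent features with an off-the-shelf LLM, intervene, observe the black-box classifier) can be mocked up as follows; the prompt wording and the single-feature simplification are assumptions:

```python
from typing import Callable

def causal_explain(text: str,
                   llm: Callable[[str], str],
                   classifier: Callable[[str], str]) -> dict:
    """Identify a latent feature, intervene on it via an LLM-written
    counterfactual, and check whether the black-box label flips."""
    feature = llm(f"Name one latent feature driving this text: {text}")
    counterfactual = llm(f"Rewrite the text changing only '{feature}': {text}")
    flipped = classifier(text) != classifier(counterfactual)
    return {"feature": feature, "counterfactual": counterfactual,
            "causally_relevant": flipped}

# Stubs so the sketch runs end-to-end.
fake_llm = lambda p: "sentiment" if "latent" in p else "The food was awful."
fake_clf = lambda t: "negative" if "awful" in t else "positive"
print(causal_explain("The food was wonderful.", fake_llm, fake_clf))
```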
- Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a black-box fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z)
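Optimizing explanation-infused prompts in a black-box fashion over unlabeled data implies a proxy score computed without gold answers. The sketch below uses cross-candidate majority agreement as that proxy; the paper's actual selection metric may differ:

```python
from collections import Counter
from typing import Callable

def select_explanation(candidates: list[str], unlabeled: list[str],
                       llm: Callable[[str], str]) -> str:
    """Score each candidate explanation by how often the prompt it induces
    agrees with the majority answer across candidates, then pick the best."""
    answers = {c: [llm(f"{c}\nQ: {q}\nA:") for q in unlabeled] for c in candidates}
    scores = {}
    for i, q in enumerate(unlabeled):
        majority, _ = Counter(a[i] for a in answers.values()).most_common(1)[0]
        for c in candidates:
            scores[c] = scores.get(c, 0) + (answers[c][i] == majority)
    return max(candidates, key=scores.get)

# Stub LLM keeps the sketch self-contained.
demo_llm = lambda p: "yes" if "well-formed" in p else "no"
print(select_explanation(["A well-formed explanation.", "A terse one."],
                         ["Is the sky blue?"], demo_llm))
```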
This list is automatically generated from the titles and abstracts of the papers on this site.