Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
- URL: http://arxiv.org/abs/2503.23899v1
- Date: Mon, 31 Mar 2025 09:48:59 GMT
- Title: Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
- Authors: Diana Galvan-Sosa, Gabrielle Gaudeau, Pride Kavumba, Yunmeng Li, Hongyi Gu, Zheng Yuan, Keisuke Sakaguchi, Paula Buttery
- Abstract summary: We present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than cohesion and word choice.
- Score: 14.64908019263248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance and usability of Large-Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than cohesion and word choice. The full dataset, rubric, and code will be made available upon acceptance.
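The abstract names three rubric dimensions outright (conciseness, cohesion, word choice). As a rough illustration of how such rubric-based quality annotation could be wired up, here is a minimal Python sketch; the dimension names, the 1-5 scale, and the injected `judge` callable are assumptions, not the paper's released rubric or code.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical rubric dimensions inferred from the abstract; the released
# rubric may define more (or differently named) criteria.
DIMENSIONS = ("conciseness", "cohesion", "word_choice")

@dataclass
class RubricScore:
    conciseness: int  # 1 (poor) .. 5 (excellent)
    cohesion: int
    word_choice: int

def score_explanation(explanation: str, judge: Callable[[str], str]) -> RubricScore:
    """Ask an LLM judge to rate one explanation on each rubric dimension."""
    prompt = (
        "Rate the following explanation from 1 (poor) to 5 (excellent) on "
        f"{', '.join(DIMENSIONS)}. Answer as comma-separated integers.\n\n"
        f"Explanation: {explanation}"
    )
    raw = judge(prompt)
    scores = [int(s) for s in raw.split(",")]
    return RubricScore(*scores)

# Stub judge so the sketch runs without an API key.
print(score_explanation("The sum is even because both addends are even.",
                        judge=lambda p: "5, 4, 5"))
```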
Related papers
- END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks.
However, they are often distracted by irrelevant or noisy context in input sequences, which degrades output quality.
We introduce Early Noise Dropping (END), a novel approach to mitigate this issue without fine-tuning the LLMs.
arXiv Detail & Related papers (2025-02-26T08:07:17Z)
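The END blurb above gives the goal (drop noisy context, no fine-tuning) but not the mechanism. A minimal sketch of the general early-filtering idea, using lexical overlap as a stand-in relevance score:

```python
def drop_noisy_chunks(query: str, chunks: list[str], keep_ratio: float = 0.5) -> list[str]:
    """Generic early-filtering sketch: score each context chunk against the
    query (here: simple token overlap) and keep only the top fraction.
    END's actual noise criterion is not described in this summary, so this
    lexical proxy is only a stand-in."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    k = max(1, int(len(chunks) * keep_ratio))
    return scored[:k]

context = ["Paris is the capital of France.",
           "Bananas are rich in potassium.",
           "France borders Spain and Italy."]
print(drop_noisy_chunks("What is the capital of France?", context))
```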
- Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings. Our method has achieved state-of-the-art performance on two common datasets.
arXiv Detail & Related papers (2024-12-24T16:38:04Z)
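The Amar summary above describes retrieving entities, relations, and subgraphs and converting each piece of retrieved text into prompt embeddings. A hedged sketch of that conversion step, with a stub encoder standing in for whatever text encoder the paper actually uses:

```python
from typing import Callable

Vector = list[float]

def build_soft_prompt(retrieved: dict[str, list[str]],
                      encode: Callable[[str], Vector]) -> list[Vector]:
    """Convert each retrieved KG piece (entity / relation / subgraph text)
    into one embedding and concatenate them into a soft-prompt prefix.
    The aspect names and `encode` are assumptions for illustration."""
    prefix = []
    for aspect in ("entities", "relations", "subgraphs"):  # multi-aspect retrieval
        for text in retrieved.get(aspect, []):
            prefix.append(encode(text))
    return prefix

# Stub encoder: tiny 4-dim vector, just to make the sketch runnable.
stub = lambda t: [float(ord(c) % 7) for c in (t + "    ")[:4]]
print(build_soft_prompt({"entities": ["Paris"], "relations": ["capital_of"]}, stub))
```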
- Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding [28.191029786204624]
We introduce the Long Question Coreference Adaptation (LQCA) method to enhance the performance of large language models (LLMs). This framework focuses on coreference resolution tailored to long contexts, allowing the model to identify and manage references effectively. Our code is public at https://github.com/OceannTwT/LQCA.
arXiv Detail & Related papers (2024-10-02T15:39:55Z)
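LQCA's summary above centers on coreference resolution tailored to long contexts. A toy sketch of the general shape (resolve references window by window, then query), with stub components; the paper's actual partitioning and mention-linking are more involved:

```python
from typing import Callable

def coref_aware_answer(document: str, question: str,
                       resolve_corefs: Callable[[str], str],
                       llm: Callable[[str], str],
                       window: int = 400) -> str:
    """Resolve pronouns inside manageable windows, then query the LLM over
    the rewritten text. Both callables here are stand-ins, not LQCA code."""
    spans = [document[i:i + window] for i in range(0, len(document), window)]
    rewritten = " ".join(resolve_corefs(s) for s in spans)
    return llm(f"Context: {rewritten}\n\nQuestion: {question}")

# Stub components keep the sketch self-contained.
naive_resolver = lambda s: s.replace("It", "The bridge")
echo_llm = lambda p: p.splitlines()[0][:60] + "..."
print(coref_aware_answer("The bridge opened in 1932. It spans the bay.",
                         "When did it open?", naive_resolver, echo_llm))
```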
- DeTriever: Decoder-representation-based Retriever for Improving NL2SQL In-Context Learning [19.93800175353809]
DeTriever is a novel demonstration retrieval framework that learns a weighted combination of hidden states.
Our method significantly outperforms the state-of-the-art baselines on one-shot NL2SQL tasks.
arXiv Detail & Related papers (2024-06-12T06:33:54Z)
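DeTriever's one-line description above, learning a weighted combination of hidden states for demonstration retrieval, translates naturally into weighted pooling plus nearest-neighbor lookup. The shapes, softmax weighting, and cosine retrieval below are assumptions layered on that one line:

```python
import numpy as np

def combine_hidden_states(layer_states: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Collapse per-layer decoder states (L x D) into one query embedding via
    a softmax-weighted sum over layers."""
    w = np.exp(weights) / np.exp(weights).sum()   # softmax over layers
    return w @ layer_states                       # shape (D,)

def retrieve_demo(query_emb: np.ndarray, demo_embs: np.ndarray) -> int:
    """Return the index of the most similar stored demonstration (cosine)."""
    sims = demo_embs @ query_emb / (
        np.linalg.norm(demo_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    return int(sims.argmax())

rng = np.random.default_rng(0)
states = rng.normal(size=(12, 64))            # 12 layers, 64-dim states
q = combine_hidden_states(states, rng.normal(size=12))
print(retrieve_demo(q, rng.normal(size=(5, 64))))
```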
- CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs) using a well-organized taxonomy.
Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
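A benchmark like CLAMBER ultimately needs a check of whether a model asks for clarification when a query is labeled ambiguous. The crude proxy below (a response counts as clarifying if it poses a question) is illustrative only; CLAMBER's taxonomy-driven evaluation is richer:

```python
from typing import Callable

def asks_clarification(response: str) -> bool:
    """Crude proxy: treat a response as 'clarifying' if it poses a question."""
    return "?" in response

def evaluate(benchmark: list[dict], llm: Callable[[str], str]) -> float:
    """Fraction of items where the model behaves as labeled: asks for
    clarification on ambiguous queries, answers directly otherwise."""
    hits = sum(asks_clarification(llm(x["query"])) == x["ambiguous"] for x in benchmark)
    return hits / len(benchmark)

data = [{"query": "Book a table for us", "ambiguous": True},
        {"query": "What is 2 + 2?", "ambiguous": False}]
print(evaluate(data, llm=lambda q: "For how many people, and when?" if "table" in q else "4"))
```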
- Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales.
A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing less supervised data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z)
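One common way to exploit the 10,000 unannotated PuzzleBen questions mentioned above is self-consistency pseudo-labeling; the sketch below shows that generic recipe, which may differ from the paper's actual weak-supervision method:

```python
import random
from collections import Counter
from typing import Callable

def pseudo_label(question: str, sample: Callable[[str], str],
                 n: int = 5, min_agree: float = 0.6):
    """Sample several answers to an unannotated question and keep the
    majority answer only if agreement is high; otherwise discard it."""
    answers = Counter(sample(question) for _ in range(n))
    best, count = answers.most_common(1)[0]
    return best if count / n >= min_agree else None

random.seed(0)
noisy = lambda q: random.choice(["12", "12", "12", "13"])  # stub sampler
print(pseudo_label("What is 5 + 7?", noisy))  # '12' when agreement >= 60%
```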
- XplainLLM: A QA Explanation Dataset for Understanding LLM Decision-Making [13.928951741632815]
Large Language Models (LLMs) have recently made impressive strides in natural language understanding tasks.
In this paper, we look into bringing some transparency to this process by introducing a new explanation dataset.
Our dataset includes 12,102 question-answer-explanation (QAE) triples.
arXiv Detail & Related papers (2023-11-15T00:34:28Z)
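The unit of XplainLLM is the question-answer-explanation (QAE) triple. A minimal record type for such data; the field names are assumptions, not the dataset's published schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class QAETriple:
    """One question-answer-explanation record, the unit XplainLLM's 12,102
    instances are built from (hypothetical field names)."""
    question: str
    answer: str
    explanation: str

record = QAETriple(
    question="Why does ice float on water?",
    answer="Ice is less dense than liquid water.",
    explanation="Freezing arranges water molecules into an open lattice, "
                "lowering density, so ice displaces its weight and floats.")
print(json.dumps(asdict(record), indent=2))
```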
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Towards LLM-guided Causal Explainability for Black-box Text Classifiers [16.36602400590088]
We aim to leverage the instruction-following and textual understanding capabilities of recent Large Language Models to facilitate causal explainability.
We propose a three-step pipeline in which we use an off-the-shelf LLM to identify the latent or unobserved features in the input text.
We experiment with our pipeline on multiple NLP text classification datasets, and present interesting and promising findings.
arXiv Detail & Related papers (2023-09-23T11:22:28Z)
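The three-step pipeline sketched in the summary above (identify latent features with an off-the-shelf LLM, intervene, observe the black-box classifier) can be mocked up as follows; the prompt wording and the single-feature simplification are assumptions:

```python
from typing import Callable

def causal_explain(text: str,
                   llm: Callable[[str], str],
                   classifier: Callable[[str], str]) -> dict:
    """Identify a latent feature, intervene on it via an LLM-written
    counterfactual, and check whether the black-box label flips."""
    feature = llm(f"Name one latent feature driving this text: {text}")
    counterfactual = llm(f"Rewrite the text changing only '{feature}': {text}")
    flipped = classifier(text) != classifier(counterfactual)
    return {"feature": feature, "counterfactual": counterfactual,
            "causally_relevant": flipped}

# Stubs so the sketch runs end-to-end.
fake_llm = lambda p: "sentiment" if "latent" in p else "The food was awful."
fake_clf = lambda t: "negative" if "awful" in t else "positive"
print(causal_explain("The food was wonderful.", fake_llm, fake_clf))
```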
- Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a black-box fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z)
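Optimizing explanation-infused prompts in a black-box fashion over unlabeled data implies a proxy score computed without gold answers. The sketch below uses cross-candidate majority agreement as that proxy; the paper's actual selection metric may differ:

```python
from collections import Counter
from typing import Callable

def select_explanation(candidates: list[str], unlabeled: list[str],
                       llm: Callable[[str], str]) -> str:
    """Score each candidate explanation by how often the prompt it induces
    agrees with the majority answer across candidates, then pick the best."""
    answers = {c: [llm(f"{c}\nQ: {q}\nA:") for q in unlabeled] for c in candidates}
    scores = {}
    for i, q in enumerate(unlabeled):
        majority, _ = Counter(a[i] for a in answers.values()).most_common(1)[0]
        for c in candidates:
            scores[c] = scores.get(c, 0) + (answers[c][i] == majority)
    return max(candidates, key=scores.get)

# Stub LLM keeps the sketch self-contained.
demo_llm = lambda p: "yes" if "well-formed" in p else "no"
print(select_explanation(["A well-formed explanation.", "A terse one."],
                         ["Is the sky blue?"], demo_llm))
```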
This list is automatically generated from the titles and abstracts of the papers on this site.