CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation
Verification
- URL: http://arxiv.org/abs/2303.03628v1
- Date: Tue, 7 Mar 2023 03:23:14 GMT
- Title: CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation
Verification
- Authors: Seungone Kim, Se June Joo, Yul Jang, Hyungjoo Chae, Jinyoung Yeo
- Abstract summary: Chain-of-thought (CoT) prompting enables large language models (LLMs) to solve complex reasoning tasks by generating an explanation before the final prediction.
Despite its promising ability, a critical downside of CoT prompting is that the performance is greatly affected by the factuality of the generated explanation.
To improve the correctness of the explanations, fine-tuning language models with explanation data is needed.
CoTEVer is a tool-kit for annotating the factual correctness of generated explanations and collecting revision data of wrong explanations.
- Score: 1.658938566492109
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-thought (CoT) prompting enables large language models (LLMs) to
solve complex reasoning tasks by generating an explanation before the final
prediction. Despite its promising ability, a critical downside of CoT
prompting is that the performance is greatly affected by the factuality of the
generated explanation. To improve the correctness of the explanations,
fine-tuning language models with explanation data is needed. However, there
exist only a few datasets that can be used for such approaches, and no data
collection tool for building them. Thus, we introduce CoTEVer, a tool-kit for
annotating the factual correctness of generated explanations and collecting
revision data of wrong explanations. Furthermore, we suggest several use cases
where the data collected with CoTEVer can be utilized for enhancing the
faithfulness of explanations. Our toolkit is publicly available at
https://github.com/SeungoneKim/CoTEVer.
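To make the setup concrete, here is a minimal sketch of the explanation-before-answer format that CoT prompting relies on, together with the kind of revision record an annotation pass over wrong explanations could collect. The prompt wording and the record layout are illustrative assumptions, not CoTEVer's actual schema.
```python
# Minimal sketch of CoT prompting: the model produces an explanation
# before its final answer. Prompt wording and the revision-record
# layout are illustrative, not CoTEVer's actual schema.

COT_PROMPT = """Q: A shop sells pens at 3 for $2. How much do 12 pens cost?
A: Let's think step by step. 12 pens is 4 groups of 3 pens.
Each group costs $2, so the total is 4 * $2 = $8. The answer is $8.

Q: {question}
A: Let's think step by step."""

def build_cot_prompt(question: str) -> str:
    """Insert a new question into the few-shot CoT template."""
    return COT_PROMPT.format(question=question)

# The kind of record a CoTEVer-style annotation pass could collect:
# the generated explanation, an annotator's factuality verdict, and a
# revised explanation for wrong cases.
revision_record = {
    "question": "A shop sells pens at 3 for $2. How much do 15 pens cost?",
    "generated_explanation": "15 pens is 6 groups of 3, so 6 * $2 = $12.",
    "is_factual": False,
    "revised_explanation": "15 pens is 5 groups of 3, so 5 * $2 = $10.",
}
```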
Related papers
- Efficient and Accurate Explanation Estimation with Distribution Compression [17.299418894910627]
We introduce Compress Then Explain (CTE), a new paradigm for more efficient and accurate explanation estimation.
CTE uses distribution compression through kernel thinning to obtain a data sample that best approximates the marginal distribution.
It often achieves on-par explanation approximation error using 2-3x fewer samples, i.e. requiring 2-3x fewer model evaluations (a sketch of the compression step follows below).
arXiv Detail & Related papers (2024-06-26T13:21:24Z)
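As a rough illustration of the compress-then-explain recipe, the sketch below selects a small subsample whose empirical distribution approximates the full data under an RBF kernel, using greedy kernel herding as a simple stand-in for the paper's kernel thinning step; the explainer is then run on the compressed sample only.
```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF kernel matrix between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def herd_compress(X, m, gamma=1.0):
    """Greedily pick m rows of X whose empirical distribution matches
    X's under the kernel (kernel herding; CTE itself uses kernel
    thinning, a different compressor with the same goal)."""
    K = rbf(X, X, gamma)
    mu = K.mean(axis=1)            # kernel mean embedding at each candidate
    chosen, running = [], np.zeros(len(X))
    for t in range(m):
        scores = mu - running / (t + 1)
        scores[chosen] = -np.inf   # sample without replacement
        j = int(np.argmax(scores))
        chosen.append(j)
        running += K[:, j]
    return np.array(chosen)

X = np.random.randn(1000, 5)
X_small = X[herd_compress(X, m=100)]  # pass X_small to the explainer
```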
- A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning [48.51969964676017]
Chain-of-Thought (CoT) holds a significant place in augmenting the reasoning performance of large language models.
We propose a Read-and-Control approach for controlling the accuracy of CoT.
arXiv Detail & Related papers (2024-06-18T04:07:13Z)
- Better patching using LLM prompting, via Self-Consistency [5.892272127970584]
Self-consistency (S-C) is an exciting technique that yields substantially better explanations for problems.
This paper describes an application of the S-C approach to program repair, using the commit log on the fix as the explanation.
We achieve state-of-the-art results, beating previous approaches to prompting-based program repair on the MODIT dataset. A sketch of S-C's majority vote follows below.
arXiv Detail & Related papers (2023-05-31T18:28:46Z)
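The core of the S-C recipe is easy to sketch: draw several stochastic reasoning samples for the same input and keep the majority answer. The sampler below is a hypothetical stand-in for real LLM calls.
```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stand-in for one stochastic LLM sample
    (temperature > 0); a real system would parse the final answer
    out of a sampled chain of thought."""
    return random.choice(["$8", "$8", "$8", "$12"])  # noisy, mostly right

def self_consistent_answer(question: str, n: int = 10) -> str:
    """Self-consistency (S-C): sample n reasoning paths for the same
    input and majority-vote their final answers."""
    votes = Counter(sample_answer(question) for _ in range(n))
    return votes.most_common(1)[0][0]

print(self_consistent_answer("How much do 12 pens cost?"))
```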
- Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting [43.458726163197824]
Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output.
We find that CoT explanations can systematically misrepresent the true reason for a model's prediction.
arXiv Detail & Related papers (2023-05-07T22:44:25Z)
- Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a blackbox fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z)
- ExaRanker: Explanation-Augmented Neural Ranker [67.4894325619275]
In this work, we show that neural rankers also benefit from explanations.
We use LLMs such as GPT-3.5 to augment retrieval datasets with explanations.
Our model, dubbed ExaRanker, finetuned on a few thousand examples with synthetic explanations, performs on par with models finetuned on 3x more examples without explanations. A sketch of the augmentation step follows below.
arXiv Detail & Related papers (2023-01-25T11:03:04Z)
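A sketch of what that augmentation step might look like: an LLM labels each (query, passage) pair and justifies the label, and the ranker is finetuned to generate the label first so relevance can still be read off the first output token. `ask_llm` is a hypothetical stand-in for a real GPT-3.5 call.
```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real GPT-3.5 API call."""
    return "true. The passage directly answers the query."

def build_training_example(query: str, passage: str) -> dict:
    """Augment a (query, passage) pair with an LLM-written relevance
    label plus explanation, ExaRanker-style."""
    prompt = (
        f"Query: {query}\nPassage: {passage}\n"
        "Is the passage relevant to the query? "
        "Answer true or false, then explain why."
    )
    # Label first, explanation after, so relevance can still be read
    # off the first generated token of a seq2seq ranker.
    return {
        "input": f"Query: {query} Document: {passage} Relevant:",
        "target": ask_llm(prompt),
    }

print(build_training_example(
    "what causes rainbows",
    "Rainbows appear when sunlight is refracted by water droplets.",
))
```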
- What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary [68.77983831618685]
We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space.
We show that the resulting projections contain rich semantic information, and draw a connection between them and sparse retrieval. A sketch of the projection follows below.
arXiv Detail & Related papers (2022-12-20T16:03:25Z)
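A minimal sketch of the projection, assuming a BERT-style encoder with tied input embeddings; the paper studies dual-encoder retrievers, for which the same dot product against the embedding matrix applies.
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Project an encoder's dense vector into vocabulary space by scoring it
# against the input embedding matrix, then read off the top tokens.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("what causes rainbows", return_tensors="pt")
with torch.no_grad():
    vec = model(**inputs).last_hidden_state[:, 0]  # [CLS] vector, (1, d)
    E = model.get_input_embeddings().weight        # (|V|, d)
    logits = vec @ E.T                             # score every vocab token
print(tok.convert_ids_to_tokens(logits[0].topk(5).indices.tolist()))
```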
- Grounded Keys-to-Text Generation: Towards Factual Open-Ended Generation [92.1582872870226]
We propose a new grounded keys-to-text generation task.
The task is to generate a factual description of an entity given a set of guiding keys and grounding passages.
Inspired by recent QA-based evaluation measures, we propose an automatic metric, MAFE, for factual correctness of generated descriptions.
arXiv Detail & Related papers (2022-12-04T23:59:41Z)
- Task-Agnostic Graph Explanations [50.17442349253348]
Graph Neural Networks (GNNs) have emerged as powerful tools to encode graph structured data.
Existing learning-based GNN explanation approaches are task-specific in training.
We propose a Task-Agnostic GNN Explainer (TAGE) trained under self-supervision with no knowledge of downstream tasks.
arXiv Detail & Related papers (2022-02-16T21:11:47Z)
- Explaining Inference Queries with Bayesian Optimization [16.448164301763168]
Inference query explanation seeks to explain unexpected aggregate query results on inference data.
An explanation may need to be derived from the source, training, or inference data in an ML pipeline.
We propose BOExplain, a novel framework for explaining inference queries using Bayesian optimization (BO). A toy sketch follows below.
arXiv Detail & Related papers (2021-02-10T08:08:32Z)
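A toy sketch of the idea, assuming scikit-optimize's gp_minimize and a synthetic table: BO searches over a predicate family (here an age interval) for the filter whose removal brings an unexpected aggregate back to its expected value. The data, predicate family, and target are illustrative, not the paper's setup.
```python
import numpy as np
from skopt import gp_minimize  # pip install scikit-optimize

rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=2000)
score = (age > 55).astype(float) + rng.normal(0, 0.3, size=2000)
EXPECTED_MEAN = 0.0  # the analyst expected scores centered at zero

def objective(bounds):
    """Error of the aggregate after removing rows with age in [lo, hi]."""
    lo, hi = sorted(bounds)
    keep = (age < lo) | (age > hi)
    if not keep.any():             # degenerate predicate removes everything
        return 1.0
    return float(abs(score[keep].mean() - EXPECTED_MEAN))

res = gp_minimize(objective, [(18, 80), (18, 80)], n_calls=30, random_state=0)
print("explanatory age range:", sorted(res.x), "residual error:", res.fun)
```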
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.