Related papers: Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation

Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation

URL: http://arxiv.org/abs/2505.23368v2
Date: Tue, 03 Jun 2025 09:45:05 GMT
Title: Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation
Authors: Beiduo Chen, Yang Janet Liu, Anna Korhonen, Barbara Plank,
Abstract summary: Large Language Models (LLMs) generate chains of thought (CoTs) before giving the final answer.<n>We propose a novel pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option.<n>We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores.
Score: 44.25455164977285
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The recent rise of reasoning-tuned Large Language Models (LLMs)--which generate chains of thought (CoTs) before giving the final answer--has attracted significant attention and offers new opportunities for gaining insights into human label variation, which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but typically adopt a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option, before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.

Related papers

Discrete Subgraph Sampling for Interpretable Graph based Visual Question Answering [27.193336817953142]
We integrate different discrete subset sampling methods into a graph-based visual question answering system.<n>We show that the integrated methods effectively mitigate the performance trade-off between interpretability and answer accuracy.<n>We also conduct a human evaluation to assess the interpretability of the generated subgraphs.
arXiv Detail & Related papers (2024-12-11T10:18:37Z)
Thought-Path Contrastive Learning via Premise-Oriented Data Augmentation for Logical Reading Comprehension [9.67774998354062]
Previous research has primarily focused on enhancing logical reasoning capabilities through Chain-of-Thought (CoT) or data augmentation.<n>We propose a Premise-Oriented Data Augmentation (PODA) framework to generate CoT rationales including analyses for both correct and incorrect options.<n>We also introduce a novel thought-path contrastive learning method that compares reasoning paths between the original and counterfactual samples.
arXiv Detail & Related papers (2024-09-22T15:44:43Z)
REAL: Response Embedding-based Alignment for LLMs [1.9513983244114355]
We propose a strategy for constructing a high-quality training dataset that focuses on acquiring the less ambiguous preference pairs.<n>Experiments show that choosing dissimilar response pairs enhances the direct alignment of LLMs.<n>Findings suggest that focusing on distinct pairs can reduce the label error and improve LLM alignment efficiency.
arXiv Detail & Related papers (2024-09-17T22:40:54Z)
Self-Consistent Decoding for More Factual Open Responses [28.184313177333642]
"Sample & Select" improves factuality by a 30% relative margin against decoders of DoLA, P-CRR, and S-CRR. We collect human verifications of the generated summaries, confirming the factual superiority of our method.
arXiv Detail & Related papers (2024-03-01T17:31:09Z)
Hierarchical Indexing for Retrieval-Augmented Opinion Summarization [60.5923941324953]
We propose a method for unsupervised abstractive opinion summarization that combines the attributability and scalability of extractive approaches with the coherence and fluency of Large Language Models (LLMs) Our method, HIRO, learns an index structure that maps sentences to a path through a semantically organized discrete hierarchy. At inference time, we populate the index and use it to identify and retrieve clusters of sentences containing popular opinions from input reviews.
arXiv Detail & Related papers (2024-03-01T10:38:07Z)
IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models [63.15355173909631]
This paper introduces an influence-driven selective annotation method.<n>It aims to minimize annotation costs while improving the quality of in-context examples.<n> Experiments confirm the superiority of the proposed method on various benchmarks.
arXiv Detail & Related papers (2023-10-16T22:53:54Z)
Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs) We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance. This paper tackles the problem of how to optimize explanation-infused prompts in a blackbox fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z)
Neighbour Consistency Guided Pseudo-Label Refinement for Unsupervised Person Re-Identification [80.98291772215154]
Unsupervised person re-identification (ReID) aims at learning discriminative identity features for person retrieval without any annotations. Recent advances accomplish this task by leveraging clustering-based pseudo labels. We propose a Neighbour Consistency guided Pseudo Label Refinement framework.
arXiv Detail & Related papers (2022-11-30T09:39:57Z)
A Semantic-based Method for Unsupervised Commonsense Question Answering [40.18557352036813]
Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data. We present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering.
arXiv Detail & Related papers (2021-05-31T08:21:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.