Related papers: Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning

Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning

URL: http://arxiv.org/abs/2507.00214v1
Date: Mon, 30 Jun 2025 19:34:57 GMT
Title: Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning
Authors: Mads Henrichsen, Rasmus Krebs,
Abstract summary: This paper introduces a novel two-stage approach to enhance text classification by leveraging Large Language Model (LLM)-generated reasonings.<n>In the first stage, we fine-tune a Llama-3.2-1B-Instruct model (henceforth Llama-R-Gen) on a general-purpose reasoning dataset.<n>In the second stage, this generally trained Llama-R-Gen is used offline to create an augmented training dataset for a downstream generative model.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Standard classification models often map inputs directly to labels without explicit reasoning, potentially limiting their performance, robustness, and interpretability. This paper introduces a novel two-stage approach to enhance text classification by leveraging Large Language Model (LLM)-generated reasonings. In the first stage, we fine-tune a Llama-3.2-1B-Instruct model (henceforth Llama-R-Gen) on a general-purpose reasoning dataset (syvai/reasoning-gen) to generate textual reasoning (R) given a question and its answer. In the second stage, this generally trained Llama-R-Gen is used offline to create an augmented training dataset for a downstream generative model. This downstream model, based on Llama-3.2-1B-Instruct, takes only the input text (Q) and is trained to output the generated reasoning (R) immediately followed by the predicted emotion (A). We demonstrate this methodology on the dair-ai/emotion dataset for emotion classification. Our experiments show that the generative model trained to output reasoning and the emotion (Classifier Q->RA) achieves a significant improvement of 8.7 percentage points in accuracy (for emotion prediction) compared to a baseline generative model trained solely to output the emotion (Classifier Q->A), highlighting the strong generalization capabilities of the reasoning generation and the benefit of explicit reasoning training. This work underscores the potential of LLM-generated reasonings for creating richer training datasets, thereby improving the performance of diverse downstream NLP tasks and providing explicit explanations.

Related papers

ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning [12.83211408922535]
Reinforcement learning-style post-training improves reasoning by optimizing model outputs based on reward or preference signals.<n> GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier.<n>We propose $textbfSelf-Explanation Policy Optimization (ExPO)$-a simple and modular framework that generates such samples by conditioning on the ground-truth answer.
arXiv Detail & Related papers (2025-07-03T17:44:55Z)
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.<n>Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
RDBE: Reasoning Distillation-Based Evaluation Enhances Automatic Essay Scoring [0.0]
Reasoning Distillation-Based Evaluation (RDBE) integrates interpretability to elucidate the rationale behind model scores. Our experimental results demonstrate the efficacy of RDBE across all scoring rubrics considered in the dataset.
arXiv Detail & Related papers (2024-07-03T05:49:01Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
Improving Language Model Reasoning with Self-motivated Learning [60.779625789039486]
textitSelf-motivated Learning framework motivates the model itself to automatically generate rationales on existing datasets. We train a reward model with the rank to evaluate the quality of rationales, and improve the performance of reasoning through reinforcement learning.
arXiv Detail & Related papers (2024-04-10T14:05:44Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning [25.496627355906966]
We develop three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus" and "LogiQAv2-plus"<n>Experiments show that these simple augmentations greatly hinder the models' performance.<n>Applying logic-driven data augmentation for fine-tuning and prompting can enhance generalisation in both discriminative and generative models.
arXiv Detail & Related papers (2023-10-13T22:29:15Z)
Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems. LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning. We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning technique to assess the performance of model.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
Generation-driven Contrastive Self-training for Zero-shot Text Classification with Instruction-following LLM [31.25193238045053]
We introduce a novel method, namely GenCo, which leverages the strong generative power of large language models to assist in training a smaller language model. In our method, an LLM plays an important role in the self-training loop of a smaller model in two important ways. It helps crafting additional high-quality training pairs, by rewriting input texts conditioned on predicted labels.
arXiv Detail & Related papers (2023-04-24T07:35:38Z)
ALEX: Active Learning based Enhancement of a Model's Explainability [34.26945469627691]
An active learning (AL) algorithm seeks to construct an effective classifier with a minimal number of labeled examples in a bootstrapping manner. In the era of data-driven learning, this is an important research direction to pursue. This paper describes our work-in-progress towards developing an AL selection function that in addition to model effectiveness also seeks to improve on the interpretability of a model during the bootstrapping steps.
arXiv Detail & Related papers (2020-09-02T07:15:39Z)
Interpretable Learning-to-Rank with Generalized Additive Models [78.42800966500374]
Interpretability of learning-to-rank models is a crucial yet relatively under-examined research area. Recent progress on interpretable ranking models largely focuses on generating post-hoc explanations for existing black-box ranking models. We lay the groundwork for intrinsically interpretable learning-to-rank by introducing generalized additive models (GAMs) into ranking tasks.
arXiv Detail & Related papers (2020-05-06T01:51:30Z)
ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning [85.33459673197149]
We introduce a new Reading dataset requiring logical reasoning (ReClor) extracted from standardized graduate admission examinations. In this paper, we propose to identify biased data points and separate them into EASY set and the rest as HARD set. Empirical results show that state-of-the-art models have an outstanding ability to capture biases contained in the dataset with high accuracy on EASY set. However, they struggle on HARD set with poor performance near that of random guess, indicating more research is needed to essentially enhance the logical reasoning ability of current models.
arXiv Detail & Related papers (2020-02-11T11:54:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.