Fusing Bidirectional Chains of Thought and Reward Mechanisms: A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage
- URL: http://arxiv.org/abs/2505.08167v2
- Date: Wed, 14 May 2025 01:35:33 GMT
- Title: Fusing Bidirectional Chains of Thought and Reward Mechanisms: A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage
- Authors: Ruilin Liu, Zhixiao Zhao, Jieqiong Li, Chang Liu, Dongbo Wang
- Abstract summary: We propose a novel training method that integrates bidirectional chains of thought and a reward mechanism. This method is built upon ICH-Qwen, a large language model specifically designed for the field of intangible cultural heritage.
- Score: 3.7756107931620666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development of large language models (LLMs) has provided significant support and opportunities for the advancement of domain-specific LLMs. However, fine-tuning these large models on Intangible Cultural Heritage (ICH) data inevitably faces challenges such as bias, incorrect knowledge inheritance, and catastrophic forgetting. To address these issues, we propose a novel training method that integrates bidirectional chains of thought with a reward mechanism. The method is built upon ICH-Qwen, a large language model designed specifically for the intangible cultural heritage domain. It enables the model not only to perform forward reasoning but also to improve the accuracy of generated answers by using reverse questioning and reverse reasoning to activate the model's latent knowledge. In addition, a reward mechanism is introduced during training to optimize the decision-making process; it improves the quality of the model's outputs through structural and content evaluations combined under different weighting schemes. Comparative experiments on ICH-Qwen show that our method outperforms 0-shot, step-by-step reasoning, knowledge distillation, and question-augmentation baselines in accuracy, BLEU-4, and ROUGE-L on the question-answering task. Ablation experiments further confirm the effectiveness of combining the bidirectional chains of thought with the reward mechanism. Finally, a series of generalizability experiments shows that the proposed method also yields improvements on domain-specific datasets and advanced models in areas such as Finance, Wikidata, and StrategyQA, demonstrating that the method adapts to multiple domains and offers a valuable approach for model training in future applications across diverse fields.
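The abstract does not spell out the reward formula or the prompting templates, so the following minimal Python sketch only illustrates one way the two ideas could fit together. The `ask_model`, `structure_score`, and `content_score` helpers, the prompt wording, and the 0.4/0.6 weights are all assumptions, not details from the paper.

```python
# Minimal sketch (not the authors' implementation): combine forward reasoning,
# reverse questioning/reasoning, and a weighted structural/content reward.

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to ICH-Qwen or any other LLM."""
    return "model output for: " + prompt

def bidirectional_answer(question: str) -> str:
    # Forward chain of thought: reason from the question to a draft answer.
    forward = ask_model(f"Question: {question}\nThink step by step, then answer.")
    # Reverse questioning: derive a question back from the draft answer,
    # intended to surface latent knowledge and expose inconsistencies.
    reverse_q = ask_model(f"Given this answer, state the question it answers:\n{forward}")
    # Reverse reasoning: check the draft answer against the reconstructed question.
    revised = ask_model(
        f"Original question: {question}\nReconstructed question: {reverse_q}\n"
        f"Draft answer: {forward}\nRevise the answer if the two questions disagree."
    )
    return revised

def structure_score(answer: str) -> float:
    # Toy proxy: reward non-empty, multi-sentence answers (in [0, 1]).
    return min(answer.count(".") / 3.0, 1.0)

def content_score(answer: str, reference: str) -> float:
    # Toy proxy: unigram overlap with a reference answer (in [0, 1]).
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / max(len(r), 1)

def reward(answer: str, reference: str, w_structure: float = 0.4, w_content: float = 0.6) -> float:
    """Weighted sum of a structural score and a content score (weights are illustrative)."""
    return w_structure * structure_score(answer) + w_content * content_score(answer, reference)

if __name__ == "__main__":
    ans = bidirectional_answer("What is Kunqu opera?")
    print(reward(ans, "Kunqu opera is a traditional Chinese opera form and an item of intangible cultural heritage."))
```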
Related papers
- Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability [16.441081996257576]
We propose leveraging reasoning-intensive models to improve less computationally demanding, non-reasoning models. We demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.
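The summary suggests a distillation-style transfer in which answers from a reasoning-heavy model become supervision for a cheaper model. A minimal sketch of that data-construction step follows; the `reasoning_model` stub and the `<think>` delimiter are assumptions, not details from the paper.

```python
# Sketch: turn reasoning-model outputs into (prompt, completion) pairs for
# supervised fine-tuning of a non-reasoning model. The trace format is assumed.

def reasoning_model(question: str) -> str:
    # Hypothetical reasoning model that emits a trace followed by the final answer.
    return f"<think>work through {question} step by step</think>42"

def strip_reasoning(output: str) -> str:
    """Keep only the final answer so the student learns to answer directly."""
    if "</think>" in output:
        return output.split("</think>", 1)[1].strip()
    return output.strip()

def build_sft_pairs(questions):
    return [{"prompt": q, "completion": strip_reasoning(reasoning_model(q))}
            for q in questions]

print(build_sft_pairs(["What is 6 * 7?"]))
```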
arXiv Detail & Related papers (2025-04-13T16:26:56Z)
- Domain Guidance: A Simple Transfer Approach for a Pre-trained Diffusion Model [62.11981915549919]
Domain Guidance is a transfer approach that leverages pre-trained knowledge to guide the sampling process toward the target domain. We demonstrate its substantial effectiveness across various transfer benchmarks, achieving over a 19.6% improvement in FID and a 23.4% improvement in FD$_{\text{DINOv2}}$ compared to standard fine-tuning.
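The summary only says that pre-trained knowledge guides sampling toward the target domain. One classifier-free-guidance-style reading (an assumption for illustration, not the paper's stated formulation) extrapolates from the pre-trained model's noise prediction toward the fine-tuned one:

```python
import numpy as np

def domain_guided_eps(eps_pretrained: np.ndarray, eps_finetuned: np.ndarray, w: float = 2.0) -> np.ndarray:
    """Guidance-style extrapolation toward the target domain.

    w = 1 recovers the fine-tuned prediction; w > 1 pushes samples further
    along the target-domain direction. This form is assumed, not quoted.
    """
    return eps_pretrained + w * (eps_finetuned - eps_pretrained)

eps_pre = np.zeros(4)
eps_ft = np.ones(4)
print(domain_guided_eps(eps_pre, eps_ft))  # [2. 2. 2. 2.] with w = 2
```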
arXiv Detail & Related papers (2025-04-02T09:07:55Z)
- Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique [66.94905631175209]
We propose a novel inference-time scaling approach, stepwise natural language self-critique (PANEL). It employs self-generated natural language critiques as feedback to guide the step-level search process. This approach bypasses the need for task-specific verifiers and the associated training overhead.
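A minimal sketch of step-level search guided by self-critique is below; `propose_step`, `self_critique`, and the 0-10 scoring convention are hypothetical stand-ins, not the paper's prompts.

```python
# Sketch: step-level search where the model critiques its own candidate steps
# in natural language and the most favourably critiqued step is kept.

import re

def propose_step(state: str, k: int = 3) -> list[str]:
    return [f"{state} -> candidate step {i}" for i in range(k)]

def self_critique(state: str, step: str) -> str:
    # The same model critiques its own candidate step (stubbed here).
    return f"Critique of '{step}': mostly sound. Score: {len(step) % 11}/10"

def critique_score(critique: str) -> int:
    match = re.search(r"Score:\s*(\d+)/10", critique)
    return int(match.group(1)) if match else 0

def solve(question: str, depth: int = 3) -> str:
    state = question
    for _ in range(depth):
        candidates = propose_step(state)
        # Keep the step whose self-critique is most favourable.
        state = max(candidates, key=lambda s: critique_score(self_critique(state, s)))
    return state

print(solve("Prove that the sum of two even numbers is even."))
```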
arXiv Detail & Related papers (2025-03-21T17:59:55Z)
- Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition [86.21199607040147]
Self-Improving Cognition (SIcog) is a self-learning framework for constructing next-generation foundation multimodal large language models (MLLMs). We introduce Chain-of-Description, a step-by-step visual understanding method, and integrate structured chain-of-thought (CoT) reasoning to support in-depth multimodal reasoning. Extensive experiments demonstrate that SIcog produces next-generation foundation MLLMs with substantially improved multimodal cognition.
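The summary names Chain-of-Description and structured CoT but gives no templates. The sketch below shows one plausible prompt layout for producing self-training examples; the `mllm` call and all template wording are assumptions.

```python
# Sketch: a self-generated multimodal training example built from a
# Chain-of-Description prompt followed by structured CoT reasoning.

COD_PROMPT = (
    "Describe the image step by step: 1) salient objects, 2) fine-grained details, "
    "3) relations between objects, 4) relevant background knowledge."
)
COT_PROMPT = "Using the description above, answer the question with structured step-by-step reasoning."

def mllm(image_path: str, prompt: str) -> str:
    # Hypothetical multimodal model call.
    return f"[model output for {image_path} given: {prompt[:40]}...]"

def make_self_training_example(image_path: str, question: str) -> dict:
    description = mllm(image_path, COD_PROMPT)
    answer = mllm(image_path, f"{description}\n{question}\n{COT_PROMPT}")
    # The (image, question, description, answer) tuple becomes new training data.
    return {"image": image_path, "question": question,
            "description": description, "answer": answer}

print(make_self_training_example("temple.jpg", "Which dynasty does this architecture suggest?"))
```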
arXiv Detail & Related papers (2025-03-16T00:25:13Z)
- Iterative Deepening Sampling for Large Language Models [27.807695570974644]
Training models to achieve effective self-correction remains a significant challenge. We propose a novel iterative sampling framework designed to enhance self-correction and generate higher-quality samples.
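A generic iterative-deepening sampling loop is sketched below; the budget schedule and the `looks_correct` self-check are assumptions, not the paper's algorithm.

```python
# Sketch: sample with a progressively larger budget until a self-check accepts a draft.

import random

def generate(question: str, temperature: float) -> str:
    # Hypothetical sampler; higher temperature yields more diverse drafts.
    return f"answer({random.random():.3f}) to {question}"

def looks_correct(question: str, answer: str) -> bool:
    # Hypothetical self-check, e.g. the model verifying its own answer.
    return random.random() > 0.7

def iterative_deepening_sample(question: str, budgets=(1, 2, 4, 8)) -> str:
    best = None
    for budget in budgets:                       # progressively deeper sampling
        samples = [generate(question, 0.8) for _ in range(budget)]
        accepted = [s for s in samples if looks_correct(question, s)]
        if accepted:
            return accepted[0]
        best = samples[-1]
    return best                                  # fall back if no sample passes the check

print(iterative_deepening_sample("What is the capital of Jiangsu?"))
```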
arXiv Detail & Related papers (2025-02-08T04:39:51Z)
- Multimodal Magic: Elevating Depression Detection with a Fusion of Text and Audio Intelligence [4.92323103166693]
This study proposes an innovative multimodal fusion model based on a teacher-student architecture to enhance the accuracy of depression classification. Our model addresses the limitations of traditional methods in feature fusion and modality weight allocation by introducing multi-head attention mechanisms and weighted multimodal transfer learning. Ablation experiments demonstrate that the proposed model attains an F1 score of 99.1% on the test set, significantly outperforming unimodal and conventional approaches.
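One plausible reading of the described fusion module is multi-head cross-attention between text and audio features, sketched below in PyTorch. The dimensions, pooling, and classifier head are assumptions, and the teacher-student component is omitted.

```python
# Sketch: fusing text and audio features with multi-head cross-attention.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, num_classes: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # Text tokens attend to audio frames; the attended features are pooled and classified.
        fused, _ = self.attn(query=text_feats, key=audio_feats, value=audio_feats)
        return self.classifier(fused.mean(dim=1))

model = CrossModalFusion()
text = torch.randn(8, 32, 256)     # (batch, text tokens, dim)
audio = torch.randn(8, 100, 256)   # (batch, audio frames, dim)
print(model(text, audio).shape)    # torch.Size([8, 2])
```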
arXiv Detail & Related papers (2025-01-28T09:30:29Z)
- B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners [18.960920426485163]
Self-improvement has emerged as a primary method for enhancing performance. We identify two pivotal factors in this iterative process and propose methods to monitor them. We introduce B-STaR, a Self-Taught Reasoning framework that adjusts configurations across iterations to balance exploration and exploitation.
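The summary does not say which configurations are adjusted or how. The sketch below assumes sampling temperature as the knob and diversity/quality metrics as the monitored signals, purely for illustration.

```python
# Sketch: adjust a sampling configuration across self-improvement iterations
# based on monitored exploration (diversity) and exploitation (quality) signals.

def adjust_temperature(temperature: float, diversity: float, quality: float) -> float:
    if diversity < 0.3:          # generations are collapsing: explore more
        temperature += 0.1
    elif quality < 0.5:          # too many low-reward samples: exploit more
        temperature -= 0.1
    return min(max(temperature, 0.2), 1.2)

temp = 0.7
for iteration, (div, qual) in enumerate([(0.6, 0.4), (0.25, 0.7), (0.5, 0.6)]):
    temp = adjust_temperature(temp, div, qual)
    print(f"iteration {iteration}: temperature -> {temp:.1f}")
```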
arXiv Detail & Related papers (2024-12-23T03:58:34Z)
- Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training. We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
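A minimal sketch of the "model as its own verifier" idea in its SFT flavour: sample several responses, keep the one the model itself scores highest, and use it as training data. The `generate` and `self_score` helpers are hypothetical.

```python
# Sketch: self-improvement via "sharpening": the model verifies its own samples
# and the best-scoring sample becomes supervised fine-tuning data.

import random

def generate(prompt: str) -> str:
    return f"response {random.randint(0, 99)} to {prompt}"

def self_score(prompt: str, response: str) -> float:
    # The same model acts as a verifier of response quality (assumed to return [0, 1]).
    return random.random()

def sharpen(prompt: str, n: int = 8) -> dict:
    samples = [generate(prompt) for _ in range(n)]
    best = max(samples, key=lambda r: self_score(prompt, r))
    # (prompt, best) pairs are then used for supervised fine-tuning.
    return {"prompt": prompt, "completion": best}

print(sharpen("Explain why the sky is blue."))
```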
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
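The quality-control step can be pictured as execution-based filtering of question-code pairs, sketched below. This reduces the paper's code-based critic to a simple "run the program and compare with the reference answer" check, which is an assumption made for illustration.

```python
# Sketch: execution-based quality control for question-code training pairs.

def run_solution(code: str) -> str:
    namespace: dict = {}
    exec(code, namespace)              # trusted toy example only; sandbox in practice
    return str(namespace.get("answer"))

def keep_pair(question: str, code: str, reference: str) -> bool:
    try:
        return run_solution(code) == reference
    except Exception:
        return False                   # non-running code is filtered out

candidate = {"question": "What is 12 * 13?", "code": "answer = 12 * 13", "reference": "156"}
print(keep_pair(**candidate))          # True: the pair passes quality control
```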
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Enhancing Fairness and Performance in Machine Learning Models: A Multi-Task Learning Approach with Monte-Carlo Dropout and Pareto Optimality [1.5498930424110338]
This study introduces an approach to mitigate bias in machine learning by leveraging model uncertainty.
Our approach utilizes a multi-task learning (MTL) framework combined with Monte Carlo (MC) Dropout to assess and mitigate uncertainty in predictions related to protected labels.
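Monte Carlo Dropout uncertainty estimation itself is standard and sketched below in PyTorch; the fairness-specific multi-task heads and the Pareto-optimality step are omitted, and the toy architecture is an assumption.

```python
# Sketch: Monte Carlo Dropout predictive mean and variance (uncertainty).

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 2))

def mc_dropout_predict(x: torch.Tensor, passes: int = 20):
    model.train()                      # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(passes)])
    return probs.mean(0), probs.var(0) # predictive mean and per-class variance

x = torch.randn(4, 16)
mean, var = mc_dropout_predict(x)
print(mean.shape, var.shape)           # torch.Size([4, 2]) torch.Size([4, 2])
```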
arXiv Detail & Related papers (2024-04-12T04:17:50Z)
- Out-of-Distribution Detection via Deep Multi-Comprehension Ensemble [11.542472900306745]
Multi-Comprehension (MC) Ensemble is proposed as a strategy to augment the Out-of-Distribution (OOD) feature representation field.
Our experimental results demonstrate the superior performance of the MC Ensemble strategy in OOD detection.
This underscores the effectiveness of our proposed approach in enhancing the model's capability to detect instances outside its training distribution.
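A generic ensemble OOD score is sketched below. The paper builds its ensemble from models trained with different "comprehensions" (training objectives); here the ensemble members and the mean max-softmax score are generic stand-ins, not the paper's construction.

```python
# Sketch: ensemble-based OOD scoring via mean maximum softmax probability.

import torch
import torch.nn as nn

members = [nn.Linear(16, 10) for _ in range(3)]  # stand-in ensemble members

def ood_score(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        confidences = torch.stack(
            [torch.softmax(m(x), dim=-1).max(dim=-1).values for m in members]
        )
    # Low mean maximum softmax probability across members -> likely out-of-distribution.
    return 1.0 - confidences.mean(dim=0)

print(ood_score(torch.randn(5, 16)))   # higher values flag candidate OOD inputs
```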
arXiv Detail & Related papers (2024-03-24T18:43:04Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
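For context, the standard pairwise (Bradley-Terry style) loss that preference-based reward modeling builds on is sketched below; the paper's noise-mitigation and contrastive additions are not reproduced here.

```python
# Sketch: pairwise reward-model loss, -log sigmoid(r_chosen - r_rejected).

import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Encourage chosen responses to out-score rejected ones.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.1, 0.5, 0.7])
print(pairwise_reward_loss(r_chosen, r_rejected))
```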
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
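A generic elliptical UCB exploration bonus on a learned state-action embedding is sketched below; the paper derives its bonus from kernel embeddings of latent variable models, so this simplified form is an assumption for illustration only.

```python
# Sketch: optimistic value estimate with an elliptical UCB bonus on phi(s, a).

import numpy as np

def ucb_value(q_estimate: float, phi: np.ndarray, cov: np.ndarray, beta: float = 1.0) -> float:
    # Bonus beta * sqrt(phi^T cov^{-1} phi) measures how novel the embedding is
    # relative to the covariance accumulated from observed state-action embeddings.
    bonus = beta * float(np.sqrt(phi @ np.linalg.solve(cov, phi)))
    return q_estimate + bonus

phi = np.array([0.2, 0.5, 0.1])
cov = np.eye(3) * 4.0
print(ucb_value(q_estimate=1.0, phi=phi, cov=cov))
```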
arXiv Detail & Related papers (2022-12-17T00:26:31Z)