An automatically discovered chain-of-thought prompt generalizes to novel
models and datasets
- URL: http://arxiv.org/abs/2305.02897v2
- Date: Thu, 3 Aug 2023 14:33:37 GMT
- Title: An automatically discovered chain-of-thought prompt generalizes to novel
models and datasets
- Authors: Konstantin Hebenstreit, Robert Praas, Louis P Kiesewetter, Matthias
Samwald
- Abstract summary: Chain-of-thought (CoT) reasoning capabilities promise to improve performance and explainability of large language models (LLMs).
We compare different reasoning strategies induced by zero-shot prompting across six recently released LLMs.
Our findings demonstrate that while some variations in effectiveness occur, gains from CoT reasoning strategies remain robust across different models and datasets.
- Score: 4.693905948827508
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emergent chain-of-thought (CoT) reasoning capabilities promise to improve
performance and explainability of large language models (LLMs). However,
uncertainties remain about how reasoning strategies formulated for previous
model generations generalize to new model generations and different datasets.
In this small-scale study, we compare different reasoning strategies induced by
zero-shot prompting across six recently released LLMs (davinci-002,
davinci-003, GPT-3.5-turbo, GPT-4, Flan-T5-xxl and Cohere command-xlarge) on a
mixture of six question-answering datasets, including datasets from scientific
and medical domains. Our findings demonstrate that while some variations in
effectiveness occur, gains from CoT reasoning strategies remain robust across
different models and datasets. GPT-4 has the most benefit from current
state-of-the-art reasoning strategies and exhibits the best performance by
applying a prompt previously discovered through automated discovery.
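The comparison described in the abstract (reasoning strategies induced by zero-shot prompting, evaluated per model and per dataset) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the abstract does not list the study's prompts, so the trigger phrase is the widely used zero-shot CoT phrase "Let's think step by step.", and `build_prompt`, `accuracy`, and `evaluation_grid` are hypothetical helpers, not the authors' code.

```python
from itertools import product

# Illustrative strategies only; the study's actual prompt set is not given in the abstract.
STRATEGIES = {
    "direct": "",                                   # no reasoning trigger
    "zero_shot_cot": "Let's think step by step.",   # common zero-shot CoT trigger
}


def build_prompt(question: str, trigger: str) -> str:
    """Format a question and append a reasoning trigger, if one is given."""
    prompt = f"Q: {question}\nA:"
    return f"{prompt} {trigger}" if trigger else prompt


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold answers."""
    if not gold:
        raise ValueError("gold must be non-empty")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def evaluation_grid(models, datasets, strategies=STRATEGIES):
    """Enumerate every (model, dataset, strategy) cell to be evaluated."""
    return list(product(models, datasets, strategies))
```

Each cell of the grid would be filled by sending the built prompts to the corresponding model and scoring the answers with `accuracy`; the model call itself is left out because it depends on the provider's API.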
Related papers
- ERNIE 5.0 Technical Report [244.36480708815316]
ERNIE 5.0 is a unified autoregressive foundation model for unified multimodal understanding and generation across text, image, video, and audio.
To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm.
We show that ERNIE 5.0 achieves strong and balanced performance across multiple modalities.
arXiv Detail & Related papers (2026-02-04T16:18:15Z) - CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling [60.55856973678002]
Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning.
Existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs.
We propose CALM, a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks.
arXiv Detail & Related papers (2025-10-05T13:38:31Z) - NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning [5.461464418720756]
NanoFlux is a novel adversarial framework for generating targeted training data to improve LLM reasoning.
The framework employs a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge.
Fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains compared to full-benchmark fine-tuning.
arXiv Detail & Related papers (2025-09-27T11:05:46Z) - Evaluating Retrieval-Augmented Generation Strategies for Large Language Models in Travel Mode Choice Prediction [5.638676750474513]
This study explores the potential of Large Language Models (LLMs) as a more flexible and context-aware approach to travel mode choice prediction.
We develop a modular framework for integrating Retrieval-Augmented Generation (RAG) into LLM-based travel mode choice prediction.
Using the 2023 Puget Sound Regional Household Travel Survey data, we conduct a series of experiments to evaluate model performance.
arXiv Detail & Related papers (2025-08-24T21:20:55Z) - EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding [50.29046178980637]
EpiCoDe is a method that boosts model performance in data-scarcity scenarios without extra training.
We show that EpiCoDe consistently outperforms existing methods with significant and robust improvement.
arXiv Detail & Related papers (2025-06-04T02:11:54Z) - Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
Generative retrieval reformulates retrieval as an autoregressive generation task, where large language models generate target documents directly from a query.
We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance.
arXiv Detail & Related papers (2025-03-24T17:59:03Z) - Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models [17.673293240849787]
We introduce SPHERE, a self-evolving data generation pipeline that enhances reasoning in small language models (SLMs).
SPHERE operates in three stages: (i) Self-Generation, where the model autonomously constructs problem-solving steps; (ii) Self-Correction, enabling it to identify and rectify errors; and (iii) Diversity Induction, improving robustness through multiple valid reasoning trajectories.
We show that SPHERE-trained models achieve significant gains over their base versions and match/surpass GPT-4o on certain benchmarks.
arXiv Detail & Related papers (2025-03-04T14:43:25Z) - Evaluating the Effectiveness of XAI Techniques for Encoder-Based Language Models [6.349503549199403]
This study presents a general evaluation framework using four key metrics: Human-reasoning Agreement (HA), Robustness, Consistency, and Contrastivity.
We assess the effectiveness of six explainability techniques from five different XAI categories.
Our findings show that the model simplification-based XAI method (LIME) consistently outperforms across multiple metrics and models.
arXiv Detail & Related papers (2025-01-26T03:08:34Z) - Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data.
Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios.
Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
arXiv Detail & Related papers (2024-12-12T21:29:00Z) - A NotSo Simple Way to Beat Simple Bench [0.0]
This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs).
We propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness.
Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits exploratory creativity but struggles with ambiguous prompts.
arXiv Detail & Related papers (2024-12-12T16:04:31Z) - Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities.
Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z) - REFINE on Scarce Data: Retrieval Enhancement through Fine-Tuning via Model Fusion of Embedding Models [14.023953508288628]
Retrieval augmented generation (RAG) pipelines are commonly used in tasks such as question-answering (QA).
We propose REFINE, a novel technique that generates synthetic data from available documents and then uses a model fusion approach to fine-tune embeddings.
arXiv Detail & Related papers (2024-10-16T08:43:39Z) - SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - Towards Effective and Efficient Continual Pre-training of Large Language Models [163.34610964970258]
Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks.
This paper presents a technical report for continually pre-training Llama-3 (8B).
It significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model.
arXiv Detail & Related papers (2024-07-26T13:55:21Z) - MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs [38.127313175508746]
MathGenie is a novel method for generating diverse and reliable math problems from a small-scale problem-solution dataset.
Various pretrained models, ranging from 7B to 70B, are trained on the newly curated data to test the effectiveness of the proposed augmentation technique.
MathGenieLM-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score among open-source language models.
arXiv Detail & Related papers (2024-02-26T07:17:25Z) - Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Detection [10.301985230669684]
This paper presents a comprehensive analysis of GPT-4, GPT-3.5 Turbo, and FLAN-T5 models in detecting framing in news headlines.
We evaluated these models in various scenarios: zero-shot, few-shot with in-domain examples, cross-domain examples, and settings where models explain their predictions.
arXiv Detail & Related papers (2024-02-18T15:27:48Z) - MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z) - A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models [7.428199805959228]
We show that DINO V2 outperforms other models by a large margin, across various datasets and adaptation methods.
On the other hand, adaptation methods provide little discrepancy in the obtained results, suggesting that a simple linear probing can compete with advanced, more computationally intensive, alternatives.
arXiv Detail & Related papers (2024-01-20T19:50:51Z) - Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective: to promote superior weight sparsity.
Specifically, customized Visual Prompts are mounted to upgrade neural Network sparsification in our proposed VPNs framework.
arXiv Detail & Related papers (2023-12-03T13:50:24Z) - A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models [71.42197262495056]
GPT series models have gained considerable attention due to their exceptional natural language processing capabilities.
We select six representative models, comprising two GPT-3 series models and four GPT-3.5 series models.
We evaluate their performance on nine natural language understanding (NLU) tasks using 21 datasets.
Our experiments reveal that the overall ability of GPT series models on NLU tasks does not increase gradually as the models evolve.
arXiv Detail & Related papers (2023-03-18T14:02:04Z) - S^3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization [104.87483578308526]
We propose the model S3-Rec, which stands for Self-Supervised learning for Sequential Recommendation.
For our task, we devise four auxiliary self-supervised objectives to learn the correlations among attribute, item, subsequence, and sequence.
Extensive experiments conducted on six real-world datasets demonstrate the superiority of our proposed method over existing state-of-the-art methods.
arXiv Detail & Related papers (2020-08-18T11:44:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.