Related papers: An Analysis for Reasoning Bias of Language Models with Small Initialization

URL: http://arxiv.org/abs/2502.04375v1
Date: Wed, 05 Feb 2025 15:23:26 GMT
Title: An Analysis for Reasoning Bias of Language Models with Small Initialization
Authors: Junjie Yao, Zhongwang Zhang, Zhi-Qin John Xu
Abstract summary: Large Language Models (LLMs) have revolutionized Natural Language Processing by demonstrating exceptional performance across diverse tasks. This study investigates the impact of the parameter initialization scale on the training behavior and task preferences of LLMs.
Score: 8.380004565348619
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformer-based Large Language Models (LLMs) have revolutionized Natural Language Processing by demonstrating exceptional performance across diverse tasks. This study investigates the impact of the parameter initialization scale on the training behavior and task preferences of LLMs. We discover that smaller initialization scales encourage models to favor reasoning tasks, whereas larger initialization scales lead to a preference for memorization tasks. We validate this reasoning bias via real datasets and meticulously designed anchor functions. Further analysis of initial training dynamics suggests that specific model components, particularly the embedding space and self-attention mechanisms, play pivotal roles in shaping these learning biases. We provide a theoretical framework from the perspective of model training dynamics to explain these phenomena. Additionally, experiments on real-world language tasks corroborate our theoretical insights. This work enhances our understanding of how initialization strategies influence LLM performance on reasoning tasks and offers valuable guidelines for training models.

Related papers

Fundamental Reasoning Paradigms Induce Out-of-Domain Generalization in Language Models [43.76842321707181]
In this study, we shed light on how the interplay between these core paradigms influences Large Language Model (LLM) reasoning. We first collect a new dataset of reasoning trajectories from symbolic tasks, each targeting one of the three fundamental paradigms. We then investigate effective ways for inducing these skills into LLMs.
arXiv Detail & Related papers (2026-02-09T13:51:48Z)
How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z)
Estimating the Effects of Sample Training Orders for Large Language Models without Retraining [49.59675538160363]
The order of training samples plays a crucial role in large language models (LLMs). Traditional methods for investigating this effect generally require retraining the model with various sample orders. We improve traditional methods by designing a retraining-free framework.
arXiv Detail & Related papers (2025-05-28T07:07:02Z)
LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z)
Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning [39.827685159198296]
Catastrophic forgetting (CF) poses a significant challenge in machine learning, where a model forgets previously learned information upon learning new tasks. Our study explores CF across various settings, discovering that model forgetting is influenced by both the specific training tasks and the models themselves. We propose a novel function vector (FV) guided training methodology, incorporating a regularization technique to stabilize the FV and mitigate forgetting.
arXiv Detail & Related papers (2025-02-16T07:06:17Z)
The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities [51.594836904623534]
We investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. We show that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve.
arXiv Detail & Related papers (2025-01-15T10:57:55Z)
Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs). In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt. Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead.
arXiv Detail & Related papers (2024-10-17T17:16:00Z)
Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency. We show that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
Towards Optimal Learning of Language Models [124.65669486710992]
We present a theory for the optimal learning of language models (LMs). We derive a theorem, named Learning Law, to reveal the properties of the dynamics in the optimal learning process under our objective. We empirically verify that the optimal learning of LMs essentially stems from the improvement of the coefficients in the scaling law of LMs.
arXiv Detail & Related papers (2024-02-27T18:52:19Z)
Transformer-based Causal Language Models Perform Clustering [20.430255724239448]
We introduce a simplified instruction-following task and use synthetic datasets to analyze a Transformer-based causal language model. Our findings suggest that the model learns task-specific information by clustering data within its hidden space, with this clustering process evolving dynamically during learning.
arXiv Detail & Related papers (2024-02-19T14:02:31Z)
Concept-aware Training Improves In-context Learning Ability of Language Models [0.0]
Many recent language models (LMs) of the Transformer family exhibit so-called in-context learning (ICL) ability. We propose a method to create LMs able to better utilize the in-context information. We find that the data sampling of Concept-aware Training consistently improves models' reasoning ability.
arXiv Detail & Related papers (2023-05-23T07:44:52Z)
Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models [96.9373147383119]
We show that weight disentanglement is the crucial factor that makes task arithmetic effective. We show that fine-tuning models in their tangent space by linearizing them amplifies weight disentanglement. This leads to substantial performance improvements across task arithmetic benchmarks and diverse models.
arXiv Detail & Related papers (2023-05-22T08:39:25Z)
Post Hoc Explanations of Language Models Can Improve Language Models [43.2109029463221]
We present a novel framework, Amplifying Model Performance by Leveraging In-Context Learning with Post Hoc Explanations (AMPLIFY). We leverage post hoc explanation methods which output attribution scores (explanations) capturing the influence of each of the input features on model predictions. Our framework, AMPLIFY, leads to prediction accuracy improvements of about 10-25% over a wide range of tasks.
arXiv Detail & Related papers (2023-05-19T04:46:04Z)
Competence-Based Analysis of Language Models [21.43498764977656]
CALM (Competence-based Analysis of Language Models) is designed to investigate LLM competence in the context of specific tasks. We develop a new approach for performing causal probing interventions using gradient-based adversarial attacks. We carry out a case study of CALM using these interventions to analyze and compare LLM competence across a variety of lexical inference tasks.
arXiv Detail & Related papers (2023-03-01T08:53:36Z)
LMPriors: Pre-Trained Language Models as Task-Specific Priors [78.97143833642971]
We develop principled techniques for augmenting our models with suitable priors. This is to encourage them to learn in ways that are compatible with our understanding of the world. We draw inspiration from the recent successes of large-scale language models (LMs) to construct task-specific priors distilled from the rich knowledge of LMs.
arXiv Detail & Related papers (2022-10-22T19:09:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.