Chain-of-Model Learning for Language Model
- URL: http://arxiv.org/abs/2505.11820v2
- Date: Fri, 23 May 2025 08:12:10 GMT
- Title: Chain-of-Model Learning for Language Model
- Authors: Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu,
- Abstract summary: We propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer as a chain style.<n>We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level.<n>Based on CoLM, we further introduce CoLM-Air by introducing a KV sharing mechanism, that computes all keys and values within the first chain and then shares across all chains.
- Score: 91.81240728426994
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer as a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, the model built upon CoM framework can progressively scale up the model size by increasing the chains based on the previous models (i.e., chains), and offer multiple sub-models at varying sizes for elastic inference by using different chain numbers. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of Transformer architecture. Based on CoLM, we further introduce CoLM-Air by introducing a KV sharing mechanism, that computes all keys and values within the first chain and then shares across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration and so on. Experimental results demonstrate our CoLM family can achieve comparable performance to the standard Transformer, while simultaneously enabling greater flexiblity, such as progressive scaling to improve training efficiency and offer multiple varying model sizes for elastic inference, paving a a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.
Related papers
- Dynamic Chunking for End-to-End Hierarchical Sequence Modeling [17.277753030570263]
We introduce techniques that enable a dynamic chunking mechanism which automatically learns content- and context- dependent segmentation strategies.<n> incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end.<n>Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction.<n>H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without anys or explicit supervision.
arXiv Detail & Related papers (2025-07-10T17:39:37Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs)
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Sparse Concept Bottleneck Models: Gumbel Tricks in Contrastive Learning [86.15009879251386]
We propose a novel architecture and method of explainable classification with Concept Bottleneck Models (CBM)
CBMs require an additional set of concepts to leverage.
We show a significant increase in accuracy using sparse hidden layers in CLIP-based bottleneck models.
arXiv Detail & Related papers (2024-04-04T09:43:43Z) - Training-Free Pretrained Model Merging [38.16269074353077]
We propose an innovative model merging framework, coined as merging under dual-space constraints (MuDSC)
In order to enhance usability, we have also incorporated adaptations for group structure, including Multi-Head Attention and Group Normalization.
arXiv Detail & Related papers (2024-03-04T06:19:27Z) - FedGuCci: Making Local Models More Connected in Landscape for Federated Learning [22.524854255672256]
Federated learning (FL) involves multiple clients collaboratively training a global model via iterative local updates and model fusion.<n>In this paper, we study and improve FL's generalization through a fundamental connectivity'' perspective.<n>We propose FedGuCci, improving group connectivity for better generalization.
arXiv Detail & Related papers (2024-02-29T08:27:01Z) - Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large
Language Models for Dynamic Inference [32.62084449979531]
We extend SortedNet to generative NLP tasks by replacing Standard Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT)
Our approach boosts model efficiency, eliminating the need for multiple models for various scenarios during inference.
Our results show the superior performance of sub-models in comparison to Standard Fine-Tuning and SFT+ICT (Early-Exit)
arXiv Detail & Related papers (2023-09-16T11:58:34Z) - An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z) - AI Chains: Transparent and Controllable Human-AI Interaction by Chaining
Large Language Model Prompts [12.73129785710807]
We introduce the concept of Chaining LLM steps together, where the output of one step becomes the input for the next, thus aggregating the gains per step.
In a 20-person user study, we found that Chaining not only improved the quality of task outcomes, but also significantly enhanced system transparency, controllability, and sense of collaboration.
arXiv Detail & Related papers (2021-10-04T19:59:38Z) - AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.