How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation
- URL: http://arxiv.org/abs/2505.18522v1
- Date: Sat, 24 May 2025 05:40:03 GMT
- Title: How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation
- Authors: Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu
- Abstract summary: This work focuses on the impact of sequence modeling architectures on base capabilities. We first point out that the mixed domain pre-training setting fails to adequately reveal the differences in base capabilities among various architectures. Next, we analyze the base capabilities of stateful sequence modeling architectures and find that they exhibit significant degradation in base capabilities compared to the Transformer.
- Score: 37.57021686999279
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models represented by the Transformer have been proven to possess strong base capabilities, and the representative self-attention mechanism in the Transformer has become a classic in sequence modeling architectures. Different from works that propose sequence modeling architectures to improve the efficiency of the attention mechanism, this work focuses on the impact of sequence modeling architectures on base capabilities. Specifically, our concern is: how exactly do sequence modeling architectures affect the base capabilities of pre-trained language models? In this work, we first point out that the mixed domain pre-training setting commonly adopted in existing architecture design works fails to adequately reveal the differences in base capabilities among various architectures. To address this, we propose a limited domain pre-training setting with out-of-distribution testing, which successfully uncovers significant differences in base capabilities among architectures at an early stage. Next, we analyze the base capabilities of stateful sequence modeling architectures and find that they exhibit significant degradation compared to the Transformer. Then, through a series of architecture component analyses, we summarize a key architecture design principle: a sequence modeling architecture needs to possess full-sequence arbitrary selection capability to avoid degradation in base capabilities. Finally, we empirically validate this principle using an extremely simple Top-1 element selection architecture and further generalize it to a more practical Top-1 chunk selection architecture. Experimental results support our proposed sequence modeling architecture design principle and suggest that our work can serve as a valuable reference for future architecture improvements and novel designs.
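To make the stated design principle concrete, below is a minimal, hypothetical NumPy sketch of a causal Top-1 element selection layer. It is not the authors' implementation: the function name, projection shapes, and the hard (non-differentiable) argmax are assumptions made for illustration. At each position, the query scores every element up to that position and copies the value of the single best-scoring element, which is the kind of full-sequence arbitrary selection the abstract argues is necessary; a trainable version, or the Top-1 chunk variant, would require a differentiable relaxation (e.g., straight-through estimation or chunk-level softmax).

```python
# Hypothetical sketch (not the authors' code): causal Top-1 element selection.
import numpy as np

def top1_element_selection(x, W_q, W_k, W_v):
    """x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_head) projections."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v       # project to queries/keys/values
    seq_len = x.shape[0]
    out = np.zeros_like(v)
    for t in range(seq_len):
        scores = k[: t + 1] @ q[t]            # score every position up to t
        best = int(np.argmax(scores))         # hard Top-1 selection over the prefix
        out[t] = v[best]                      # copy the selected element's value
    return out

# Usage example with random weights (shapes are illustrative assumptions)
rng = np.random.default_rng(0)
d_model, d_head, seq_len = 16, 8, 10
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
y = top1_element_selection(x, W_q, W_k, W_v)
print(y.shape)  # (10, 8)
```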
Related papers
- The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting [26.76928230531243]
Transformer-based models have recently become dominant in Long-term Time Series Forecasting (LTSF). Variations in their architecture, such as encoder-only, encoder-decoder, and decoder-only designs, raise a crucial question: what Transformer architecture works best for LTSF tasks? Existing models are often tightly coupled with various time-series-specific designs, making it difficult to isolate the impact of the architecture itself. We propose a novel taxonomy that disentangles these designs, enabling clearer and more unified comparisons of Transformer architectures.
arXiv Detail & Related papers (2025-07-17T12:16:04Z)
- A unified framework on the universal approximation of transformer-type architectures [16.762119652883204]
We investigate the universal approximation property (UAP) of transformer-type architectures. Our work identifies token distinguishability as a fundamental requirement for UAP. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms.
arXiv Detail & Related papers (2025-06-30T06:50:39Z)
- AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation [48.82264764771652]
We introduce AsCAN, a hybrid architecture that combines convolutional and transformer blocks.
AsCAN supports a variety of tasks: recognition, segmentation, and class-conditional image generation.
We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance.
arXiv Detail & Related papers (2024-11-07T18:43:17Z)
- Cliqueformer: Model-Based Optimization with Structured Transformers [102.55764949282906]
Large neural networks excel at prediction tasks, but their application to design problems, such as protein engineering or materials discovery, requires solving offline model-based optimization (MBO) problems. We present Cliqueformer, a transformer-based architecture that learns the black-box function's structure through functional graphical models (FGM). Across various domains, including chemical and genetic design tasks, Cliqueformer demonstrates superior performance compared to existing methods.
arXiv Detail & Related papers (2024-10-17T00:35:47Z)
- Exploring the design space of deep-learning-based weather forecasting systems [56.129148006412855]
This paper systematically analyzes the impact of different design choices on deep-learning-based weather forecasting systems.
We study fixed-grid architectures such as UNet, fully convolutional architectures, and transformer-based models.
We propose a hybrid system that combines the strong performance of fixed-grid models with the flexibility of grid-invariant architectures.
arXiv Detail & Related papers (2024-10-09T22:25:50Z)
- How does Architecture Influence the Base Capabilities of Pre-trained Language Models? A Case Study Based on FFN-Wider and MoE Transformers [29.901110957318924]
This work attempts to explain and reverse the decline in base capabilities caused by the architecture of FFN-Wider Transformers.
We successfully achieved significant improvements in base capabilities on a 14B parameter MoE model.
arXiv Detail & Related papers (2024-03-04T19:33:39Z)
- Hysteretic Behavior Simulation Based on Pyramid Neural Network: Principle, Network Architecture, Case Study and Explanation [0.0]
A surrogate model based on neural networks shows significant potential in balancing efficiency and accuracy.
Its serial information flow and prediction based on single-level features adversely affect the network performance.
A weighted stacked pyramid neural network architecture is proposed herein.
arXiv Detail & Related papers (2022-04-29T16:42:00Z)
- Rethinking Architecture Selection in Differentiable NAS [74.61723678821049]
Differentiable Neural Architecture Search (DARTS) is one of the most popular NAS methods for its search efficiency and simplicity.
We propose an alternative perturbation-based architecture selection that directly measures each operation's influence on the supernet.
We find that several failure modes of DARTS can be greatly alleviated with the proposed selection method.
arXiv Detail & Related papers (2021-08-10T00:53:39Z)
- A Semi-Supervised Assessor of Neural Architectures [157.76189339451565]
We employ an auto-encoder to discover meaningful representations of neural architectures.
A graph convolutional neural network is introduced to predict the performance of architectures.
arXiv Detail & Related papers (2020-05-14T09:02:33Z)
- Residual Attention Net for Superior Cross-Domain Time Sequence Modeling [0.0]
This paper serves as a proof-of-concept for a new architecture, with the Residual Attention Net (RAN) aiming to provide the model with a higher-level understanding of sequence patterns.
We achieved 35 state-of-the-art results, with 10 of them matching the current state of the art without further model fine-tuning.
arXiv Detail & Related papers (2020-01-13T06:14:04Z)