From Words to Amino Acids: Does the Curse of Depth Persist?
- URL: http://arxiv.org/abs/2602.21750v1
- Date: Wed, 25 Feb 2026 10:06:12 GMT
- Title: From Words to Amino Acids: Does the Curse of Depth Persist?
- Authors: Aleena Siji, Amir Mohammad Karimi Mamaghan, Ferdinand Kapl, Tobias Höppe, Emmanouil Angelis, Andrea Dittadi, Maurice Brenner, Michael Heinzinger, Karl Henrik Johansson, Kaitlin Maile, Johannes von Oswald, Stefan Bauer
- Abstract summary: We present a depth analysis of six popular protein language models (PLMs) across model families and scales. We observe consistent depth-dependent patterns that extend prior findings on large language models (LLMs). Our results suggest that PLMs exhibit a form of depth inefficiency, motivating future work on more depth-efficient architectures and training methods.
- Score: 41.90462432884248
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein language models (PLMs) have become widely adopted as general-purpose models, demonstrating strong performance in protein engineering and de novo design. Like large language models (LLMs), they are typically trained as deep transformers with next-token or masked-token prediction objectives on massive sequence corpora and are scaled by increasing model depth. Recent work on autoregressive LLMs has identified the Curse of Depth: later layers contribute little to the final output predictions. These findings naturally raise the question of whether a similar depth inefficiency also appears in PLMs, where many widely used models are not autoregressive, and some are multimodal, accepting both protein sequence and structure as input. In this work, we present a depth analysis of six popular PLMs across model families and scales, spanning three training objectives, namely autoregressive, masked, and diffusion, and quantify how layer contributions evolve with depth using a unified set of probing- and perturbation-based measurements. Across all models, we observe consistent depth-dependent patterns that extend prior findings on LLMs: later layers depend less on earlier computations and mainly refine the final output distribution, and these effects are increasingly pronounced in deeper models. Taken together, our results suggest that PLMs exhibit a form of depth inefficiency, motivating future work on more depth-efficient architectures and training methods.
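The abstract refers to probing- and perturbation-based measurements of layer contributions without spelling out the protocol. As a rough illustration only, below is a minimal sketch of one such perturbation: skip one transformer block at a time and measure how far the masked-token output distribution moves. The toy encoder, vocabulary size, and KL-divergence metric are assumptions made for this sketch, not the paper's actual models or measurements.

```python
# Minimal, self-contained sketch of a layer-skip perturbation measurement.
# NOT the paper's protocol: the toy model, vocabulary size, and KL metric
# are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, D_MODEL, N_LAYERS, SEQ_LEN = 33, 64, 8, 128  # ~20 amino acids + special tokens

class ToyPLM(nn.Module):
    """Encoder-only toy 'protein language model' with a per-position token head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
            for _ in range(N_LAYERS)
        )
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens, skip_layer=None):
        h = self.embed(tokens)
        for i, layer in enumerate(self.layers):
            if i == skip_layer:   # ablate this block: identity pass-through
                continue
            h = layer(h)
        return self.lm_head(h)    # per-position logits over the vocabulary

model = ToyPLM().eval()
tokens = torch.randint(0, VOCAB, (1, SEQ_LEN))

with torch.no_grad():
    ref = F.log_softmax(model(tokens), dim=-1)
    for i in range(N_LAYERS):
        pert = F.log_softmax(model(tokens, skip_layer=i), dim=-1)
        # Average KL(reference || perturbed) per position: how much removing
        # block i shifts the output distribution.
        kl = F.kl_div(pert, ref, log_target=True, reduction="sum") / SEQ_LEN
        print(f"layer {i:2d} skipped -> KL = {kl.item():.4f}")
```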
Related papers
- What Affects the Effective Depth of Large Language Models? [44.85395501835759]
We study how effective depth varies with model scale, training type, and task difficulty. We find that while the number of effective layers grows with model size, the effective depth ratio remains stable. Our results suggest that current LLMs underuse available depth across scales, training paradigms, and tasks of varying difficulty.
arXiv Detail & Related papers (2025-12-16T04:07:17Z)
- How Do LLMs Use Their Depth? [17.148445769990907]
Large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. We propose a "Guess-then-Refine" framework that explains how LLMs internally structure their computations to make predictions.
arXiv Detail & Related papers (2025-10-21T17:59:05Z)
- Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models. We identify the presence of activation outliers, characterized by abnormally large activation values. We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
arXiv Detail & Related papers (2025-08-20T17:59:51Z)
- Do Language Models Use Their Depth Efficiently? [61.0037917291838]
We analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 families of models. We find that layers in the second half contribute much less than those in the first half (a minimal sketch of one way to measure such per-layer contributions appears after this list). For multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults.
arXiv Detail & Related papers (2025-05-20T04:00:56Z)
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- The Curse of Depth in Large Language Models [28.37870372690079]
In large language models, nearly half of the layers are less effective than expected. LayerNorm Scaling (LNS) scales the variance of the layer-normalization output inversely with the square root of its depth (see the sketch after this list). LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance.
arXiv Detail & Related papers (2025-02-09T07:03:36Z)
- DepthART: Monocular Depth Estimation as Autoregressive Refinement Task [2.3884184860468136]
We introduce DepthART, a novel training method formulated as a Depth Autoregressive Refinement Task. By utilizing the model's own predictions as inputs, we frame the objective as residual minimization, effectively reducing the discrepancy between training and inference procedures. When trained on the Hypersim dataset using our approach, the model achieves superior results across multiple unseen benchmarks compared to existing generative and discriminative baselines.
arXiv Detail & Related papers (2024-09-23T13:36:34Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We first investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
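For the residual-stream analysis mentioned in the "Do Language Models Use Their Depth Efficiently?" entry above, the following is a minimal sketch of one way to quantify per-layer contributions. The toy layer stack and the two statistics used here (relative update norm and input/output cosine similarity) are assumptions of this sketch, not that paper's exact methodology.

```python
# Minimal sketch: how much does each block change the residual stream?
# The toy stack and statistics are illustrative assumptions, not the cited
# paper's exact analysis.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
D_MODEL, N_LAYERS, SEQ_LEN = 64, 8, 128

layers = nn.ModuleList(
    nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
    for _ in range(N_LAYERS)
).eval()

h = torch.randn(1, SEQ_LEN, D_MODEL)  # stand-in for the embedded residual stream

with torch.no_grad():
    for i, layer in enumerate(layers):
        h_in = h
        h = layer(h_in)
        update = h - h_in  # what this block adds on top of its input
        rel_norm = (update.norm() / h_in.norm()).item()
        cos = F.cosine_similarity(h.flatten(), h_in.flatten(), dim=0).item()
        # Small relative updates / near-1 cosine similarity indicate a block
        # that barely changes the representation it receives.
        print(f"layer {i:2d}: |dh|/|h_in| = {rel_norm:.3f}, cos(h_in, h_out) = {cos:.3f}")
```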
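For the LayerNorm Scaling (LNS) technique mentioned in the "The Curse of Depth in Large Language Models" entry, here is a minimal sketch of one plausible implementation, in which the LayerNorm output at block l is damped by 1/sqrt(l). The exact formulation and placement (e.g., pre-norm vs. post-norm) should be checked against the cited paper.

```python
# Minimal sketch of a LayerNorm-Scaling-style module. The concrete rule used
# here -- multiplying the LayerNorm output at layer l by 1/sqrt(l) -- is one
# plausible reading of the summary above, not a verified reproduction.
import math
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is damped as a function of layer depth."""
    def __init__(self, d_model: int, layer_index: int):
        super().__init__()
        assert layer_index >= 1, "layer_index is 1-based so the scale is well defined"
        self.norm = nn.LayerNorm(d_model)
        self.scale = 1.0 / math.sqrt(layer_index)  # deeper layers get a smaller scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * self.norm(x)

# Usage: when building a deep transformer, replace the LayerNorms of block l
# with ScaledLayerNorm(d_model, layer_index=l).
x = torch.randn(2, 16, 64)
print(ScaledLayerNorm(64, layer_index=9)(x).std().item())  # roughly 1/3 of plain LayerNorm
```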