ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance
- URL: http://arxiv.org/abs/2504.08716v1
- Date: Fri, 11 Apr 2025 17:29:35 GMT
- Title: ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance
- Authors: Wissam Antoun, Benoît Sagot, Djamé Seddah
- Abstract summary: Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons on a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data.
- Score: 17.306542392779445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT's primary advantage being faster training and inference speed. However, the newly proposed model still provides meaningful architectural improvements compared to earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pre-training data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings show the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.
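Since the reported advantage of ModernBERT in this study is speed rather than accuracy, a quick way to examine that aspect locally is to time both encoders on identical inputs. The sketch below is a minimal illustration using Hugging Face transformers; the model identifiers answerdotai/ModernBERT-base and almanach/camembertav2-base are assumed checkpoint names, not taken from the paper.

```python
# Hypothetical timing comparison of ModernBERT vs. a CamemBERTaV2 (DeBERTaV3-style)
# encoder on the same inputs. Model IDs are assumed Hugging Face checkpoints,
# not specified by the paper itself.
import time
import torch
from transformers import AutoModel, AutoTokenizer

def time_encoder(model_id: str, texts: list[str], n_runs: int = 10) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        model(**batch)  # warm-up pass
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**batch)
    return (time.perf_counter() - start) / n_runs

texts = ["Le modèle encode cette phrase."] * 32
for model_id in ["answerdotai/ModernBERT-base", "almanach/camembertav2-base"]:
    print(model_id, f"{time_encoder(model_id, texts):.3f} s / batch")
```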
Related papers
- Scaling Laws of Global Weather Models [57.27583619011988]
We investigate the relationship between model performance (validation loss) and three key factors: model size, dataset size, and compute budget. Across a range of models, we find that Aurora exhibits the strongest data-scaling behavior. Our compute-optimal analysis indicates that, under fixed compute budgets, allocating resources to longer training durations yields greater performance gains than increasing model size.
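Scaling-law analyses of this kind typically reduce to fitting a power law between loss and each factor. A minimal sketch of such a fit, using made-up (model size, loss) pairs purely for illustration, is:

```python
# Fit a power law L = a * N**(-alpha) by linear regression in log-log space.
# The (model size, loss) pairs are made-up illustrative numbers, not data
# from the paper.
import numpy as np

sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])       # parameter counts N
losses = np.array([0.92, 0.78, 0.66, 0.57, 0.49])  # validation losses L

slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
alpha, a = -slope, np.exp(intercept)
print(f"L ~ {a:.2f} * N^(-{alpha:.3f})")
```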
arXiv Detail & Related papers (2026-02-26T12:57:38Z) - Revisiting the Generic Transformer: Deconstructing a Strong Baseline for Time Series Foundation Models [18.841505010078112]
We investigate the potential of a standard patch Transformer, demonstrating that it achieves state-of-the-art zero-shot forecasting performance. We conduct a comprehensive ablation study that covers model scaling, data composition, and training techniques to isolate the essential ingredients for high performance.
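Patch-based time-series Transformers of this kind tokenize a series by slicing it into fixed-length (often overlapping) windows and projecting each window to a model-dimension token. A minimal sketch of that input representation, with illustrative patch length and stride rather than the paper's settings, is:

```python
# Slice a univariate series into overlapping patches and project each patch
# to a token embedding, the standard input pipeline of patch Transformers.
# Patch length, stride, and embedding size are illustrative choices.
import torch

series = torch.randn(1, 512)                        # (batch, time)
patch_len, stride = 16, 8
patches = series.unfold(dimension=-1, size=patch_len, step=stride)  # (batch, n_patches, patch_len)
tokens = torch.nn.Linear(patch_len, 256)(patches)   # (batch, n_patches, d_model)
print(patches.shape, tokens.shape)
```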
arXiv Detail & Related papers (2026-02-06T18:01:44Z) - Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models [23.832817775138675]
Nanbeige4-3B is a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, these models extend the boundary of the scaling law for small language models.
arXiv Detail & Related papers (2025-12-06T03:36:27Z) - Less is More: AMBER-AFNO -- a New Benchmark for Lightweight 3D Medical Image Segmentation [0.57492870498084]
We adapt AMBER, a transformer-based model originally designed for multiband images, to the task of 3D medical datacube segmentation. AMBER-AFNO achieves competitive or superior accuracy with significant gains in training efficiency, inference speed, and memory usage.
arXiv Detail & Related papers (2025-08-03T22:31:00Z) - The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains [50.66245575710432]
We show that paired preference data consisting of individually weak data points can enable gains beyond the strength of each individual data point. Our work shows that models can learn surprisingly well from paired data that might typically be considered weak.
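Preference tuning on such pairs typically optimizes a pairwise objective over a chosen and a rejected response. Below is a generic DPO-style loss as an illustration of the idea, not necessarily the paper's exact objective; the inputs are assumed to be summed token log-likelihoods under the policy and a frozen reference model.

```python
# Generic DPO-style pairwise preference loss: push the policy to prefer the
# "chosen" response over the "rejected" one relative to a frozen reference.
# Illustrative objective only, not necessarily the paper's exact method.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up sequence log-probabilities (batch of 2).
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -10.0]),
                torch.tensor([-13.0, -9.8]), torch.tensor([-13.5, -9.9]))
print(loss.item())
```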
arXiv Detail & Related papers (2025-07-08T17:14:44Z) - EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding [50.29046178980637]
EpiCoDe is a method that boosts model performance in data-scarcity scenarios without extra training. We show that EpiCoDe consistently outperforms existing methods with significant and robust improvement.
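The title refers to contrastive decoding, which re-ranks next-token candidates by the gap between a stronger and a weaker model at inference time. The sketch below shows that generic step only; EpiCoDe's exact formulation (including its extrapolation component) may differ.

```python
# Generic contrastive decoding: score tokens by the difference between an
# "expert" and an "amateur" model's log-probabilities, restricted to tokens the
# expert finds plausible. Illustrative only; EpiCoDe's formulation may differ.
import torch

def contrastive_decode_logits(expert_logits: torch.Tensor,
                              amateur_logits: torch.Tensor,
                              alpha: float = 0.1) -> torch.Tensor:
    expert_logp = torch.log_softmax(expert_logits, dim=-1)
    amateur_logp = torch.log_softmax(amateur_logits, dim=-1)
    # Plausibility mask: keep tokens within a factor alpha of the expert's best token.
    cutoff = torch.log(torch.tensor(alpha)) + expert_logp.max(dim=-1, keepdim=True).values
    scores = expert_logp - amateur_logp
    return scores.masked_fill(expert_logp < cutoff, float("-inf"))

next_token = contrastive_decode_logits(torch.randn(1, 32000), torch.randn(1, 32000)).argmax(-1)
```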
arXiv Detail & Related papers (2025-06-04T02:11:54Z) - Learning Transformer-based World Models with Contrastive Predictive Coding [58.0159270859475]
We show that the next state prediction objective is insufficient to fully exploit the representation capabilities of Transformers.
We propose to extend world model predictions to longer time horizons by introducing TWISTER, a world model using action-conditioned Contrastive Predictive Coding.
TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ look-ahead search.
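Contrastive Predictive Coding objectives of this kind are usually trained with an InfoNCE loss that scores each predicted latent against its true future state, with the rest of the batch serving as negatives. A minimal sketch of such a loss, not TWISTER's exact implementation, is:

```python
# InfoNCE loss for contrastive predictive coding: each predicted latent should
# score highest against its own future state; other batch elements act as
# negatives. Generic sketch, not TWISTER's exact formulation.
import torch
import torch.nn.functional as F

def info_nce(predictions: torch.Tensor, targets: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    # predictions, targets: (batch, dim) latent vectors
    predictions = F.normalize(predictions, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = predictions @ targets.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0))           # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```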
arXiv Detail & Related papers (2025-03-06T13:18:37Z) - NeoBERT: A Next-Generation BERT [9.673882259199278]
NeoBERT is a next-generation encoder that redefines the capabilities of bidirectional models.
We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.
arXiv Detail & Related papers (2025-02-26T22:00:22Z) - Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving [55.93813178692077]
We present RoboBEV, an extensive benchmark suite designed to evaluate the resilience of BEV algorithms. We assess 33 state-of-the-art BEV-based perception models spanning tasks like detection, map segmentation, depth estimation, and occupancy prediction. Our experimental results also underline the efficacy of strategies like pre-training and depth-free BEV transformations in enhancing robustness against out-of-distribution data.
arXiv Detail & Related papers (2024-05-27T17:59:39Z) - Data-Efficient French Language Modeling with CamemBERTa [0.0]
We introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective.
We evaluate our model's performance on a variety of French downstream tasks and datasets.
arXiv Detail & Related papers (2023-06-02T12:45:34Z) - oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain models that are between 3.8 and 24.3 times faster, without requiring expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
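Sparse recipes of this kind rest on magnitude pruning of the weight matrices, usually followed by distillation to recover accuracy. A minimal sketch of the pruning step, with an illustrative sparsity level rather than oBERTa's actual regimes, is:

```python
# Unstructured magnitude pruning of a linear layer's weights, the kind of
# sparsification oBERTa-style recipes build on. The sparsity level is an
# illustrative choice, not a value from the paper.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(768, 768)
prune.l1_unstructured(layer, name="weight", amount=0.9)   # zero out 90% of weights
print((layer.weight == 0).float().mean().item())          # ~0.9 sparsity
```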
arXiv Detail & Related papers (2023-03-30T01:37:19Z) - DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We find that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
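Gradient-disentangled embedding sharing lets the discriminator reuse the generator's token embeddings through a stop-gradient while learning only a small residual embedding of its own, so the discriminator's gradients cannot pull the shared table against the MLM objective. A rough PyTorch sketch of that idea, based on the paper's description rather than the official implementation, is:

```python
# Sketch of gradient-disentangled embedding sharing (GDES): the discriminator
# reuses the generator's embedding table via a stop-gradient and learns only a
# zero-initialized residual on top. Illustrative, not the official code.
import torch
import torch.nn as nn

class GDESEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.generator_embed = nn.Embedding(vocab_size, hidden_size)
        # Residual embedding owned by the discriminator, initialized to zero.
        self.delta_embed = nn.Embedding(vocab_size, hidden_size)
        nn.init.zeros_(self.delta_embed.weight)

    def forward(self, input_ids: torch.Tensor, for_discriminator: bool) -> torch.Tensor:
        if for_discriminator:
            shared = self.generator_embed(input_ids).detach()  # stop-gradient
            return shared + self.delta_embed(input_ids)
        return self.generator_embed(input_ids)

emb = GDESEmbedding(vocab_size=32000, hidden_size=768)
disc_inputs = emb(torch.randint(0, 32000, (2, 16)), for_discriminator=True)
```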
arXiv Detail & Related papers (2021-11-18T06:48:00Z) - Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low churn training against a number of recent baselines.
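In practice, using the base model as the teacher amounts to adding a KL term between the new model's predictions and the old model's, alongside the usual label loss. A minimal sketch of that combined objective, with a hypothetical mixing weight alpha, is:

```python
# Distillation-style churn reduction: train the new model on labels while
# penalizing divergence from the frozen base (teacher) model's predictions.
# alpha is a hypothetical mixing weight, not a value from the paper.
import torch
import torch.nn.functional as F

def churn_distill_loss(student_logits, teacher_logits, labels, alpha: float = 0.5):
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    return alpha * ce + (1.0 - alpha) * kl

loss = churn_distill_loss(torch.randn(4, 10), torch.randn(4, 10),
                          torch.randint(0, 10, (4,)))
```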
arXiv Detail & Related papers (2021-06-04T18:03:31Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
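Item Response Theory models the probability that a given model answers a given example correctly as a function of a latent model ability and per-item parameters. A two-parameter logistic (2PL) form, shown here as a generic illustration rather than the paper's exact parameterization, is:

```python
# Two-parameter logistic (2PL) IRT model: probability that model j answers item
# i correctly, given the model's ability theta_j and the item's discrimination
# a_i and difficulty b_i. Generic illustration, not the paper's parameterization.
import math

def p_correct(theta_j: float, a_i: float, b_i: float) -> float:
    return 1.0 / (1.0 + math.exp(-a_i * (theta_j - b_i)))

# A strong model (high theta) on a hard, discriminative item (made-up numbers).
print(p_correct(theta_j=1.5, a_i=2.0, b_i=1.0))  # ~0.73
```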
arXiv Detail & Related papers (2021-06-01T22:33:53Z) - How Effective is Task-Agnostic Data Augmentation for Pretrained Transformers? [7.727662147015879]
Task-agnostic forms of data augmentation have proven widely effective in computer vision, even on pretrained models.
We ask how effective these techniques really are when applied to pretrained transformers.
We observe a negative result, finding that techniques previously reported to yield strong improvements for non-pretrained models fail to consistently improve performance for pretrained transformers.
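Task-agnostic augmentations of this kind are typically simple input perturbations such as random word dropout or adjacent-word swaps. The sketch below shows that general style of augmentation, not the specific techniques evaluated in the paper.

```python
# Simple task-agnostic text augmentations: random word dropout and adjacent-word
# swaps. Generic examples, not the paper's exact techniques.
import random

def word_dropout(text: str, p: float = 0.1) -> str:
    words = text.split()
    kept = [w for w in words if random.random() > p]
    return " ".join(kept) if kept else text

def word_swap(text: str) -> str:
    words = text.split()
    if len(words) < 2:
        return text
    i = random.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(word_dropout("task agnostic augmentation applied to a training sentence"))
print(word_swap("task agnostic augmentation applied to a training sentence"))
```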
arXiv Detail & Related papers (2020-10-05T03:55:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.