TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish
- URL: http://arxiv.org/abs/2512.23065v3
- Date: Mon, 05 Jan 2026 10:15:27 GMT
- Title: TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish
- Authors: Melikşah Türker, A. Ebrar Kızıloğlu, Onur Güngör, Susan Üsküdarlı,
- Abstract summary: TabiBERT is a monolingual Turkish encoder based on ModernBERT architecture trained from scratch on a large, curated corpus.<n>It supports 8,192-token context length (16x original BERT), achieves up to 2.65x speedup, and reduces GPU memory consumption.<n>It attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art on five of eight categories.
- Score: 0.7233065479782755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since the inception of BERT, encoder-only Transformers have evolved significantly in computational efficiency, training stability, and long-context modeling. ModernBERT consolidates these advances by integrating Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Despite these developments, Turkish NLP lacks a monolingual encoder trained from scratch, incorporating such modern architectural paradigms. This work introduces TabiBERT, a monolingual Turkish encoder based on ModernBERT architecture trained from scratch on a large, curated corpus. TabiBERT is pre-trained on one trillion tokens sampled from an 84.88B token multi-domain corpus: web text (73%), scientific publications (20%), source code (6%), and mathematical content (0.3%). It supports 8,192-token context length (16x original BERT), achieves up to 2.65x inference speedup, and reduces GPU memory consumption, enabling larger batch sizes. We introduce TabiBench with 28 datasets across eight task categories with standardized splits and protocols, evaluated using GLUE-style macro-averaging. TabiBERT attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art on five of eight categories, with particularly strong gains on question answering (+9.55 points), code retrieval (+2.41 points), and academic understanding (+0.66 points). Compared with task-specific prior best results, including specialized models like TurkishBERTweet, TabiBERT achieves +1.47 average improvement, indicating robust cross-domain generalization. We release model weights, training configurations, and evaluation code for transparent, reproducible Turkish encoder research.
Related papers
- TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval [0.0]
We introduce TurkColBERT, the first benchmark comparing dense encoders and late-interaction models for Turkish retrieval.<n>Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers.<n>We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains.
arXiv Detail & Related papers (2025-11-20T16:42:21Z) - SindBERT, the Sailor: Charting the Seas of Turkish NLP [0.05570276034354691]
SindBERT is trained from scratch on 312 GB of Turkish text.<n>We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark.
arXiv Detail & Related papers (2025-10-24T11:48:49Z) - The Digital Sous Chef -- A Comparative Study on Fine-Tuning Language Models for Recipe Generation [2.497854684676663]
We present a comprehensive study contrasting a fine-tuned GPT-2 large (774M) model against the GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine corpus from RecipeDB.<n>Our key contribution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers.
arXiv Detail & Related papers (2025-08-20T13:53:13Z) - GeistBERT: Breathing Life into German NLP [0.22099217573031676]
GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus.<n>The model was trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens.<n>It achieved strong results across all tasks, leading among base models and setting a new state-of-the-art (SOTA) in GermEval 2018 fine text classification.
arXiv Detail & Related papers (2025-06-13T15:53:17Z) - Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval [49.1574468325115]
We introduce Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones.<n>Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10.<n>More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M) remain competitive while being over 13x smaller.
arXiv Detail & Related papers (2025-05-25T23:06:20Z) - MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining [10.421048804389343]
We introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining.
When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20.
This empirical speed up in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetune on existing generic models.
arXiv Detail & Related papers (2023-12-29T06:05:19Z) - Strategies for improving low resource speech to text translation relying
on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST)
We conducted experiments on both simulated and real-low resource setups, on language pairs English - Portuguese, and Tamasheq - French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z) - oBERTa: Improving Sparse Transfer Learning via improved initialization,
distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z) - DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with
Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z) - ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers large memory footprint and computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies.
The novel convolution heads, together with the rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
arXiv Detail & Related papers (2020-08-06T07:43:19Z) - DeBERTa: Decoding-enhanced BERT with Disentangled Attention [119.77305080520718]
We propose a new model architecture DeBERTa that improves the BERT and RoBERTa models using two novel techniques.
We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural langauge generation (NLG) downstream tasks.
arXiv Detail & Related papers (2020-06-05T19:54:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.