TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
- URL: http://arxiv.org/abs/2511.16528v1
- Date: Thu, 20 Nov 2025 16:42:21 GMT
- Title: TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
- Authors: Özay Ezerceli, Mahmoud El Hussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu, Yusuf Çelebi, Yağız Asker,
- Abstract summary: We introduce TurkColBERT, the first benchmark comparing dense encoders and late-interaction models for Turkish retrieval.<n>Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers.<n>We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600$\times$ smaller than the 600M turkish-e5-large dense encoder while preserving over 71\% of its average mAP. Late-interaction models that are 3--5$\times$ smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8\% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33$\times$ faster than PLAID and offers +1.7\% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets ($\leq$50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
Related papers
- TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish [0.7233065479782755]
TabiBERT is a monolingual Turkish encoder based on ModernBERT architecture trained from scratch on a large, curated corpus.<n>It supports 8,192-token context length (16x original BERT), achieves up to 2.65x speedup, and reduces GPU memory consumption.<n>It attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art on five of eight categories.
arXiv Detail & Related papers (2025-12-28T20:18:22Z) - TurkEmbed: Turkish Embedding Model on NLI & STS Tasks [0.0]
TurkEmbed is a novel Turkish language embedding model designed to outperform existing models.<n>It utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning.<n>It surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement.
arXiv Detail & Related papers (2025-11-11T15:54:52Z) - SindBERT, the Sailor: Charting the Seas of Turkish NLP [0.05570276034354691]
SindBERT is trained from scratch on 312 GB of Turkish text.<n>We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark.
arXiv Detail & Related papers (2025-10-24T11:48:49Z) - Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications [0.0]
This paper introduces Turk-LettuceDetect, the first suite of hallucination detection models specifically designed for Turkish RAG applications.<n>These models were trained on a machine-translated version of the RAGTruth benchmark dataset containing 17,790 instances across question answering, data-to-text generation, and summarization tasks.<n>Our experimental results show that the ModernBERT-based model achieves an F1-score of 0.7266 on the complete test set, with particularly strong performance on structured tasks.
arXiv Detail & Related papers (2025-09-22T12:14:11Z) - mmBERT: A Modern Multilingual Encoder with Annealed Language Learning [57.58071656545661]
mmBERT is an encoder-only language model pretrained on 3T tokens of multilingual text.<n>We add over 1700 low-resource languages to the data mix only during the decay phase.<n>We show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks.
arXiv Detail & Related papers (2025-09-08T17:08:42Z) - MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm [60.14048367611333]
MonkeyOCR is a vision-language model for document parsing.<n>It advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm.
arXiv Detail & Related papers (2025-06-05T16:34:57Z) - Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval [49.1574468325115]
We introduce Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones.<n>Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10.<n>More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M) remain competitive while being over 13x smaller.
arXiv Detail & Related papers (2025-05-25T23:06:20Z) - NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an expert'' of speech-to-text, language-to-text and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR)
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.