TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task
- URL: http://arxiv.org/abs/2511.07595v1
- Date: Wed, 12 Nov 2025 01:06:07 GMT
- Title: TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task
- Authors: Özay Ezerceli, Gizem Gümüşçekiçci, Tuğba Erkoç, Berke Özenç,
- Abstract summary: We introduce TurkEmbed4Retrieval, a retrieval-specialized variant of the TurkEmbed model. Our model outperforms Turkish colBERT by 19.26% on key retrieval metrics for the Scifact TR dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we introduce TurkEmbed4Retrieval, a retrieval-specialized variant of the TurkEmbed model, which was originally designed for Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. By fine-tuning the base model on the MS MARCO TR dataset using advanced training techniques, including Matryoshka representation learning and a tailored multiple negatives ranking loss, we achieve state-of-the-art (SOTA) performance on Turkish retrieval tasks. Extensive experiments demonstrate that our model outperforms Turkish colBERT by 19.26% on key retrieval metrics for the Scifact TR dataset, thereby establishing a new benchmark for Turkish information retrieval.
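The abstract names two training techniques, Matryoshka representation learning and a multiple negatives ranking loss, without giving details. The following is a minimal sketch of how the two are commonly combined, not the authors' implementation: the "tailored" variant used in the paper is not specified here, so this shows the standard in-batch-negatives form. The function names, the `scale` value of 20.0, the prefix sizes in `dims`, and the toy 4-dimensional vectors are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(queries, passages, scale=20.0):
    """Multiple negatives ranking loss over one batch.

    passages[i] is the positive for queries[i]; every other passage in
    the batch serves as a negative. The loss is the mean cross-entropy
    of picking the positive among all passages.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in passages]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)

def matryoshka_mnr_loss(queries, passages, dims=(2, 4), scale=20.0):
    """Average the ranking loss over truncated embedding prefixes.

    Training on several prefix lengths encourages the leading
    dimensions to remain useful when the embedding is shortened.
    """
    losses = []
    for d in dims:
        q_d = [q[:d] for q in queries]
        p_d = [p[:d] for p in passages]
        losses.append(mnr_loss(q_d, p_d, scale))
    return sum(losses) / len(losses)

# Toy batch (hypothetical values): passages[i] is close to queries[i].
queries = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3], [0.5, 0.5, 1.0, 0.0]]
passages = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4], [0.4, 0.6, 0.9, 0.1]]

aligned = matryoshka_mnr_loss(queries, passages)
# Rotating the passages pairs each query with the wrong passage.
shuffled = matryoshka_mnr_loss(queries, passages[1:] + passages[:1])
print(aligned, shuffled)
```

In this toy batch the correctly paired passages give a much lower loss than the rotated pairing, which is the behavior the loss is meant to reward. In practice such a loss is computed on model outputs inside a training framework with automatic differentiation rather than on fixed lists.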
Related papers
- BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish [0.0]
We introduce BIRDTurk, the first Turkish adaptation of the BIRD benchmark.
BIRDTurk is constructed through a controlled translation pipeline that adapts schema identifiers to Turkish.
We evaluate inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning.
arXiv Detail & Related papers (2026-02-03T15:21:00Z)
- TurkEmbed: Turkish Embedding Model on NLI & STS Tasks [0.0]
TurkEmbed is a novel Turkish language embedding model designed to outperform existing models.
It utilizes a combination of diverse datasets and advanced training techniques, including Matryoshka representation learning.
It surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement.
arXiv Detail & Related papers (2025-11-11T15:54:52Z)
- A Large-Scale Dataset and Citation Intent Classification in Turkish with LLMs [0.0]
We first present a new, publicly available dataset of Turkish citation intents, created with a purpose-built annotation tool.
We then evaluate the performance of standard In-Context Learning with Large Language Models (LLMs), demonstrating that its effectiveness is limited by inconsistent results caused by manually designed prompts.
For final classification, we employ a stacked generalization ensemble to aggregate outputs from multiple optimized models, ensuring stable and reliable predictions.
arXiv Detail & Related papers (2025-09-26T05:44:04Z)
- Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications [0.0]
This paper introduces Turk-LettuceDetect, the first suite of hallucination detection models specifically designed for Turkish RAG applications.
These models were trained on a machine-translated version of the RAGTruth benchmark dataset containing 17,790 instances across question answering, data-to-text generation, and summarization tasks.
Our experimental results show that the ModernBERT-based model achieves an F1-score of 0.7266 on the complete test set, with particularly strong performance on structured tasks.
arXiv Detail & Related papers (2025-09-22T12:14:11Z)
- KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization [64.1520245849231]
This paper presents KIT's submissions to the IWSLT 2025 low-resource track.
We develop both cascaded systems and end-to-end (E2E) Speech Translation systems.
Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently.
arXiv Detail & Related papers (2025-05-26T08:38:02Z)
- Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
- Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks [0.0]
We provide a Transformer-based model and a baseline benchmark for the Turkish Language.
We successfully fine-tuned a Turkish BERT model, namely BERTurk, on many downstream tasks and evaluated it with the Turkish Benchmark dataset.
arXiv Detail & Related papers (2024-01-30T19:27:04Z)
- RoBERTurk: Adjusting RoBERTa for Turkish [0.0]
We pretrain RoBERTa on a Turkish corpus using a BPE tokenizer.
Our model outperforms BERTurk family models on the BOUN dataset for the POS task while resulting in underperformance on the IMST dataset for the same task and achieving competitive scores on the Turkish split of the XTREME dataset for the NER task.
arXiv Detail & Related papers (2024-01-07T15:13:24Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English - Portuguese and Tamasheq - French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- RoBLEURT Submission for the WMT2021 Metrics Task [72.26898579202076]
We present our submission to the Shared Metrics Task: RoBLEURT.
Our model reaches state-of-the-art correlations with the WMT 2020 human annotations on 8 out of 10 to-English language pairs.
arXiv Detail & Related papers (2022-04-28T08:49:40Z)
- The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney & JD's joint submission to the IWSLT 2021 low resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z)
- Conversational Question Reformulation via Sequence-to-Sequence Architectures and Pretrained Language Models [56.268862325167575]
This paper presents an empirical study of conversational question reformulation (CQR) with sequence-to-sequence architectures and pretrained language models (PLMs).
We leverage PLMs to address the strong token-to-token independence assumption made in the common objective, maximum likelihood estimation, for the CQR task.
We evaluate fine-tuned PLMs on the recently introduced CANARD dataset as an in-domain task and validate the models using data from the TREC 2019 CAsT Track as an out-of-domain task.
arXiv Detail & Related papers (2020-04-04T11:07:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.