TurkEmbed: Turkish Embedding Model on NLI & STS Tasks
- URL: http://arxiv.org/abs/2511.08376v1
- Date: Wed, 12 Nov 2025 01:56:21 GMT
- Title: TurkEmbed: Turkish Embedding Model on NLI & STS Tasks
- Authors: Özay Ezerceli, Gizem Gümüşçekiçci, Tuğba Erkoç, Berke Özenç,
- Abstract summary: TurkEmbed is a novel Turkish language embedding model designed to outperform existing models. It utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning. It surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces TurkEmbed, a novel Turkish language embedding model designed to outperform existing models, particularly in Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. Current Turkish embedding models often rely on machine-translated datasets, potentially limiting their accuracy and semantic understanding. TurkEmbed utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning, to achieve more robust and accurate embeddings. This approach enables the model to adapt to various resource-constrained environments, offering faster encoding capabilities. Our evaluation on the Turkish STS-b-TR dataset, using Pearson and Spearman correlation metrics, demonstrates significant improvements in semantic similarity tasks. Furthermore, TurkEmbed surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement. TurkEmbed promises to enhance the Turkish NLP ecosystem by providing a more nuanced understanding of language and facilitating advancements in downstream applications.
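The abstract pairs matryoshka representation learning with Pearson and Spearman evaluation on STS data. A minimal sketch of how such an evaluation might look is given here, assuming the usual matryoshka convention that the leading coordinates of an embedding form a usable smaller embedding. The function names, the toy vectors, and the gold scores are illustrative assumptions; none of them come from the paper or from TurkEmbed's released code.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def evaluate_sts(pairs, gold, dim):
    """Score sentence-pair embeddings at a truncated dimension.

    `pairs` holds (embedding_a, embedding_b) tuples and `gold` holds the
    human similarity scores; returns (pearson, spearman).
    """
    preds = [
        cosine(truncate_and_normalize(a, dim), truncate_and_normalize(b, dim))
        for a, b in pairs
    ]
    return pearson(preds, gold), spearman(preds, gold)


# Toy data (made up for illustration): three pairs of 4-dimensional embeddings.
toy_pairs = [
    ([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]),
    ([1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]),
    ([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]),
]
toy_gold = [5.0, 3.0, 0.0]
toy_pearson, toy_spearman = evaluate_sts(toy_pairs, toy_gold, dim=2)
```

Running `evaluate_sts` at several values of `dim` would show how much correlation is lost as the embedding shrinks, which is the trade-off the abstract alludes to for resource-constrained settings.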
Related papers
- Approaches to Semantic Textual Similarity in Slovak Language: From Algorithms to Transformers [0.0]
This paper presents a comparative evaluation of sentence-level STS methods applied to Slovak. We trained several machine learning models using outputs from traditional algorithms as features. We also evaluated several third-party tools, including a fine-tuned model by CloudNLP, OpenAI's embedding models, the GPT-4 model, and the pretrained SlovakBERT model.
arXiv Detail & Related papers (2026-02-04T15:35:16Z) - TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task [0.0]
We introduce TurkEmbed4Retrieval, a retrieval-specialized variant of the TurkEmbed model. Our model outperforms Turkish colBERT by 19.26% on key retrieval metrics for the Scifact TR dataset.
arXiv Detail & Related papers (2025-11-10T20:08:09Z) - Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications [0.0]
This paper introduces Turk-LettuceDetect, the first suite of hallucination detection models specifically designed for Turkish RAG applications. These models were trained on a machine-translated version of the RAGTruth benchmark dataset containing 17,790 instances across question answering, data-to-text generation, and summarization tasks. Our experimental results show that the ModernBERT-based model achieves an F1-score of 0.7266 on the complete test set, with particularly strong performance on structured tasks.
arXiv Detail & Related papers (2025-09-22T12:14:11Z) - KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization [64.1520245849231]
This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems and end-to-end (E2E) Speech Translation systems. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently.
arXiv Detail & Related papers (2025-05-26T08:38:02Z) - Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages [0.43498389175652036]
This study integrates traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. We demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. While the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters.
arXiv Detail & Related papers (2025-03-30T18:03:52Z) - CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z) - Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks [0.0]
We provide a Transformer-based model and a baseline benchmark for the Turkish Language.
We successfully fine-tuned a Turkish BERT model, namely BERTurk, on many downstream tasks and evaluated it with the Turkish Benchmark dataset.
arXiv Detail & Related papers (2024-01-30T19:27:04Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as inputs the sentences of textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - Improving Massively Multilingual ASR With Auxiliary CTC Objectives [40.10307386370194]
We introduce our work on improving performance on FLEURS, a 102-language open ASR benchmark.
We investigate techniques inspired by recent Connectionist Temporal Classification (CTC) studies to help the model handle the large number of languages.
Our state-of-the-art systems using self-supervised models with the Conformer architecture improve over the results of prior work on FLEURS by a relative 28.4% CER.
arXiv Detail & Related papers (2023-02-24T18:59:51Z) - Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z) - Structured Prediction as Translation between Augmented Natural Languages [109.50236248762877]
We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks.
Instead of tackling the problem by training task-specific discriminative classifiers, we frame it as a translation task between augmented natural languages.
Our approach can match or outperform task-specific models on all tasks, and in particular, achieves new state-of-the-art results on joint entity and relation extraction.
arXiv Detail & Related papers (2021-01-14T18:32:21Z) - Coreferential Reasoning Learning for Language Representation [88.14248323659267]
We present CorefBERT, a novel language representation model that can capture the coreferential relations in context.
The experimental results show that, compared with existing baseline models, CorefBERT can achieve significant improvements consistently on various downstream NLP tasks.
arXiv Detail & Related papers (2020-04-15T03:57:45Z)