RoBERTurk: Adjusting RoBERTa for Turkish
- URL: http://arxiv.org/abs/2401.03515v1
- Date: Sun, 7 Jan 2024 15:13:24 GMT
- Title: RoBERTurk: Adjusting RoBERTa for Turkish
- Authors: Nuri Tas
- Abstract summary: We pretrain RoBERTa on a Turkish corpus using a BPE tokenizer.
Our model outperforms BERTurk family models on the BOUN dataset for the POS task, underperforms them on the IMST dataset for the same task, and achieves competitive scores on the Turkish split of the XTREME dataset for the NER task.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We pretrain RoBERTa on a Turkish corpus using a BPE tokenizer. Our model
outperforms BERTurk family models on the BOUN dataset for the POS task,
underperforms them on the IMST dataset for the same task, and achieves
competitive scores on the Turkish split of the XTREME dataset for the NER
task, all while being pretrained on less data than its competitors. We
release our pretrained model and tokenizer.
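The "BPE tokenizer" in the abstract refers to byte-pair encoding. As a rough illustration only, the sketch below learns BPE merges from a toy character-level word list. It is not the authors' tokenizer (the original RoBERTa uses a byte-level variant trained on a large corpus), and the function names `learn_bpe`, `merge_pair`, and `encode`, as well as the sample words, are assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy, character-level)."""
    # Each distinct word becomes a tuple of single-character symbols with a frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically largest pair,
        # which is an arbitrary choice made only to keep the result deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical sample words; a real tokenizer is trained on a full corpus.
    sample = ["evler", "evlerde", "evde", "okullar", "okullarda"]
    learned = learn_bpe(sample, num_merges=6)
    print(learned)
    print(encode("evlerde", learned))
```

Frequent character sequences end up as single tokens, which is one reason subword tokenizers are a common choice for morphologically rich languages such as Turkish.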
Related papers
- BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish [0.0]
We introduce BIRDTurk, the first Turkish adaptation of the BIRD benchmark.
BIRDTurk is constructed through a controlled translation pipeline that adapts schema identifiers to Turkish.
We evaluate inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning.
arXiv Detail & Related papers (2026-02-03T15:21:00Z) - TurkEmbed: Turkish Embedding Model on NLI & STS Tasks [0.0]
TurkEmbed is a novel Turkish language embedding model designed to outperform existing models.
It utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning.
It surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement.
arXiv Detail & Related papers (2025-11-11T15:54:52Z) - TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task [0.0]
We introduce TurkEmbed4Retrieval, a retrieval-specialized variant of the TurkEmbed model.
Our model outperforms Turkish colBERT by 19.26% on key retrieval metrics for the Scifact TR dataset.
arXiv Detail & Related papers (2025-11-10T20:08:09Z) - Estimating Time Series Foundation Model Transferability via In-Context Learning [74.65355820906355]
Time series foundation models (TSFMs) offer strong zero-shot forecasting via large-scale pre-training.
Fine-tuning remains critical for boosting performance in domains with limited public data.
We introduce TimeTic, a transferability estimation framework that recasts model selection as an in-context-learning problem.
arXiv Detail & Related papers (2025-09-28T07:07:13Z) - Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data [38.08600450054975]
We show that this performance can be significantly boosted by a targeted continued pre-training phase.
We demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior predictive downstream accuracy.
Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
arXiv Detail & Related papers (2025-07-05T09:39:07Z) - KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization [57.08591486199925]
This paper presents KIT's submissions to the IWSLT 2025 low-resource track.
We develop both cascaded systems and end-to-end (E2E) speech translation systems.
Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently.
arXiv Detail & Related papers (2025-05-26T08:38:02Z) - Federated Learning with Projected Trajectory Regularization [65.6266768678291]
Federated learning enables joint training of machine learning models from distributed clients without sharing their local data.
One key challenge in federated learning is to handle non-identically distributed data across the clients.
We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling the data issue.
arXiv Detail & Related papers (2023-12-22T02:12:08Z) - L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi [0.7874708385247353]
This work focuses on two low-resource Indian languages, Hindi and Marathi.
We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared using machine translation.
We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi.
arXiv Detail & Related papers (2022-11-21T05:15:48Z) - HuBERT-TR: Reviving Turkish Automatic Speech Recognition with Self-supervised Speech Representation Learning [10.378738776547815]
We present HuBERT-TR, a speech representation model for Turkish based on HuBERT.
HuBERT-TR achieves state-of-the-art results on several Turkish ASR datasets.
arXiv Detail & Related papers (2022-10-13T19:46:39Z) - SynBench: Task-Agnostic Benchmarking of Pretrained Representations using Synthetic Data [78.21197488065177]
Recent success in fine-tuning large models, pretrained on broad data at scale, on downstream tasks has led to a significant paradigm shift in deep learning.
This paper proposes a new task-agnostic framework, SynBench, to measure the quality of pretrained representations using synthetic data.
arXiv Detail & Related papers (2022-10-06T15:25:00Z) - Towards Efficient NLP: A Standard Evaluation and A Strong Baseline [55.29756535335831]
This work presents ELUE (Efficient Language Understanding Evaluation), a standard evaluation, and a public leaderboard for efficient NLP models.
Along with the benchmark, we also pre-train and release a strong baseline, ElasticBERT, whose elasticity is both static and dynamic.
arXiv Detail & Related papers (2021-10-13T21:17:15Z) - The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney & JD's joint submission to the IWSLT 2021 low-resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z) - An Empirical Study of Using Pre-trained BERT Models for Vietnamese Relation Extraction Task at VLSP 2020 [0.0]
We apply two state-of-the-art BERT-based models: R-BERT and BERT model with entity starts.
For each model, we compared two pre-trained BERT models: FPTAI/vibert and NlpHUST/vibert4news.
We found that the NlpHUST/vibert4news model significantly outperforms FPTAI/vibert on the Vietnamese relation extraction task.
arXiv Detail & Related papers (2020-12-18T14:53:49Z) - GottBERT: a pure German Language Model [0.0]
No German single-language RoBERTa model has been published yet; we introduce one in this work (GottBERT).
In an evaluation, we compare its performance on the two Named Entity Recognition (NER) tasks CoNLL 2003 and GermEval 2014, as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD, with existing German single-language BERT models and two multilingual ones.
GottBERT was successfully pre-trained on a 256-core TPU pod using the RoBERTa BASE architecture.
arXiv Detail & Related papers (2020-12-03T17:45:03Z) - ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT relies heavily on the global self-attention block and thus suffers from a large memory footprint and computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
arXiv Detail & Related papers (2020-08-06T07:43:19Z) - TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z) - Application of Pre-training Models in Named Entity Recognition [5.285449619478964]
We introduce the architecture and pre-training tasks of four common pre-training models: BERT, ERNIE, ERNIE2.0-tiny, and RoBERTa.
We apply these pre-training models to a NER task by fine-tuning, and compare the effects of the different model architectures and pre-training tasks on the NER task.
Experimental results showed that RoBERTa achieved state-of-the-art results on the MSRA-2006 dataset.
arXiv Detail & Related papers (2020-02-09T08:18:20Z)