TurkishBERTweet: Fast and Reliable Large Language Model for Social Media
Analysis
- URL: http://arxiv.org/abs/2311.18063v1
- Date: Wed, 29 Nov 2023 20:22:44 GMT
- Title: TurkishBERTweet: Fast and Reliable Large Language Model for Social Media
Analysis
- Authors: Ali Najafi and Onur Varol
- Abstract summary: We introduce TurkishBERTweet, the first large-scale pre-trained language model for Turkish social media, built using almost 900 million tweets.
The model shares the same architecture as the base BERT model but with a smaller input length, making TurkishBERTweet lighter than BERTurk.
We demonstrate that TurkishBERTweet outperforms the other available alternatives on generalizability, and its lower inference time gives a significant advantage in processing large-scale datasets.
- Score: 4.195270491854775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Turkish is one of the most popular languages in the world. Wide use of this
language on social media platforms such as Twitter, Instagram, or TikTok and
the strategic position of the country in world politics make it appealing for
social network researchers and industry. To address this need, we introduce
TurkishBERTweet, the first large-scale pre-trained language model for Turkish
social media, built using almost 900 million tweets. The model shares the same
architecture as the base BERT model but with a smaller input length, making
TurkishBERTweet lighter than BERTurk, so it can have significantly lower inference
time. We trained our model using the same approach as the RoBERTa model and
evaluated it on two text classification tasks: Sentiment Classification and Hate
Speech Detection. We demonstrate that TurkishBERTweet outperforms the other
available alternatives on generalizability, and its lower inference time gives a
significant advantage in processing large-scale datasets. We also compared our
models with the commercial OpenAI solutions in terms of cost and performance to
demonstrate that TurkishBERTweet is a scalable and cost-effective solution. As part of
our research, we released TurkishBERTweet and fine-tuned LoRA adapters for the
mentioned tasks under the MIT License to facilitate future research and
applications on Turkish social media. Our TurkishBERTweet model is available
at: https://github.com/ViralLab/TurkishBERTweet
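The abstract mentions fine-tuned LoRA adapters without giving their configuration. As a rough illustration of why such adapters are cheap to train and distribute, the sketch below counts the trainable parameters a LoRA update adds to a single weight matrix. The hidden size of 768 is the standard BERT-base value; the rank of 8 is a hypothetical choice for illustration only, not a setting reported by the authors.

```python
def full_matrix_params(d_out: int, d_in: int) -> int:
    """Number of parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of parameters in a LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in), so the
    adapter adds rank * (d_out + d_in) trainable parameters while the
    original matrix stays frozen.
    """
    return rank * (d_out + d_in)


hidden = 768  # hidden size of BERT-base
rank = 8      # hypothetical LoRA rank, assumed for illustration

full = full_matrix_params(hidden, hidden)
lora = lora_trainable_params(hidden, hidden, rank)

print(f"dense matrix parameters: {full}")
print(f"LoRA adapter parameters: {lora}")
print(f"adapter / dense ratio:   {lora / full:.4f}")
```

Under these assumptions the adapter for one 768 x 768 matrix holds about 2% of the parameters of the matrix it modifies, which is why adapters can be shipped separately from the base model.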
Related papers
- SindBERT, the Sailor: Charting the Seas of Turkish NLP [0.05570276034354691]
SindBERT is trained from scratch on 312 GB of Turkish text.
We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark.
arXiv Detail & Related papers (2025-10-24T11:48:49Z)
- Emotion Recognition for Low-Resource Turkish: Fine-Tuning BERTurk on TREMO and Testing on Xenophobic Political Discourse [0.0]
This study examines the term Sessiz Istila (Silent Invasion) on Turkish social media, highlighting the rise of anti-refugee sentiment amidst the Syrian refugee influx.
Using BERTurk and the TREMO dataset, we developed an advanced Emotion Recognition Model (ERM) tailored for Turkish.
arXiv Detail & Related papers (2025-05-17T22:38:18Z)
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- Introducing cosmosGPT: Monolingual Training for Turkish Language Models [0.0]
This study introduces the cosmosGPT models that we created with this alternative method.
We then introduce new finetune datasets for basic language models to fulfill user requests and new evaluation datasets for measuring the capabilities of Turkish language models.
The results show that the language models we built with the monolingual corpus have promising performance despite being about 10 times smaller than the others.
arXiv Detail & Related papers (2024-04-26T11:34:11Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- RoBERTweet: A BERT Language Model for Romanian Tweets [0.15293427903448023]
This article introduces RoBERTweet, the first Transformer architecture trained on Romanian tweets.
The corpus used for pre-training the models represents a novelty for the Romanian NLP community.
Experiments show that RoBERTweet models outperform the previous general-domain Romanian and multilingual language models on three NLP tasks with tweet inputs.
arXiv Detail & Related papers (2023-06-11T06:11:56Z)
- Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model.
We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z)
- HuBERT-TR: Reviving Turkish Automatic Speech Recognition with Self-supervised Speech Representation Learning [10.378738776547815]
We present HuBERT-TR, a speech representation model for Turkish based on HuBERT.
HuBERT-TR achieves state-of-the-art results on several Turkish ASR datasets.
arXiv Detail & Related papers (2022-10-13T19:46:39Z)
- TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter [31.698196219228024]
We present TwHIN-BERT, a multilingual language model productionized at Twitter.
Our model is trained on 7 billion tweets covering over 100 distinct languages.
We evaluate our model on various multilingual social recommendation and semantic understanding tasks.
arXiv Detail & Related papers (2022-09-15T19:01:21Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application [56.1830016521422]
We propose NewsBERT, which can distill pre-trained language models for efficient and effective news intelligence.
In our approach, we design a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models.
In our experiments, NewsBERT can effectively improve the model performance in various intelligent news applications with much smaller models.
arXiv Detail & Related papers (2021-02-09T15:41:12Z)
- TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis [0.0]
We introduce two TweetBERT models, which are domain-specific language representation models pre-trained on millions of tweets.
We show that the TweetBERT models significantly outperform the traditional BERT models in Twitter text mining tasks by more than 7% on each Twitter dataset.
arXiv Detail & Related papers (2020-10-17T00:45:02Z)
- InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.