TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for
Multilingual Tweet Representations at Twitter
- URL: http://arxiv.org/abs/2209.07562v3
- Date: Sun, 27 Aug 2023 02:42:16 GMT
- Title: TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for
Multilingual Tweet Representations at Twitter
- Authors: Xinyang Zhang, Yury Malkov, Omar Florez, Serim Park, Brian McWilliams,
Jiawei Han, Ahmed El-Kishky
- Abstract summary: We present TwHIN-BERT, a multilingual language model productionized at Twitter.
Our model is trained on 7 billion tweets covering over 100 distinct languages.
We evaluate our model on various multilingual social recommendation and semantic understanding tasks.
- Score: 31.698196219228024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (PLMs) are fundamental for natural language
processing applications. Most existing PLMs are not tailored to the noisy
user-generated text on social media, and the pre-training does not factor in
the valuable social engagement logs available in a social network. We present
TwHIN-BERT, a multilingual language model productionized at Twitter, trained on
in-domain data from the popular social network. TwHIN-BERT differs from prior
pre-trained language models as it is trained with not only text-based
self-supervision, but also with a social objective based on the rich social
engagements within a Twitter heterogeneous information network (TwHIN). Our
model is trained on 7 billion tweets covering over 100 distinct languages,
providing a valuable representation to model short, noisy, user-generated text.
We evaluate our model on various multilingual social recommendation and
semantic understanding tasks and demonstrate significant metric improvement
over established pre-trained language models. We open-source TwHIN-BERT and our
curated hashtag prediction and social engagement benchmark datasets to the
research community.
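For readers who want to try the released model, here is a minimal sketch of extracting tweet embeddings with Hugging Face Transformers. It assumes the open-sourced checkpoint is published under an identifier such as Twitter/twhin-bert-base and uses simple mean pooling over token states; both the identifier and the pooling choice are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: embed short, noisy, multilingual tweets with TwHIN-BERT.
# Assumes the checkpoint identifier "Twitter/twhin-bert-base" (substitute the
# identifier from the actual open-source release if it differs).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Twitter/twhin-bert-base")
model = AutoModel.from_pretrained("Twitter/twhin-bert-base")
model.eval()

tweets = [
    "Excited to watch the match tonight! #worldcup",
    "¡Qué golazo! No lo puedo creer",  # multilingual, informal, short text
]

with torch.no_grad():
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch)
    # Mean-pool token states (ignoring padding) to get one vector per tweet.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings.shape)  # (2, hidden_size), e.g. (2, 768) for a base-sized model
```

Cosine similarity between such pooled vectors is one natural way to use them for retrieval-style evaluations; the paper's recommended pooling may differ.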
Related papers
- Measuring Social Norms of Large Language Models [13.648679166997693]
We present a new challenge to examine whether large language models understand social norms.
Our dataset features the largest set of social norm skills, consisting of 402 skills and 12,383 questions.
We propose a multi-agent framework based on large language models to improve the models' ability to understand social norms.
arXiv Detail & Related papers (2024-04-03T05:58:57Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm start from a pretrained textual language model.
We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- BERTuit: Understanding Spanish language in Twitter through a native transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z)
- LMSOC: An Approach for Socially Sensitive Pretraining [4.857837729560728]
We propose a simple but effective approach to incorporate speaker social context into the learned representations of large-scale language models.
Our method first learns dense representations of social contexts using graph representation learning algorithms and then primes language model pretraining with these social context representations (see the sketch after this list).
arXiv Detail & Related papers (2021-10-20T00:10:37Z)
- Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction [1.14219428942199]
We evaluate a simple approach to improving zero-shot multilingual transfer of mBERT on a social media corpus.
Our approach assumes access to translations between source-target language pairs.
We show improvements from TPP pretraining over mBERT alone in zero-shot transfer from English to Hindi, Arabic, and Japanese.
arXiv Detail & Related papers (2021-10-20T00:06:26Z)
- Neural Models for Offensive Language Detection [0.0]
Offensive language detection is an ever-growing natural language processing (NLP) application.
This thesis takes on the important and challenging goal of improving and comparing machine learning models for detecting such harmful content.
arXiv Detail & Related papers (2021-05-30T13:02:45Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study provides an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis [0.0]
We introduce two TweetBERT models, which are domain-specific language representation models, pre-trained on millions of tweets.
We show that the TweetBERT models significantly outperform the traditional BERT models in Twitter text mining tasks by more than 7% on each Twitter dataset.
arXiv Detail & Related papers (2020-10-17T00:45:02Z)
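As referenced in the LMSOC entry above, the following is a minimal sketch of socially sensitive priming under stated assumptions: a toy user graph is embedded with a spectral (Laplacian eigenvector) method as a stand-in for the paper's graph representation learning step, and the resulting social-context vector is prepended to the token sequence of a small Transformer encoder before language-model pretraining. All names, dimensions, and the spectral embedding itself are illustrative choices, not the authors' implementation.

```python
# Sketch of LMSOC-style social-context priming (illustrative, not the paper's code).
import numpy as np
import torch
import torch.nn as nn

# --- Step 1: embed social contexts from a toy user-user adjacency matrix -------
adjacency = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
# Low-dimensional spectral embedding: eigenvectors of the graph Laplacian,
# skipping the trivial constant eigenvector.
_, eigvecs = np.linalg.eigh(laplacian)
social_embeddings = torch.tensor(eigvecs[:, 1:3], dtype=torch.float32)  # (users, 2)

# --- Step 2: prime a (toy) language model with the social-context vector -------
class SociallyPrimedEncoder(nn.Module):
    def __init__(self, social_table, vocab_size=1000, hidden=32):
        super().__init__()
        self.register_buffer("social_table", social_table)
        self.tok = nn.Embedding(vocab_size, hidden)
        self.social_proj = nn.Linear(social_table.shape[1], hidden)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, token_ids, user_ids):
        tokens = self.tok(token_ids)                               # (B, T, H)
        social = self.social_proj(self.social_table[user_ids])     # (B, H)
        # Prepend the social context as a pseudo-token so attention can condition on it.
        primed = torch.cat([social.unsqueeze(1), tokens], dim=1)   # (B, T+1, H)
        return self.encoder(primed)

model = SociallyPrimedEncoder(social_embeddings)
out = model(torch.randint(0, 1000, (2, 8)), torch.tensor([0, 3]))
print(out.shape)  # torch.Size([2, 9, 32])
```

Prepending the social vector as a pseudo-token lets self-attention condition every token representation on the speaker's social context; during actual pretraining, a masked-language-model head and loss would sit on top of the encoder output.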