TweetEval: Unified Benchmark and Comparative Evaluation for Tweet
Classification
- URL: http://arxiv.org/abs/2010.12421v2
- Date: Mon, 26 Oct 2020 09:14:54 GMT
- Title: TweetEval: Unified Benchmark and Comparative Evaluation for Tweet
Classification
- Authors: Francesco Barbieri and Jose Camacho-Collados and Leonardo Neves and
Luis Espinosa-Anke
- Abstract summary: We propose a new evaluation framework (TweetEval) consisting of seven heterogeneous Twitter-specific classification tasks.
Our initial experiments show the effectiveness of starting off with existing pre-trained generic language models.
- Score: 22.265865542786084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The experimental landscape in natural language processing for social media is
Therefore, it is unclear what the current state of the art is, as there is no
standardized evaluation protocol, neither a strong set of baselines trained on
such domain-specific data. In this paper, we propose a new evaluation framework
(TweetEval) consisting of seven heterogeneous Twitter-specific classification
tasks. We also provide a strong set of baselines as starting point, and compare
different language modeling pre-training strategies. Our initial experiments
show the effectiveness of starting off with existing pre-trained generic
language models, and continuing to train them on Twitter corpora.
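The abstract describes one concrete strategy: take an existing generic pre-trained language model and continue training it on Twitter corpora before fine-tuning on the TweetEval tasks. The sketch below illustrates that strategy with the Hugging Face transformers and datasets libraries; the roberta-base checkpoint, the tweets.txt corpus file, and the hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
# A minimal sketch (not the authors' exact pipeline) of the strategy from the
# abstract: start from a generic pre-trained LM and continue masked language
# modelling on a Twitter corpus, then fine-tune on a TweetEval task afterwards.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Hypothetical Twitter corpus stored as one tweet per line in a text file.
tweets = load_dataset("text", data_files={"train": "tweets.txt"})["train"]
tweets = tweets.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Standard masked-LM objective for continued pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="twitter-lm",
                           num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=tweets,
    data_collator=collator,
)
trainer.train()  # continued pre-training on tweets; fine-tuning comes next
```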
Related papers
- Ensembling Finetuned Language Models for Text Classification [55.15643209328513]
Finetuning is a common practice across different communities to adapt pretrained models to particular tasks.
Ensembles of neural networks are typically used to boost performance and provide reliable uncertainty estimates; a minimal prediction-averaging sketch appears after this list.
We present a metadataset with predictions from five large finetuned models on six datasets and report results of different ensembling strategies.
arXiv Detail & Related papers (2024-10-25T09:15:54Z)
- A Dataset and Strong Baselines for Classification of Czech News Texts [0.0]
We present the Czech News Classification dataset (CZE-NEC), one of the largest Czech classification datasets.
We define four classification tasks: news source, news category, inferred author's gender, and day of the week.
We show that language-specific pre-trained encoder models outperform selected commercially available large-scale generative language models.
arXiv Detail & Related papers (2023-07-20T07:47:08Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories, and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by large margins in few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- Are We Really Making Much Progress in Text Classification? A Comparative Review [2.579878570919875]
This study reviews and compares methods for single-label and multi-label text classification.
Results reveal that all recently proposed graph-based and hierarchy-based methods fail to outperform pre-trained language models.
arXiv Detail & Related papers (2022-04-08T09:28:20Z)
- Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, and there has been no significant update over the past decade.
This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z)
- Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers [0.05857406612420462]
Large-scale, pre-trained language models achieve human-level and superhuman accuracy on existing language understanding tasks.
We propose evaluating systems through a novel measure of prediction coherence.
arXiv Detail & Related papers (2021-09-10T15:04:23Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
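As a companion to the ensembling entry above, here is a minimal, library-agnostic sketch of the most common ensembling strategy for finetuned text classifiers: averaging per-class probabilities from several independently finetuned models. The function name and the toy probabilities are hypothetical; the paper's own metadataset and ensembling strategies may differ.

```python
# A minimal sketch (an assumption, not the paper's exact protocol) of
# prediction averaging: combine the per-class probabilities produced by
# several independently finetuned models of the same classification task.
import numpy as np

def ensemble_predict(prob_matrices):
    """Average class probabilities from several models.

    prob_matrices: list of arrays of shape (n_examples, n_classes),
    one per finetuned model, each row summing to 1.
    Returns the averaged probabilities and the predicted class ids.
    """
    avg = np.mean(np.stack(prob_matrices, axis=0), axis=0)
    return avg, avg.argmax(axis=1)

# Toy usage with three hypothetical models and two classes.
model_probs = [
    np.array([[0.9, 0.1], [0.4, 0.6]]),
    np.array([[0.8, 0.2], [0.3, 0.7]]),
    np.array([[0.7, 0.3], [0.6, 0.4]]),
]
probs, labels = ensemble_predict(model_probs)
print(probs, labels)  # averaged probabilities and argmax predictions
```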