SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research
- URL: http://arxiv.org/abs/2310.14757v1
- Date: Mon, 23 Oct 2023 09:48:25 GMT
- Title: SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research
- Authors: Dimosthenis Antypas, Asahi Ushio, Francesco Barbieri, Leonardo Neves,
Kiamehr Rezaee, Luis Espinosa-Anke, Jiaxin Pei, Jose Camacho-Collados
- Abstract summary: We introduce a unified benchmark for NLP evaluation in social media, SuperTweetEval.
We benchmarked the performance of a wide range of models on SuperTweetEval and our results suggest that, despite the recent advances in language modelling, social media remains challenging.
- Score: 33.698581876383074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite its relevance, the maturity of NLP for social media pales in
comparison with general-purpose models, metrics and benchmarks. This fragmented
landscape makes it hard for the community to know, for instance, given a task,
which is the best performing model and how it compares with others. To
alleviate this issue, we introduce a unified benchmark for NLP evaluation in
social media, SuperTweetEval, which includes a heterogeneous set of tasks and
datasets combined, adapted and constructed from scratch. We benchmarked the
performance of a wide range of models on SuperTweetEval and our results suggest
that, despite the recent advances in language modelling, social media remains
challenging.
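For concreteness, the sketch below shows one way to load a single SuperTweetEval task with the Hugging Face datasets library and score a trivial majority-class baseline. The Hub repository id, config name and column names used here are assumptions for illustration, not identifiers confirmed by the abstract; check the official release for the exact ones.
```python
# Minimal sketch: load one SuperTweetEval task and score a majority-class
# baseline. Repository id, config name and column names are assumptions.
from collections import Counter

from datasets import load_dataset

# Assumed Hub id and task/config name; not confirmed by the abstract above.
dataset = load_dataset("cardiffnlp/super_tweeteval", "tweet_sentiment")

train_labels = list(dataset["train"]["gold_label"])  # assumed label column
test_labels = list(dataset["test"]["gold_label"])

majority = Counter(train_labels).most_common(1)[0][0]
accuracy = sum(label == majority for label in test_labels) / len(test_labels)
print(f"majority-class accuracy on the test split: {accuracy:.3f}")
```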
Related papers
- BENCHAGENTS: Automated Benchmark Creation with Agent Interaction [16.4783894348333]
We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities.
We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation.
We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
arXiv Detail & Related papers (2024-10-29T22:56:18Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- A Large-Scale Evaluation of Speech Foundation Models [110.95827399522204]
We establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the foundation model paradigm for speech.
We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads.
arXiv Detail & Related papers (2024-04-15T00:03:16Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate the faithfulness of machine-generated text by computing the longest non-continuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- A Survey of Parameters Associated with the Quality of Benchmarks in NLP [24.6240575061124]
Recent studies have shown that models triumph over several popular benchmarks just by overfitting on spurious biases, without truly learning the desired task.
A potential solution to these issues -- a metric quantifying quality -- remains underexplored.
We take the first step towards a metric by identifying certain language properties that can represent various possible interactions leading to biases in a benchmark.
arXiv Detail & Related papers (2022-10-14T06:44:14Z)
- TempoWiC: An Evaluation Benchmark for Detecting Meaning Shift in Social Media [17.840417362892104]
We present TempoWiC, a new benchmark aimed at accelerating research in social media-based meaning shift.
Our results show that TempoWiC is a challenging benchmark, even for recently-released language models specialized in social media.
arXiv Detail & Related papers (2022-09-15T11:17:56Z)
- How not to Lie with a Benchmark: Rearranging NLP Leaderboards [0.0]
We examine the overall scoring methods of popular NLP benchmarks and rearrange the models by geometric and harmonic mean (a toy illustration of how the choice of mean can reorder models is given after this list).
We analyze several popular benchmarks including GLUE, SuperGLUE, XGLUE, and XTREME.
arXiv Detail & Related papers (2021-12-02T15:40:52Z)
- The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics [66.96150429230035]
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics.
Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models.
arXiv Detail & Related papers (2021-02-02T18:42:05Z)
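As referenced in the "How not to Lie with a Benchmark" entry above, the toy sketch below shows how the choice of aggregate (arithmetic, geometric or harmonic mean) can reorder two models over the same per-task scores. The scores are hypothetical and the snippet is an illustration, not code from that paper.
```python
# Toy illustration with hypothetical per-task scores: the choice of mean
# can flip a leaderboard when one model has a single very weak task.
from statistics import fmean, geometric_mean, harmonic_mean

scores = {
    "model_a": [0.95, 0.92, 0.25],  # strong on two tasks, near-failure on one
    "model_b": [0.70, 0.68, 0.66],  # consistently moderate
}

for name, s in scores.items():
    print(name,
          f"arithmetic={fmean(s):.3f}",
          f"geometric={geometric_mean(s):.3f}",
          f"harmonic={harmonic_mean(s):.3f}")

# model_a ranks first under the arithmetic mean, but model_b overtakes it
# under the geometric and harmonic means, which penalise the weak task more.
```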