TextClass Benchmark: A Continuous Elo Rating of LLMs in Social Sciences
- URL: http://arxiv.org/abs/2412.00539v2
- Date: Sat, 07 Dec 2024 00:00:28 GMT
- Title: TextClass Benchmark: A Continuous Elo Rating of LLMs in Social Sciences
- Authors: Bastián González-Bustamante
- Abstract summary: The TextClass Benchmark project aims to provide a comprehensive, fair, and dynamic evaluation of LLMs and transformers for text classification tasks.
This evaluation spans various domains and languages in social sciences disciplines engaged in NLP and text-as-data approaches.
The leaderboards present performance metrics and relative ranking using a tailored Elo rating system.
- Abstract: The TextClass Benchmark project is an ongoing, continuous benchmarking process that aims to provide a comprehensive, fair, and dynamic evaluation of LLMs and transformers for text classification tasks. This evaluation spans various domains and languages in social sciences disciplines engaged in NLP and text-as-data approaches. The leaderboards present performance metrics and relative rankings using a tailored Elo rating system. With each leaderboard cycle, novel models are added, fixed test sets can be replaced with unseen, equivalent data to test generalisation power, ratings are updated, and a Meta-Elo leaderboard combines and weights the domain-specific leaderboards. This article presents the rationale and motivation behind the project, explains the Elo rating system in detail, and estimates Meta-Elo across different classification tasks in social science disciplines. We also present a snapshot of the first cycle of classification tasks on incivility data in Chinese, English, German and Russian. This ongoing benchmarking process will incorporate not only additional languages, such as Arabic, Hindi, and Spanish, but also further classification tasks, such as policy agenda topics and misinformation, among others.
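The abstract does not specify how the project's "tailored" Elo system differs from the standard one, but the underlying update rule common to such leaderboards can be sketched as follows; the function name and K-factor are illustrative, not taken from the paper:

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """One Elo update after a pairwise comparison of two models.

    score_a is 1.0 if model A wins the comparison, 0.5 for a tie,
    and 0.0 for a loss. Returns the updated (rating_a, rating_b).
    """
    # Expected score of A under the logistic Elo curve (400-point scale).
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    expected_b = 1.0 - expected_a
    score_b = 1.0 - score_a
    # Each rating moves toward the observed result, scaled by K.
    return (rating_a + k * (score_a - expected_a),
            rating_b + k * (score_b - expected_b))


# Two equally rated models; A wins, so 16 points transfer at K=32.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

A Meta-Elo combining domain-specific leaderboards would then be some weighted aggregate of per-domain ratings; the weighting scheme is described in the article itself.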
Related papers
- FaMTEB: Massive Text Embedding Benchmark in Persian Language [9.204800002382042]
This paper introduces a comprehensive benchmark for Persian (Farsi) text embeddings built upon the Massive Text Embedding Benchmark (MTEB).
Our benchmark includes 63 datasets spanning seven different tasks.
We evaluate the performance of several Persian and multilingual embedding models in a range of tasks.
arXiv Detail & Related papers (2025-02-17T09:05:21Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z) - Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards: bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
arXiv Detail & Related papers (2021-12-08T06:34:58Z) - Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers [0.05857406612420462]
Large-scale, pre-trained language models achieve human-level and superhuman accuracy on existing language understanding tasks.
We propose evaluating systems through a novel measure of prediction coherence.
arXiv Detail & Related papers (2021-09-10T15:04:23Z) - GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation [83.10599735938618]
Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository.
This work introduces GENIE, a human evaluation leaderboard that brings the ease of leaderboards to text generation tasks.
arXiv Detail & Related papers (2021-01-17T00:40:47Z) - TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification [22.265865542786084]
We propose a new evaluation framework (TweetEval) consisting of seven heterogeneous Twitter-specific classification tasks.
Our initial experiments show the effectiveness of starting off with existing pre-trained generic language models.
arXiv Detail & Related papers (2020-09-10T22:18:57Z) - Rank over Class: The Untapped Potential of Ranking in Natural Language Processing [8.637110868126546]
We argue that many tasks which are currently addressed using classification are in fact being shoehorned into a classification mould.
We propose a novel end-to-end ranking approach consisting of a Transformer network responsible for producing representations for a pair of text sequences.
In an experiment on a heavily-skewed sentiment analysis dataset, converting ranking results to classification labels yields an approximately 22% improvement over state-of-the-art text classification.
arXiv Detail & Related papers (2020-09-10T22:18:57Z) - Overview of the TREC 2019 Fair Ranking Track [65.15263872493799]
The goal of the TREC Fair Ranking track was to develop a benchmark for evaluating retrieval systems in terms of fairness to different content providers.
This paper presents an overview of the track, including the task definition, descriptions of the data and the annotation process.
arXiv Detail & Related papers (2020-03-25T21:34:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.