Related papers: Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

URL: http://arxiv.org/abs/2501.07493v1
Date: Mon, 13 Jan 2025 17:12:38 GMT
Title: Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards
Authors: Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Ziyu Liu, Ion Stoica, Florian Tramèr, Chiyuan Zhang
Abstract summary: Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models. We show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes. Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than $95\%$ accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model.
Score: 93.16294577018482
License: http://creativecommons.org/licenses/by/4.0/
Abstract: It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than $95\%$ accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increase the cost of such attacks. Some of these defenses were present before our collaboration, such as bot protection with Cloudflare, malicious user detection, and rate limiting. Others, including reCAPTCHA and login, are being integrated to strengthen the security in Chatbot Arena.
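
The second step of the attack described in the abstract (voting consistently for a target once it has been identified) can be illustrated with a toy simulation. The following is a minimal sketch under stated assumptions, not the paper's attack code and not Chatbot Arena's actual scoring pipeline: it assumes a plain online Elo update, and the model names, true strengths, K-factor, and vote counts are all hypothetical. The identification step is not implemented; it is modeled only as a success probability, `id_accuracy`, and the attacker is assumed to abstain whenever identification fails.

```python
import random


def expected_score(r_a, r_b):
    """Elo-expected probability that a model rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(ratings, winner, loser, k=4.0):
    """Apply one zero-sum Elo update in place (k is a hypothetical K-factor)."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


def simulate(n_honest, n_attack, target, id_accuracy=0.95, seed=0):
    """Run honest votes, then adversarial votes that promote `target`."""
    rng = random.Random(seed)
    # Hypothetical "true" strengths that drive honest voters' preferences.
    strength = {"model_a": 1100.0, "model_b": 1050.0, "model_c": 1000.0}
    models = sorted(strength)
    ratings = {m: 1000.0 for m in models}

    # Honest phase: each battle's winner is sampled from the true strengths.
    for _ in range(n_honest):
        a, b = rng.sample(models, 2)
        if rng.random() < expected_score(strength[a], strength[b]):
            elo_update(ratings, a, b)
        else:
            elo_update(ratings, b, a)

    # Attack phase: the attacker votes for the target whenever it appears
    # in the battle and is correctly identified; otherwise the attacker
    # abstains (a simplifying assumption).
    for _ in range(n_attack):
        a, b = rng.sample(models, 2)
        if target in (a, b) and rng.random() < id_accuracy:
            other = b if a == target else a
            elo_update(ratings, target, other)
    return ratings


if __name__ == "__main__":
    baseline = simulate(n_honest=5000, n_attack=0, target="model_c")
    attacked = simulate(n_honest=5000, n_attack=1000, target="model_c")
    for name in sorted(baseline):
        print(f"{name}: {baseline[name]:7.1f} -> {attacked[name]:7.1f}")
```

In this toy setting the honest phase is identical across both runs (same seed), so any difference in the final ratings comes from the adversarial votes alone. Real leaderboards differ in scoring method, vote volume, and defenses, so the numbers here say nothing about the cost of an attack in practice.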

Related papers

Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models [66.51871176061195]
Decentralized Arena (dearena) is a fully automated framework leveraging collective intelligence from all large language models to evaluate each other. dearena attains up to 97% correlation with human judgements, while significantly reducing the cost.
arXiv Detail & Related papers (2025-05-19T07:34:25Z)
Improving Your Model Ranking on Chatbot Arena by Vote Rigging [43.28854307528825]
We show that crowdsourced voting can be rigged to improve the ranking of a target model $m_t$. We conduct experiments on around $1.7$ million votes from the Elo Arena platform. Our findings highlight the importance of continued efforts to prevent vote rigging.
arXiv Detail & Related papers (2025-01-29T18:57:29Z)
Adversarial Botometer: Adversarial Analysis for Social Bot Detection [1.9280536006736573]
Social bots produce content that mimics human creativity. Malicious social bots emerge to deceive people with their unrealistic content. We evaluate the behavior of a text-based bot detector in a competitive environment.
arXiv Detail & Related papers (2024-05-03T11:28:21Z)
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference [48.99117537559644]
We introduce Chatbot Arena, an open platform for evaluating Large Language Models (LLMs) based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using.
arXiv Detail & Related papers (2024-03-07T01:22:38Z)
My Brother Helps Me: Node Injection Based Adversarial Attack on Social Bot Detection [69.99192868521564]
Social platforms such as Twitter are under siege from a multitude of fraudulent users. Due to the structure of social networks, the majority of methods are based on graph neural networks (GNNs), which are susceptible to attacks. We propose a node injection-based adversarial attack method designed to deceive bot detection models.
arXiv Detail & Related papers (2023-10-11T03:09:48Z)
Backdoor Attacks on Crowd Counting [63.90533357815404]
Crowd counting is a regression task that estimates the number of people in a scene image. In this paper, we investigate the vulnerability of deep learning based crowd counting models to backdoor attacks.
arXiv Detail & Related papers (2022-07-12T16:17:01Z)
Dictionary Attacks on Speaker Verification [15.00667613025837]
We introduce a generic formulation of the attack that can be used with various speech representations and threat models. The attacker uses adversarial optimization to maximize raw similarity of speaker embeddings between a seed speech sample and a proxy population. We show that, combined with multiple attempts, this attack raises even more serious concerns about the security of these systems.
arXiv Detail & Related papers (2022-04-24T15:31:41Z)
Identification of Twitter Bots based on an Explainable ML Framework: the US 2020 Elections Case Study [72.61531092316092]
This paper focuses on the design of a novel system for identifying Twitter bots based on labeled Twitter data. A supervised machine learning (ML) framework is adopted using an Extreme Gradient Boosting (XGBoost) algorithm. Our study also deploys Shapley Additive Explanations (SHAP) for explaining the ML model predictions.
arXiv Detail & Related papers (2021-12-08T14:12:24Z)
Adversarial Attacks on ML Defense Models Competition [82.37504118766452]
The TSAIL group at Tsinghua University and the Alibaba Security group organized this competition. The purpose of this competition is to motivate novel attack algorithms to evaluate adversarial robustness.
arXiv Detail & Related papers (2021-10-15T12:12:41Z)
Multi-granularity Textual Adversarial Attack with Behavior Cloning [4.727534308759158]
We propose MAYA, a Multi-grAnularitY Attack model to generate high-quality adversarial samples with fewer queries to victim models. We conduct comprehensive experiments to evaluate our attack models by attacking BiLSTM, BERT and RoBERTa in two different black-box attack settings and three benchmark datasets.
arXiv Detail & Related papers (2021-09-09T15:46:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.