Anecdoctoring: Automated Red-Teaming Across Language and Place
- URL: http://arxiv.org/abs/2509.19143v1
- Date: Tue, 23 Sep 2025 15:26:13 GMT
- Title: Anecdoctoring: Automated Red-Teaming Across Language and Place
- Authors: Alejandro Cuevas, Saloni Dash, Bharat Kumar Nayak, Dan Vann, Madeleine I. G. Daepp,
- Abstract summary: "Anecdoctoring" is a novel red-teaming approach that automatically generates adversarial prompts across languages and cultures. We collect misinformation claims from fact-checking websites in three languages and two geographies. Our method produces higher attack success rates and offers interpretability benefits relative to few-shot prompting.
- Score: 38.8362654856964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Disinformation is among the top risks of generative artificial intelligence (AI) misuse. Global adoption of generative AI necessitates red-teaming evaluations (i.e., systematic adversarial probing) that are robust across diverse languages and cultures, but red-teaming datasets are commonly US- and English-centric. To address this gap, we propose "anecdoctoring", a novel red-teaming approach that automatically generates adversarial prompts across languages and cultures. We collect misinformation claims from fact-checking websites in three languages (English, Spanish, and Hindi) and two geographies (US and India). We then cluster individual claims into broader narratives and characterize the resulting clusters with knowledge graphs, with which we augment an attacker LLM. Our method produces higher attack success rates and offers interpretability benefits relative to few-shot prompting. Results underscore the need for disinformation mitigations that scale globally and are grounded in real-world adversarial misuse.
Related papers
- MMATH: A Multilingual Benchmark for Mathematical Reasoning [94.05289799605957]
We introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. We observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models.
arXiv Detail & Related papers (2025-05-25T12:47:39Z) - Multi-Lingual Cyber Threat Detection in Tweets/X Using ML, DL, and LLM: A Comparative Analysis [0.0]
Cyber threat detection has become an important area of focus in today's digital age. This study focuses on multi-lingual tweet cyber threat detection using a variety of advanced models. We collected and labeled tweet datasets in four languages: English, Chinese, Russian, and Arabic.
arXiv Detail & Related papers (2025-02-04T03:46:24Z) - Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages [13.011117871938561]
AI-driven moderation systems struggle with low-resource languages spoken in the Global South. Our findings reveal that beyond data scarcity, socio-political factors such as tech companies' monopoly on user data exacerbate historic inequities. We argue these limitations are not just technical gaps caused by "data scarcity" but reflect structural inequities, rooted in colonial suppression of non-Western languages.
arXiv Detail & Related papers (2025-01-23T17:01:53Z) - Against The Achilles' Heel: A Survey on Red Teaming for Generative Models [60.21722603260243]
Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models.
We have developed the "searcher" framework to unify various automatic red teaming approaches.
arXiv Detail & Related papers (2024-03-31T09:50:39Z) - Towards Red Teaming in Multimodal and Multilingual Translation [7.440772334845366]
This paper presents the first study on human-based red teaming for Machine Translation.
It marks a significant step towards understanding and improving the performance of translation models.
We report lessons learned and provide recommendations for both translation models and red teaming drills.
arXiv Detail & Related papers (2024-01-29T15:49:40Z) - Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive Language Detection [19.399281609371258]
Cross-lingual transfer learning from high-resource to medium and low-resource languages has shown encouraging results.
We resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection.
arXiv Detail & Related papers (2023-11-03T16:51:07Z) - Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned [10.836210010868932]
We investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types.
We release our dataset of 38,961 red team attacks for others to analyze and learn from.
arXiv Detail & Related papers (2022-08-23T23:37:14Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - BERTuit: Understanding Spanish language in Twitter through a native transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource for better understanding Spanish Twitter and for use in applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)