Related papers: Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models

Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models

URL: http://arxiv.org/abs/2511.10691v1
Date: Wed, 12 Nov 2025 06:06:29 GMT
Title: Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models
Authors: Zijian Chen, Wenjun Zhang, Guangtao Zhai,
Abstract summary: We introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings.<n>We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios.
Score: 57.33350664910483
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Contemporary benchmarks are struggling to keep pace with the development of large language models (LLMs). Although they are indispensable to evaluate model performance on various tasks, it is uncertain whether the models trained on Internet data have genuinely learned how to solve problems or merely seen the questions before. This potential data contamination issue presents a fundamental challenge to establishing trustworthy evaluation frameworks. Meanwhile, existing benchmarks predominantly assume benign, resource-rich settings, leaving the behavior of LLMs under pressure unexplored. In this paper, we introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings elaborated to evaluate LLMs through interactive gameplay against other LLM opponents. Notably, Squid Game consists of six elimination-style levels, focusing on multi-faceted abilities, such as instruction-following, code, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios. We observe a clear generational phase transition on performance in the same model lineage and find evidence that some models resort to speculative shortcuts to win the game, indicating the possibility of higher-level evaluation paradigm contamination in static benchmarks. Furthermore, we compare prominent LLM benchmarks and Squid Game with correlation analyses, highlighting that dynamic evaluation can serve as a complementary part for static evaluations. The code and data will be released in the future.

Related papers

Do Reasoning Models Ask Better Questions? A Formal Information-Theoretic Analysis on Multi-Turn LLM Games [0.0]
Large Language Models (LLMs) excel at many tasks but struggle with a critical ability for resolving ambiguity in user requests.<n>We propose a multi-turn dialogue framework that quantitatively measures how effectively LLMs gather information through yes/no questions.<n>Our experiments demonstrate that, among the evaluated models, the ones with explicit reasoning capabilities achieve higher IG per turn and reach solutions in fewer steps.
arXiv Detail & Related papers (2026-01-25T06:38:15Z)
BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors [9.224594551677374]
Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making.<n>Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools.<n>Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement.
arXiv Detail & Related papers (2026-01-22T13:15:08Z)
What-If Analysis of Large Language Models: Explore the Game World Using Proactive Thinking [50.72154186522052]
Large language models (LLMs) excel at processing information reactively but lack the ability to systemically explore hypothetical futures.<n>We propose WiA-LLM, a new paradigm that equips LLMs with proactive thinking capabilities.<n>We validate WiA-LLM in Honor of Kings, a complex multiplayer game environment.
arXiv Detail & Related papers (2025-09-05T04:05:27Z)
KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation [78.96590724864606]
We introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium.<n>KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios.
arXiv Detail & Related papers (2025-05-20T16:06:32Z)
If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs)<n>Our fact checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.<n>LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.<n>Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing [39.6507632134755]
We propose GETA, a novel generative evolving testing approach based on adaptive testing methods in measurement theory.<n> GETA probes the underlying moral boundaries of LLMs by dynamically generating test items tailored to model capability.<n> GETA co-evolves with LLMs by learning a joint distribution of item difficulty and model value conformity.
arXiv Detail & Related papers (2024-06-20T11:51:00Z)
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models [28.305932427801682]
We present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of VidLMs on a firm footing. ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images.
arXiv Detail & Related papers (2023-11-13T02:13:13Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.