Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models
- URL: http://arxiv.org/abs/2503.06263v1
- Date: Sat, 08 Mar 2025 16:19:13 GMT
- Title: Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models
- Authors: Benjamin Jensen, Ian Reynolds, Yasir Atalan, Michael Garcia, Austin Woo, Anthony Chen, Trevor Howarth,
- Abstract summary: This study presents a novel benchmark designed to evaluate the biases and preferences of seven prominent foundation models.<n>We used 400-expert crafted scenarios to analyze results from our selected models.<n>All models exhibit some degree of country-specific biases, often recommending less escalatory and interventionist actions for China and Russia.
- Score: 2.11457423143017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As national security institutions increasingly integrate Artificial Intelligence (AI) into decision-making and content generation processes, understanding the inherent biases of large language models (LLMs) is crucial. This study presents a novel benchmark designed to evaluate the biases and preferences of seven prominent foundation models-Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, GPT-4o, Gemini 1.5 Pro-002, Mixtral 8x22B, Claude 3.5 Sonnet, and Qwen2 72B-in the context of international relations (IR). We designed a bias discovery study around core topics in IR using 400-expert crafted scenarios to analyze results from our selected models. These scenarios focused on four topical domains including: military escalation, military and humanitarian intervention, cooperative behavior in the international system, and alliance dynamics. Our analysis reveals noteworthy variation among model recommendations based on scenarios designed for the four tested domains. Particularly, Qwen2 72B, Gemini 1.5 Pro-002 and Llama 3.1 8B Instruct models offered significantly more escalatory recommendations than Claude 3.5 Sonnet and GPT-4o models. All models exhibit some degree of country-specific biases, often recommending less escalatory and interventionist actions for China and Russia compared to the United States and the United Kingdom. These findings highlight the necessity for controlled deployment of LLMs in high-stakes environments, emphasizing the need for domain-specific evaluations and model fine-tuning to align with institutional objectives.
Related papers
- IHEval: Evaluating Language Models on Following the Instruction Hierarchy [67.33509094445104]
The instruction hierarchy establishes a priority order from system messages to user messages, conversation history, and tool outputs.
Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models' ability to follow the instruction hierarchy.
We bridge this gap by introducing IHEval, a novel benchmark covering cases where instructions in different priorities either align or conflict.
arXiv Detail & Related papers (2025-02-12T19:35:28Z) - Unraveling the Capabilities of Language Models in News Summarization [0.0]
This work provides a comprehensive benchmarking of 20 recent language models, focusing on smaller ones for the news summarization task.<n>We focus in this study on zero-shot and few-shot learning settings and we apply a robust evaluation methodology.<n>We highlight the exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate due to their advanced capabilities.
arXiv Detail & Related papers (2025-01-30T04:20:16Z) - Large Language Models for Scholarly Ontology Generation: An Extensive Analysis in the Engineering Field [0.0]
This paper offers an analysis of the ability of large models to identify semantic relationships between different research topics.<n>We developed a gold standard based on the IEEE Thesaurus to evaluate the task.<n>Several models have achieved outstanding results, including Mixtral-8x7B, Dolphin-Mistral, and Claude 3-7B.
arXiv Detail & Related papers (2024-12-11T10:11:41Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - ClimateGPT: Towards AI Synthesizing Interdisciplinary Research on
Climate Change [21.827936253363603]
This paper introduces ClimateGPT, a model family of domain-specific large language models that synthesize interdisciplinary research on climate change.
We trained two 7B models from scratch on a science-oriented dataset of 300B tokens.
ClimateGPT-7B, 13B and 70B are continuously pre-trained from Llama2 on a domain-specific dataset of 4.2B tokens.
arXiv Detail & Related papers (2024-01-17T23:29:46Z) - Escalation Risks from Language Models in Military and Diplomatic
Decision-Making [0.0]
This work aims to scrutinize the behavior of multiple AI agents in simulated wargames.
We design a novel wargame simulation and scoring framework to assess the risks of the escalation of actions taken by these agents.
We observe that models tend to develop arms-race dynamics, leading to greater conflict, and in rare cases, even to the deployment of nuclear weapons.
arXiv Detail & Related papers (2024-01-07T07:59:10Z) - MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs)
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z) - Large Language Models in the Workplace: A Case Study on Prompt
Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z) - Holistic Evaluation of Language Models [183.94891340168175]
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood.
We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
arXiv Detail & Related papers (2022-11-16T18:51:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.