Evaluating Cultural and Social Awareness of LLM Web Agents
- URL: http://arxiv.org/abs/2410.23252v1
- Date: Wed, 30 Oct 2024 17:35:44 GMT
- Title: Evaluating Cultural and Social Awareness of LLM Web Agents
- Authors: Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu
- Abstract summary: We introduce CASA, a benchmark designed to assess large language models' sensitivity to cultural and social norms.
Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations.
Experiments show that current LLMs perform significantly better in non-agent environments than in web-based agent environments.
- Score: 113.49968423990616
- Abstract: As large language models (LLMs) expand into performing as agents for real-world applications beyond traditional NLP tasks, evaluating their robustness becomes increasingly important. However, existing benchmarks often overlook critical dimensions like cultural and social awareness. To address these gaps, we introduce CASA, a benchmark designed to assess LLM agents' sensitivity to cultural and social norms across two web-based tasks: online shopping and social discussion forums. Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations. Furthermore, we propose a comprehensive evaluation framework that measures awareness coverage, helpfulness in managing user queries, and the violation rate when facing misleading web content. Experiments show that current LLMs perform significantly better in non-agent than in web-based agent environments, with agents achieving less than 10% awareness coverage and over 40% violation rates. To improve performance, we explore two methods, prompting and fine-tuning, and find that combining both offers complementary advantages: fine-tuning on culture-specific datasets significantly enhances the agents' ability to generalize across different regions, while prompting boosts the agents' ability to navigate complex tasks. These findings highlight the importance of continually benchmarking LLM agents' cultural and social awareness throughout the development cycle.
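For concreteness, the three reported metrics could be computed as in the minimal Python sketch below. This is an illustrative reading only, assuming per-episode binary judgments and a scalar helpfulness score; the field names (norm_violation_present, agent_acknowledged_norm, agent_violated_norm, helpfulness_score) are hypothetical and the paper's exact scoring rubric may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    """One judged agent run on a CASA-style task (field names are illustrative)."""
    norm_violation_present: bool   # the query or web observation involves a norm violation
    agent_acknowledged_norm: bool  # the agent explicitly surfaced the cultural/social issue
    agent_violated_norm: bool      # the agent's own response or action violated the norm
    helpfulness_score: float       # judged helpfulness in [0, 1]

def awareness_coverage(episodes: List[Episode]) -> float:
    """Fraction of norm-violating cases in which the agent surfaced the issue."""
    relevant = [e for e in episodes if e.norm_violation_present]
    return sum(e.agent_acknowledged_norm for e in relevant) / len(relevant) if relevant else 0.0

def violation_rate(episodes: List[Episode]) -> float:
    """Fraction of cases in which the agent's own behavior violated a norm."""
    return sum(e.agent_violated_norm for e in episodes) / len(episodes) if episodes else 0.0

def mean_helpfulness(episodes: List[Episode]) -> float:
    """Average judged helpfulness across all episodes."""
    return sum(e.helpfulness_score for e in episodes) / len(episodes) if episodes else 0.0
```

Under this reading, the headline numbers above correspond to awareness_coverage(...) < 0.10 and violation_rate(...) > 0.40 for the web-based agent setting.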
Related papers
- CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments [90.29937153770835]
We introduce CRMArena, a benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments.
We show that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities.
Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments.
arXiv Detail & Related papers (2024-11-04T17:30:51Z)
- SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement [18.84439000902905]
SWE-Search is a multi-agent framework that integrates Monte Carlo Tree Search (MCTS) with a self-improvement mechanism to enhance software agents' performance.
This work highlights the potential of self-evaluation driven search techniques to enhance agent reasoning and planning in complex, dynamic software engineering environments.
arXiv Detail & Related papers (2024-10-26T22:45:56Z)
- Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement [51.601916604301685]
Large language models (LLMs) generate content that can undermine trust in online discourse.
Current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-AI collaboration.
To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content.
arXiv Detail & Related papers (2024-10-18T08:14:10Z)
- RAG-Modulo: Solving Sequential Tasks using Experience, Critics, and Language Models [5.0741409008225755]
Large language models (LLMs) have emerged as promising tools for solving challenging robotic tasks.
Most existing LLM-based agents lack the ability to retain and learn from past interactions.
We propose RAG-Modulo, a framework that enhances LLM-based agents with a memory of past interactions and incorporates critics to evaluate the agents' decisions.
arXiv Detail & Related papers (2024-09-18T20:03:32Z)
- OllaBench: Evaluating LLMs' Reasoning for Human-centric Interdependent Cybersecurity [0.0]
Large Language Models (LLMs) have the potential to enhance Agent-Based Modeling by better representing complex interdependent cybersecurity systems.
Existing evaluation frameworks often overlook the human factor and cognitive computing capabilities essential for interdependent cybersecurity.
I propose OllaBench, a novel evaluation framework that assesses LLMs' accuracy, wastefulness, and consistency in answering scenario-based information security compliance and non-compliance questions.
arXiv Detail & Related papers (2024-06-11T00:35:39Z)
- Towards Generalizable Agents in Text-Based Educational Environments: A Study of Integrating RL with LLMs [22.568925103893182]
We aim to enhance the generalization capabilities of agents in open-ended text-based learning environments by integrating Reinforcement Learning (RL) with Large Language Models (LLMs).
We introduce PharmaSimText, a novel benchmark derived from the PharmaSim virtual pharmacy environment designed for practicing diagnostic conversations.
Our results show that RL-based agents excel in task completion but fall short in asking quality diagnostic questions.
arXiv Detail & Related papers (2024-04-29T14:53:48Z)
- CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge [69.82940934994333]
We introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build a challenging evaluation dataset.
Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions.
CULTURALBENCH-V0.1 is a compact yet high-quality evaluation dataset compiled from users' red-teaming attempts.
arXiv Detail & Related papers (2024-04-10T00:25:09Z)
- AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [76.95062553043607]
Evaluating large language models (LLMs) is essential for understanding their capabilities and facilitating their integration into practical applications.
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
- ExpeL: LLM Agents Are Experiential Learners [60.54312035818746]
We introduce the Experiential Learning (ExpeL) agent to allow learning from agent experiences without requiring parametric updates.
Our agent autonomously gathers experiences and extracts knowledge using natural language from a collection of training tasks.
At inference, the agent recalls its extracted insights and past experiences to make informed decisions.
arXiv Detail & Related papers (2023-08-20T03:03:34Z)
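As a rough illustration of the ExpeL-style recall step summarized above (gather experiences, extract natural-language insights, recall them at inference), here is a minimal sketch. It is not the paper's implementation: the ExperienceStore class, the embed callable, and the cosine-similarity retrieval are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple

class ExperienceStore:
    """Toy memory of (task, insight) pairs for an ExpeL-style agent; details are illustrative."""

    def __init__(self, embed: Callable[[str], List[float]]):
        self.embed = embed  # any text-embedding function supplied by the caller
        self.entries: List[Tuple[List[float], str, str]] = []  # (embedding, task, insight)

    def add(self, task: str, insight: str) -> None:
        """Store a natural-language insight extracted from a past training task."""
        self.entries.append((self.embed(task), task, insight))

    def recall(self, new_task: str, k: int = 3) -> List[str]:
        """Return insights whose source tasks are most similar to the new task."""
        query = self.embed(new_task)
        ranked = sorted(self.entries, key=lambda e: -self._cosine(query, e[0]))
        return [insight for _, _, insight in ranked[:k]]

    @staticmethod
    def _cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0
```

At inference time, the recalled insights would simply be prepended to the agent's prompt, so decisions are informed by past experience without any parametric updates, which is the property the ExpeL summary emphasizes.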