ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour
- URL: http://arxiv.org/abs/2506.12090v1
- Date: Wed, 11 Jun 2025 14:29:43 GMT
- Title: ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour
- Authors: Jack Contro, Simrat Deol, Yulan He, Martim Brandão
- Abstract summary: Large Language Models (LLMs) can be manipulative when explicitly instructed. Small fine-tuned open-source models, such as BERT+BiLSTM, perform comparably to zero-shot classification. Our work highlights the need to address manipulation risks as LLMs are increasingly deployed in consumer-facing applications.
- Score: 11.86454511458083
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces ChatbotManip, a novel dataset for studying manipulation in chatbots. It contains generated conversations between a chatbot and a (simulated) user, where the chatbot is explicitly asked to showcase manipulation tactics, persuade the user towards some goal, or simply be helpful. We consider a diverse set of chatbot manipulation contexts, from consumer and personal advice to citizen advice and controversial proposition argumentation. Each conversation is annotated by human annotators for both general manipulation and specific manipulation tactics. Our research reveals three key findings. First, Large Language Models (LLMs) can be manipulative when explicitly instructed, with annotators identifying manipulation in approximately 84% of such conversations. Second, even when only instructed to be "persuasive" without explicit manipulation prompts, LLMs frequently default to controversial manipulative strategies, particularly gaslighting and fear enhancement. Third, small fine-tuned open-source models, such as BERT+BiLSTM, perform comparably to zero-shot classification with larger models like Gemini 2.5 Pro in detecting manipulation, but are not yet reliable for real-world oversight. Our work provides important insights for AI safety research and highlights the need to address manipulation risks as LLMs are increasingly deployed in consumer-facing applications.
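The detection baseline named above, a BERT encoder feeding a bidirectional LSTM, is a standard architecture. The following is a minimal sketch of that family of models, assuming a binary manipulative/non-manipulative label, the public bert-base-uncased checkpoint, and illustrative hyperparameters; none of these details are taken from the paper.

```python
# Minimal sketch of a BERT+BiLSTM manipulation classifier.
# Assumptions: binary label (manipulative vs. not), bert-base-uncased,
# and illustrative hyperparameters, not the paper's exact setup.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLSTMClassifier(nn.Module):
    def __init__(self, hidden_size: int = 128, num_labels: int = 2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,  # 768 for BERT-base
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )
        self.classifier = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Token-level BERT states feed a BiLSTM; the final forward and
        # backward hidden states are concatenated and classified.
        token_states = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        _, (h_n, _) = self.lstm(token_states)
        pooled = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.classifier(pooled)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertBiLSTMClassifier()
batch = tokenizer(
    ["You'd be foolish not to take this offer today."],
    return_tensors="pt", padding=True, truncation=True,
)
logits = model(batch["input_ids"], batch["attention_mask"])
```

Fine-tuning a model of this shape on the annotated conversations is the kind of setup the abstract compares against zero-shot LLM classification.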
Related papers
- Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models [49.4824734958566]
Chain-of-Modality (CoM) enables Vision Language Models to reason about multimodal human demonstration data. CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt.
arXiv Detail & Related papers (2025-04-17T21:31:23Z)
- RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts [6.0385743836962025]
RICoTA is a Korean red-teaming dataset consisting of 609 prompts that challenge large language models (LLMs). We utilize user-chatbot conversations self-posted on a Korean Reddit-like community. Our dataset will be made publicly available via GitHub.
arXiv Detail & Related papers (2025-01-29T15:32:27Z)
- Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards [93.16294577018482]
Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models. We show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes. Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than 95% accuracy; then, the attacker can use this information to consistently vote against a target model.
arXiv Detail & Related papers (2025-01-13T17:12:38Z)
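To see why roughly a thousand targeted votes can suffice, consider a toy simulation of the Elo-style rating update used by voting leaderboards. Everything below (ratings, K-factor, pairing) is invented for illustration, and the reply-deanonymization step of the actual attack is not modeled.

```python
# Toy illustration: repeated adversarial votes against one model shift
# an Elo-style leaderboard. Ratings, K-factor, and vote counts are made
# up for illustration; this is not the paper's exact attack.
K = 4  # update step size (K-factor)

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

ratings = {"target": 1250.0, "rival": 1240.0}

# The attacker votes against "target" every time it is paired with
# "rival" (in the real attack, the attacker must first identify which
# anonymous reply came from the target model).
for _ in range(1000):
    e_t = expected(ratings["target"], ratings["rival"])
    e_r = 1.0 - e_t  # expected score of the rival in the same pairing
    ratings["target"] += K * (0.0 - e_t)  # vote recorded against target
    ratings["rival"] += K * (1.0 - e_r)   # vote recorded for rival

print(ratings)  # the two models have swapped leaderboard positions
```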
- LLM Roleplay: Simulating Human-Chatbot Interaction [52.03241266241294]
We propose a goal-oriented, persona-based method to automatically generate diverse multi-turn dialogues simulating human-chatbot interaction.
Our method can simulate human-chatbot dialogues with a high indistinguishability rate.
arXiv Detail & Related papers (2024-07-04T14:49:46Z)
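A goal-oriented, persona-based simulation of this kind can be sketched as two alternating LLM roles. In the sketch below, call_llm is a hypothetical stand-in for any chat-completion API, and the persona and goal strings are invented examples, not taken from the paper.

```python
# Sketch of goal-oriented, persona-based dialogue simulation in the
# spirit of LLM Roleplay. call_llm is a hypothetical stand-in for any
# chat-completion API; persona and goal strings are invented examples.
def call_llm(system: str, history: list[dict]) -> str:
    # Replace this canned reply with a real chat-completion call.
    return f"[utterance generated under: {system[:40]}...]"

USER_PERSONA = (
    "You are a 67-year-old retiree who distrusts technology. "
    "Goal: get help setting up online banking. Write the user's next turn."
)
CHATBOT_ROLE = "You are a helpful customer-support chatbot."

def simulate_dialogue(turns: int = 4) -> list[dict]:
    history: list[dict] = []
    for _ in range(turns):
        # The persona model writes the simulated user's next utterance
        # (a full implementation would flip roles in the history it sees).
        history.append({"role": "user",
                        "content": call_llm(USER_PERSONA, history)})
        # The chatbot model replies to the conversation so far.
        history.append({"role": "assistant",
                        "content": call_llm(CHATBOT_ROLE, history)})
    return history

print(simulate_dialogue(turns=1))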
- Zero-shot Persuasive Chatbots with LLM-Generated Strategies and Information Retrieval [7.925754291635035]
Persuasion plays a pivotal role in a wide range of applications, from health interventions to the promotion of social good.
Persuasive chatbots employed responsibly for social good can enable positive individual and social change.
We propose PersuaBot, a zero-shot chatbot based on Large Language Models (LLMs) that is factual and more persuasive by leveraging more nuanced strategies.
Our experiments on simulated and human conversations show that our zero-shot approach is more persuasive than prior work, while achieving factual accuracy surpassing state-of-the-art knowledge-oriented chatbots.
arXiv Detail & Related papers (2024-07-04T02:28:21Z)
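One plausible reading of the recipe in the title, LLM-generated strategies plus information retrieval, is a pipeline that picks a strategy, retrieves supporting evidence, and composes a grounded reply. The sketch below uses hypothetical call_llm and retrieve stand-ins and does not reproduce the paper's actual prompts or retriever.

```python
# One plausible strategy-then-retrieve pipeline; call_llm and retrieve
# are hypothetical stand-ins, not the paper's prompts or retriever.
def call_llm(prompt: str) -> str:
    return "[LLM output for: " + prompt.splitlines()[0] + "]"

def retrieve(query: str, k: int = 3) -> list[str]:
    # Stand-in for a search/retrieval backend returning supporting facts.
    return [f"[retrieved fact {i} for '{query}']" for i in range(k)]

def persuasive_reply(user_msg: str, goal: str) -> str:
    # 1. Ask the LLM to choose a persuasion strategy for this turn.
    strategy = call_llm(
        f"Goal: {goal}. User said: {user_msg}. "
        "Name one appropriate, honest persuasion strategy."
    )
    # 2. Ground the reply in retrieved evidence to keep it factual.
    evidence = "\n".join(retrieve(f"{goal} {user_msg}"))
    # 3. Compose the final reply from the strategy and the evidence.
    return call_llm(
        f"Strategy: {strategy}\nEvidence:\n{evidence}\n"
        f"Write a persuasive, factually grounded reply to: {user_msg}"
    )

print(persuasive_reply("I'm not sure vaccines are worth it.",
                       "encourage flu vaccination"))
```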
- Measuring and Controlling Instruction (In)Stability in Language Model Dialogs [72.38330196290119]
System prompting is a tool for customizing language-model chatbots, enabling them to follow a specific instruction.
We propose a benchmark to test this assumption, evaluating instruction stability via self-chats.
We reveal significant instruction drift within eight rounds of conversation.
We propose a lightweight method called split-softmax, which compares favorably against two strong baselines.
arXiv Detail & Related papers (2024-02-13T20:10:29Z)
- Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
- A Deep Learning Approach to Integrate Human-Level Understanding in a Chatbot [0.4632366780742501]
Unlike humans, chatbots can serve multiple customers at a time, are available 24/7, and reply within a fraction of a second.
We performed sentiment analysis, emotion detection, intent classification and named-entity recognition using deep learning to develop chatbots with humanistic understanding and intelligence.
arXiv Detail & Related papers (2021-12-31T22:26:41Z)
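The four components listed above map onto off-the-shelf tooling. A minimal sketch using Hugging Face pipelines follows; the default and public checkpoints are illustrative choices, not the models trained in the paper.

```python
# Minimal sketch of the four analysis components using Hugging Face
# pipelines. The checkpoints are illustrative public choices, not the
# models trained in the paper.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
ner = pipeline("ner", aggregation_strategy="simple")
intent = pipeline("zero-shot-classification")  # intent via candidate labels
emotion = pipeline(  # a popular public emotion checkpoint (assumption)
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
)

msg = "I'm really upset -- my order from Acme never arrived."

print(sentiment(msg))   # e.g. [{'label': 'NEGATIVE', 'score': ...}]
print(emotion(msg))     # e.g. anger/sadness scores
print(ner(msg))         # entity spans such as the organisation 'Acme'
print(intent(msg, candidate_labels=["complaint", "refund request", "greeting"]))
```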
- Put Chatbot into Its Interlocutor's Shoes: New Framework to Learn Chatbot Responding with Intention [55.77218465471519]
This paper proposes an innovative framework to train chatbots to possess human-like intentions.
Our framework includes a guiding robot and an interlocutor model that plays the role of a human.
We examined our framework using three experimental setups and evaluated the guiding robot with four different metrics to demonstrate its flexibility and performance advantages.
arXiv Detail & Related papers (2021-03-30T15:24:37Z)
- Personalized Chatbot Trustworthiness Ratings [19.537492400265577]
We envision a personalized rating methodology for chatbots that relies on separate rating modules for each trust issue.
The method is independent of the specific trust issues and is parametric to the aggregation procedure.
arXiv Detail & Related papers (2020-05-13T22:42:45Z)