MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games
- URL: http://arxiv.org/abs/2602.24188v1
- Date: Fri, 27 Feb 2026 17:13:20 GMT
- Title: MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games
- Authors: Jacob Eisenstein, Fantine Huot, Adam Fisch, Jonathan Berant, Mirella Lapata,
- Abstract summary: We evaluate language models in multi-turn interactions using a suite of collaborative games that require effective communication about private information. We find that language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario. We analyze the linguistic features of these dialogues, assessing the roles of sycophancy, information density, and discourse coherence.
- Score: 70.37904949359938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective communication about private information. This enables an interactive scaling analysis, in which a fixed token budget is divided over a variable number of turns. We find that in many cases, language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario in which one agent attempts to summarize its information and the other agent immediately acts -- despite substantial headroom. This suggests that state-of-the-art models still suffer from significant weaknesses in planning and executing multi-turn collaborative conversations. We analyze the linguistic features of these dialogues, assessing the roles of sycophancy, information density, and discourse coherence. While there is no single linguistic explanation for the collaborative weaknesses of contemporary language models, we note that humans achieve comparable task success at superior token efficiency by producing dialogues that are more coherent than those produced by most language models. The proactive management of private information is a defining feature of real-world communication, and we hope that MT-PingEval will drive further work towards improving this capability.
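To make the evaluation setup concrete, here is a minimal sketch (not the authors' released code) of a private-information game played under a fixed token budget split over a variable number of turns; `generate` and `score_outcome` are hypothetical stand-ins for a language-model call and a task-specific scorer.

```python
# Minimal sketch of the interactive scaling setup described in the abstract:
# two agents with private information exchange messages under a fixed token
# budget divided over a variable number of turns. `generate(prompt, max_tokens)`
# and `score_outcome(...)` are hypothetical stand-ins, not the paper's API.

def run_private_info_game(private_a, private_b, total_token_budget, num_turns,
                          generate, score_outcome):
    """Play one collaborative game and return its task score."""
    per_turn_budget = total_token_budget // num_turns
    transcript = []
    for turn in range(num_turns):
        speaker = "A" if turn % 2 == 0 else "B"
        private = private_a if speaker == "A" else private_b
        prompt = (
            f"You are agent {speaker}. Your private information:\n{private}\n\n"
            "Dialogue so far:\n" + "\n".join(transcript) + "\n\n"
            f"Reply in at most {per_turn_budget} tokens."
        )
        transcript.append(f"{speaker}: {generate(prompt, max_tokens=per_turn_budget)}")
    # After the dialogue, the acting agent commits to a final action, which the
    # scorer evaluates against both agents' private information.
    return score_outcome(transcript, private_a, private_b)

# The non-interactive baseline is the num_turns=1 case: agent A spends the whole
# budget summarizing its information and agent B immediately acts on the summary.
```

Sweeping `num_turns` while holding `total_token_budget` fixed yields the interactive scaling curve that the paper compares against the single-turn baseline.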
Related papers
- LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation [17.584631586928815]
We propose a linguistically grounded game-theoretic paradigm for multi-agent dialogue generation. Our framework relies on linguistically informed reasoning with minimal task-specific coupling. We evaluate our framework in simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency.
arXiv Detail & Related papers (2026-01-08T02:30:43Z) - Analyzing and Improving Cross-lingual Knowledge Transfer for Machine Translation [5.878901309908815]
We study cross-lingual knowledge transfer in neural models and develop methods to improve robustness and generalization in multilingual settings. We examine the role of language diversity during training and show that increasing translation coverage improves generalization and reduces off-target behavior.
arXiv Detail & Related papers (2026-01-07T15:51:54Z) - Aligning Spoken Dialogue Models from User Interactions [55.192134724622235]
We propose a novel preference alignment framework to improve spoken dialogue models on real-time conversations from user interactions. We create a dataset of more than 150,000 preference pairs from raw multi-turn speech conversations annotated with AI feedback. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
arXiv Detail & Related papers (2025-06-26T16:45:20Z) - Towards Developmentally Plausible Rewards: Communicative Success as a Learning Signal for Interactive Language Models [49.22720751953838]
We propose a method for training language models in an interactive setting inspired by child language acquisition. In our setting, a speaker attempts to communicate some information to a listener in a single-turn dialogue and receives a reward if communicative success is achieved.
arXiv Detail & Related papers (2025-05-09T11:48:36Z) - From Intents to Conversations: Generating Intent-Driven Dialogues with Contrastive Learning for Multi-Turn Classification [21.6988262735281]
Chain-of-Intent is a novel framework that integrates Hidden Markov Models with Large Language Models to generate intent-driven, context-aware dialogues. MINT-CL is a contrastive learning framework for multi-turn intent classification, which improves performance while reducing dependence on large-scale annotated datasets.
arXiv Detail & Related papers (2024-11-21T15:59:29Z) - A Comparative Analysis of Conversational Large Language Models in Knowledge-Based Text Generation [5.661396828160973]
We conduct an empirical analysis of conversational large language models in generating natural language text from semantic triples.
We compare four large language models of varying sizes with different prompting techniques.
Our findings show that the capabilities of large language models in triple verbalization can be significantly improved through few-shot prompting, post-processing, and efficient fine-tuning techniques.
arXiv Detail & Related papers (2024-02-02T15:26:39Z) - Pre-training Multi-party Dialogue Models with Latent Discourse Inference [85.9683181507206]
We pre-train a model that understands the discourse structure of multi-party dialogues, namely, to whom each utterance is replying.
To fully utilize the unlabeled data, we propose to treat the discourse structures as latent variables, then jointly infer them and pre-train the discourse-aware model.
arXiv Detail & Related papers (2023-05-24T14:06:27Z) - Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z) - A Short Survey of Pre-trained Language Models for Conversational AI-A New Age in NLP [17.10418053437171]
Recently introduced pre-trained language models have the potential to address the issue of data scarcity.
These models have been shown to capture different facets of language such as hierarchical relations, long-term dependencies, and sentiment.
This paper intends to establish whether these pre-trained models can overcome the challenges pertinent to dialogue systems.
arXiv Detail & Related papers (2021-04-22T01:00:56Z) - TOD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogue [113.45485470103762]
In this work, we unify nine human-human and multi-turn task-oriented dialogue datasets for language modeling.
To better model dialogue behavior during pre-training, we incorporate user and system tokens into the masked language modeling.
arXiv Detail & Related papers (2020-04-15T04:09:05Z)
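The TOD-BERT entry above describes continuing masked language modeling over dialogues flattened with added user and system tokens. Below is a minimal sketch of that idea, assuming Hugging Face Transformers; the token names "[USR]" and "[SYS]", the base checkpoint, and the toy dialogue are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch (not the TOD-BERT release) of masked language modeling over a
# dialogue flattened with speaker tokens. "[USR]"/"[SYS]" and bert-base-uncased
# are illustrative assumptions, not the paper's exact setup.
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[USR]", "[SYS]"]})

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))  # make room for the new speaker tokens

def flatten_dialogue(turns):
    """Join (speaker, utterance) pairs into one string prefixed with speaker tokens."""
    return " ".join(
        ("[USR] " if speaker == "user" else "[SYS] ") + utterance
        for speaker, utterance in turns
    )

dialogue = [("user", "book a table for two tonight"),
            ("system", "sure, which restaurant do you prefer?")]

# Standard MLM masking; the speaker tokens are trained like any other token.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
batch = collator([tokenizer(flatten_dialogue(dialogue))])
loss = model(**batch).loss  # masked-LM loss for this toy example
```

In the paper itself this objective is run over nine unified task-oriented dialogue corpora; the sketch only shows how the speaker tokens enter the masked language modeling step.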