Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
- URL: http://arxiv.org/abs/2603.01423v1
- Date: Mon, 02 Mar 2026 03:59:40 GMT
- Title: Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
- Authors: Jiyoon Myung,
- Abstract summary: Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations. We conduct a systematic evaluation of conversational reliability through three representative tasks. We observe substantial declines in reliability, particularly for smaller models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction challenges: (1) maintaining global constraints across topic shifts, (2) selecting the correct tool or agent amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task pairs single-turn and multi-turn settings, allowing us to quantify reliability degradation under extended dialogue. Across both commercial and open-source models, we observe substantial declines in reliability, particularly for smaller models. Error analyses reveal recurring failure modes such as instruction drift, intent confusion, and contextual overwriting, which compromise dependable behavior in operational systems. Our findings highlight the need for stress-testing LLMs for conversational reliability and developing more robust evaluation methods for trustworthy deployment.
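The abstract's pairing of single-turn and multi-turn settings suggests a simple way to express reliability degradation. The sketch below is an illustrative assumption, not the paper's actual metric or code; the function name and sample data are invented for demonstration.

```python
# Illustrative sketch (not the paper's code): reliability degradation as the
# relative accuracy drop between paired single-turn and multi-turn settings.

def reliability_degradation(single_turn: list[bool], multi_turn: list[bool]) -> float:
    """Relative drop in task accuracy when moving from single- to multi-turn."""
    acc_single = sum(single_turn) / len(single_turn)
    acc_multi = sum(multi_turn) / len(multi_turn)
    return (acc_single - acc_multi) / acc_single

# Example: a model that solves 9/10 cases single-turn but only 6/10 multi-turn
single = [True] * 9 + [False]
multi = [True] * 6 + [False] * 4
print(round(reliability_degradation(single, multi), 3))  # 0.333
```

A relative (rather than absolute) drop makes models with different single-turn baselines comparable, which matters when contrasting commercial and open-source models as the abstract does.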
Related papers
- Confidence Estimation for LLMs in Multi-turn Interactions [48.081802290688394]
This work presents the first systematic study of confidence estimation in multi-turn interactions. We establish a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.
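The two desiderata named in this summary can be made concrete with a small sketch. This is a hypothetical illustration, not the cited paper's framework: the function names, the aggregate calibration gap, and the non-increasing convention for confidence are all assumptions.

```python
# Hypothetical checks for the two desiderata: calibration and monotonicity.

def calibration_gap(confidences: list[float], correct: list[bool]) -> float:
    """Gap between mean stated confidence and empirical accuracy (aggregate proxy)."""
    return abs(sum(confidences) / len(confidences) - sum(correct) / len(correct))

def is_monotone_nonincreasing(trajectory: list[float]) -> bool:
    """Confidence should not rise across turns while a problem stays unresolved."""
    return all(a >= b for a, b in zip(trajectory, trajectory[1:]))

gap = calibration_gap([0.9, 0.8, 0.7], [True, True, False])
print(round(gap, 3))  # 0.133
print(is_monotone_nonincreasing([0.9, 0.8, 0.8, 0.6]))  # True
```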
arXiv Detail & Related papers (2026-01-05T14:58:04Z)
- Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents [0.4666493857924358]
Multi-turn tool-calling LLMs have emerged as a key feature in modern AI assistants. Implementing multi-turn pipelines remains difficult for many safety-critical industries. There is still a lack of visibility into multi-turn conversation-level robustness.
arXiv Detail & Related papers (2025-11-29T05:44:37Z)
- Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation [60.63465682731118]
The performance of egocentric AI agents is fundamentally limited by multimodal intent ambiguity. We introduce the Plug-and-Play Clarifier, a zero-shot and modular framework that decomposes the problem into discrete, solvable sub-tasks. Our framework improves the intent clarification performance of small language models by approximately 30%, making them competitive with significantly larger counterparts.
arXiv Detail & Related papers (2025-11-12T04:28:14Z)
- Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection [71.8243083897721]
Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability. We present a novel framework that leverages the model's self-consistency between long responses and short answers to generate preference pairs for training.
arXiv Detail & Related papers (2025-09-27T10:37:11Z)
- Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs [21.192619293355502]
Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios.
arXiv Detail & Related papers (2025-08-13T19:14:45Z)
- Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding [59.50808215134678]
This study introduces Trust-videoLLMs, the first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs. Results reveal significant limitations in dynamic scene comprehension, cross-modal resilience, and real-world risk mitigation.
arXiv Detail & Related papers (2025-06-14T04:04:54Z)
- Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models [15.158475816860427]
Uncertainty is essential for assessing the reliability and trustworthiness of modern AI systems. Verbalized uncertainty, where models express their confidence through natural language, has emerged as a lightweight and interpretable solution. However, its effectiveness in vision-language models (VLMs) remains insufficiently studied.
arXiv Detail & Related papers (2025-05-26T17:16:36Z)
- MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models [51.19622266249408]
MultiTrust is the first comprehensive and unified benchmark on the trustworthiness of MLLMs. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks.
arXiv Detail & Related papers (2024-06-11T08:38:13Z)
- Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization [73.89239820192894]
We argue that automated counterfactual generation should regard several aspects of the produced adversarial instances.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.