Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models
- URL: http://arxiv.org/abs/2506.10504v1
- Date: Thu, 12 Jun 2025 09:04:19 GMT
- Title: Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models
- Authors: Sangmin Song, Juhwan Choi, JungMin Yun, YoungBin Kim
- Abstract summary: Large language models (LLMs) have demonstrated remarkable performance in zero-shot dialogue state tracking (DST). In this study, we assess the robustness of LLMs in multi-user DST while minimizing dataset construction costs.
- Score: 7.5972186611957815
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable performance in zero-shot dialogue state tracking (DST), reducing the need for task-specific training. However, conventional DST benchmarks primarily focus on structured user-agent conversations, failing to capture the complexities of real-world multi-user interactions. In this study, we assess the robustness of LLMs in multi-user DST while minimizing dataset construction costs. Inspired by recent advances in LLM-based data annotation, we extend an existing DST dataset by generating utterances of a second user based on speech act theory. Our methodology systematically incorporates a second user's utterances into conversations, enabling a controlled evaluation of LLMs in multi-user settings. Experimental results reveal a significant performance drop compared to single-user DST, highlighting the limitations of current LLMs in extracting and tracking dialogue states amidst multiple speakers. Our findings emphasize the need for future research to enhance LLMs for multi-user DST scenarios, paving the way for more realistic and robust DST models.
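To make the methodology concrete, here is a minimal sketch of the kind of extension pipeline the abstract describes: an LLM generates a second user's utterance for a chosen speech act, and the new turn is spliced into an existing single-user DST dialogue. The OpenAI-style client, the model name, the speech-act label set, and the helper names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the dataset-extension idea: given an existing single-user
# DST dialogue, ask an LLM to generate a second user's utterance for a chosen
# speech act, then splice it into the conversation. Speech-act labels and
# prompt wording are illustrative assumptions, not the paper's exact setup.
import random

from openai import OpenAI

client = OpenAI()

# A small speech-act taxonomy (assumed; the paper grounds its categories in
# speech act theory, but its exact label set may differ).
SPEECH_ACTS = ["agreement", "disagreement", "question", "suggestion"]

def generate_second_user_utterance(dialogue_history: list[str], speech_act: str) -> str:
    """Ask the LLM for a plausible second-user turn realizing one speech act."""
    history = "\n".join(dialogue_history)
    prompt = (
        "The following is a conversation between USER1 and an AGENT.\n"
        f"{history}\n\n"
        f"Write one short utterance from a second user (USER2) that expresses "
        f"a '{speech_act}' speech act toward USER1's last request."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def extend_dialogue(dialogue_history: list[str]) -> list[str]:
    """Insert a generated USER2 turn after USER1's most recent utterance."""
    act = random.choice(SPEECH_ACTS)
    utterance = generate_second_user_utterance(dialogue_history, act)
    return dialogue_history + [f"USER2: {utterance}"]
```

The extended dialogues can then be scored with the same zero-shot DST prompt used in the single-user setting, so the presence of the second speaker is the only variable under test.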
Related papers
- IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on fewer than 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
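The judging rule summarized in the IDA-Bench entry above (final numerical output versus a human-derived baseline) reduces to a tolerance comparison. The sketch below is an assumption about how such a check might look, not the benchmark's actual code; the tolerance value is invented for illustration.

```python
# Hypothetical scoring rule in the spirit of IDA-Bench's description: an agent
# passes a task if its final numerical answer matches the human-derived
# baseline within a relative tolerance (the tolerance is assumed, not the
# benchmark's actual setting).
import math

def task_success(agent_output: float, human_baseline: float, rel_tol: float = 1e-2) -> bool:
    """True if the agent's final number matches the human baseline."""
    return math.isclose(agent_output, human_baseline, rel_tol=rel_tol)

def success_rate(results: list[tuple[float, float]]) -> float:
    """Fraction of (agent_output, human_baseline) pairs that pass."""
    return sum(task_success(a, h) for a, h in results) / len(results)
```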
- LLMs Get Lost In Multi-Turn Conversation [44.26588510453331]
Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange.
arXiv Detail & Related papers (2025-05-09T15:21:44Z)
- Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models [0.0]
We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user interaction.
We switch context regularly to interleave the tasks, constructing a realistic testing scenario in which we assess the agents' Long-Term Memory, Continual Learning, and Information Integration capabilities.
arXiv Detail & Related papers (2024-09-30T12:01:29Z)
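The regular context switching this benchmark describes could be produced by something like the sketch below; the switching policy and example tasks are illustrative assumptions, not the benchmark's actual harness.

```python
# Illustrative sketch: interleave several long-running tasks into one
# continuous conversation by switching context between turns, so earlier
# task state must be carried across gaps (stressing long-term memory).
import random
from collections import deque

def interleave_tasks(task_turns: dict[str, list[str]], seed: int = 0) -> list[tuple[str, str]]:
    """Merge per-task turn lists into one session with regular context switches."""
    rng = random.Random(seed)
    queues = {name: deque(turns) for name, turns in task_turns.items()}
    session = []
    while any(queues.values()):
        task = rng.choice([name for name, q in queues.items() if q])
        session.append((task, queues[task].popleft()))
    return session

# Example: two tasks whose turns end up interleaved in a single interaction.
session = interleave_tasks({
    "trip_planning": ["Book a flight to Oslo.", "Actually, make it a window seat."],
    "shopping_list": ["Add milk to my list.", "What is on my list so far?"],
})
```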
- Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models [66.24055500785657]
Traditional turn-based chat systems prevent users from verbally interacting with the system while it is generating responses.
To overcome these limitations, we adapt existing LLMs to listen to users while generating output and to provide users with instant feedback.
We build a dataset consisting of alternating time slices of queries and responses, covering typical feedback types in instantaneous interactions.
arXiv Detail & Related papers (2024-06-22T03:20:10Z)
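The "alternating time slices" format mentioned in the duplex-model entry above might be represented as follows; the field names, slice granularity, and example turns are assumptions for illustration.

```python
# Sketch of a time-sliced dialogue: the conversation is stored as short
# alternating slices in which either party may speak, so user feedback can
# arrive while the model is still mid-response.
from dataclasses import dataclass

@dataclass
class TimeSlice:
    speaker: str  # "user" or "model"
    text: str     # possibly partial utterance produced during this slice

dialogue = [
    TimeSlice("user", "Tell me about the history of"),
    TimeSlice("model", "Sure - which country's history do you"),
    TimeSlice("user", "Norway, actually."),  # instant feedback mid-generation
    TimeSlice("model", "Norway's recorded history begins with..."),
]

def render(slices: list[TimeSlice]) -> str:
    """Flatten the slices into a training-ready transcript."""
    return "\n".join(f"[{s.speaker}] {s.text}" for s in slices)
```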
- Aligning Language Models with Demonstrated Feedback [58.834937450242975]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors. We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z)
- Synergizing In-context Learning with Hints for End-to-end Task-oriented Dialog Systems [25.14460456391397]
Large language model (LLM) based task-oriented dialogue (TOD) systems can excel even with limited data due to their ability to learn tasks through in-context exemplars.
We propose SyncTOD that synergizes LLMs with task-specific hints to improve alignment in low-data settings.
With ChatGPT, SyncTOD achieves superior performance compared to LLM-based baselines and SoTA models in low-data settings.
arXiv Detail & Related papers (2024-05-24T14:13:54Z)
- Enhancing Dialogue State Tracking Models through LLM-backed User-Agents Simulation [12.93942316816741]
GPT-4 is used to simulate the user and agent interaction, generating thousands of annotated dialogues with DST labels.
Two-stage fine-tuning of LLaMA 2 is then performed, first on the generated data and then on the real data, for DST prediction.
Our approach is also capable of adapting to the dynamic demands in real-world scenarios, generating dialogues in new domains swiftly.
arXiv Detail & Related papers (2024-05-17T07:00:05Z)
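A minimal sketch of the user-agent simulation loop described above, assuming an OpenAI-style chat API: one model plays the user, one plays the agent, and each exchange is annotated with a dialogue state. The prompts, goal format, and annotation step are illustrative, not the paper's exact pipeline.

```python
# Hypothetical GPT-4-backed simulation: alternate user and agent turns, and
# label the transcript with slot-value pairs after each user turn. Prompt
# wording and the state format are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def llm(system: str, history: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",  # the summary names GPT-4 as the simulator
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": history}],
    )
    return resp.choices[0].message.content.strip()

def simulate_dialogue(user_goal: str, num_turns: int = 4) -> list[dict]:
    """Generate a DST-labeled dialogue by role-playing both sides."""
    transcript, labeled = "", []
    for _ in range(num_turns):
        user = llm(f"You are a user with this goal: {user_goal}. Speak one turn.", transcript)
        transcript += f"\nUSER: {user}"
        state = llm("List the dialogue state so far as 'domain-slot=value' pairs.", transcript)
        agent = llm("You are a helpful task-oriented agent. Reply in one turn.", transcript)
        transcript += f"\nAGENT: {agent}"
        labeled.append({"user": user, "agent": agent, "state": state})
    return labeled
```

Dialogues produced this way would feed the first fine-tuning stage, with the real data used in the second.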
- Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO), which uses an LLM as an annotation-free user simulator to assess dialogue responses and combines it with a smaller fine-tuned end-to-end TOD model.
Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z)
- In-Context Learning for Few-Shot Dialogue State Tracking [55.91832381893181]
We propose an in-context (IC) learning framework for few-shot dialogue state tracking (DST).
A large pre-trained language model (LM) takes a test instance and a few annotated examples as input, and directly decodes the dialogue states without any parameter updates.
This makes the LM more flexible and scalable compared to prior few-shot DST work when adapting to new domains and scenarios.
arXiv Detail & Related papers (2022-03-16T11:58:24Z)
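The in-context setup above amounts to concatenating a few annotated dialogues with the test dialogue into one prompt and letting a frozen LM complete the state. The textual format below is an illustrative assumption rather than the paper's exact template.

```python
# Sketch of in-context DST: demonstrations of (dialogue, gold state) pairs are
# concatenated ahead of the test dialogue; the frozen LM's completion is then
# parsed as the predicted dialogue state, with no parameter updates.

def build_ic_dst_prompt(examples: list[tuple[str, str]], test_dialogue: str) -> str:
    """examples: (dialogue, gold dialogue state) demonstration pairs."""
    parts = []
    for dialogue, state in examples:
        parts.append(f"Dialogue:\n{dialogue}\nDialogue state: {state}\n")
    parts.append(f"Dialogue:\n{test_dialogue}\nDialogue state:")
    return "\n".join(parts)

prompt = build_ic_dst_prompt(
    examples=[(
        "USER: I need a cheap hotel in the north.",
        "hotel-pricerange=cheap, hotel-area=north",
    )],
    test_dialogue="USER: Find me an expensive restaurant downtown.",
)
# Feeding `prompt` to the LM and parsing its completion yields the prediction.
```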
- Prompt Learning for Few-Shot Dialogue State Tracking [75.50701890035154]
This paper focuses on how to learn a dialogue state tracking (DST) model efficiently with limited labeled data.
We design a prompt learning framework for few-shot DST, which consists of two main components: a value-based prompt and an inverse prompt mechanism.
Experiments show that our model can generate unseen slots and outperforms existing state-of-the-art few-shot methods.
arXiv Detail & Related papers (2022-01-15T07:37:33Z)