Related papers: MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

URL: http://arxiv.org/abs/2511.12586v1
Date: Sun, 16 Nov 2025 13:08:03 GMT
Title: MMWOZ: Building Multimodal Agent for Task-oriented Dialogue
Authors: Pu-Hai Yang, Heyan Huang, Heng-Da Xu, Fanshu Sun, Xian-Ling Mao, Chaoxu Mu
Abstract summary: We develop a new multimodal dialogue dataset that is extended from the MultiWOZ 2.3 dataset. We propose a novel multimodal model called MATE as the baseline model for the MMWOZ dataset.
Score: 61.816787158531874
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that is extended from the MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the original dataset into operation instructions for the GUI. Lastly, we collect snapshots of the web pages along with their corresponding operation instructions. In addition, we propose a novel multimodal model called MATE (Multimodal Agent for Task-oriEnted dialogue) as the baseline model for the MMWOZ dataset. Furthermore, we conduct comprehensive experimental analysis using MATE to investigate the construction of a practical multimodal agent for task-oriented dialogue.
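
The conversion step mentioned in the abstract (dialogue states turned into GUI operation instructions) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' script: the `Operation` type, the `click` and `type` action names, and the element identifiers such as `tab-hotel` are all hypothetical, and the actual MMWOZ instruction format is not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One GUI operation. The schema is hypothetical, not the MMWOZ format."""
    action: str      # assumed action name: "click" or "type"
    target: str      # assumed GUI element identifier
    value: str = ""  # text to enter; empty for clicks


def state_to_operations(domain, prev_state, new_state):
    """Turn the slots that changed in this turn into GUI operations."""
    # Only slots whose value differs from the previous turn need new input.
    changed = {s: v for s, v in new_state.items() if prev_state.get(s) != v}
    if not changed:
        return []
    # Assumed page layout: one tab per domain, one field per slot, one search button.
    ops = [Operation("click", f"tab-{domain}")]
    for slot, value in sorted(changed.items()):
        ops.append(Operation("type", f"{domain}-{slot}", value))
    ops.append(Operation("click", f"{domain}-search"))
    return ops


if __name__ == "__main__":
    prev = {"area": "centre"}
    new = {"area": "centre", "pricerange": "cheap"}
    for op in state_to_operations("hotel", prev, new):
        print(op)
```

Diffing against the previous turn's state keeps the instruction list short, since a dialogue state normally accumulates slots across turns; whether the original script works this way is an assumption.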

Related papers

Generative Interfaces for Language Models [70.25765232527762]
We propose a paradigm in which large language models (LLMs) respond to user queries by proactively generating user interfaces (UIs). Our framework leverages structured interface-specific representations and iterative refinements to translate user queries into task-specific UIs. Results show that generative interfaces consistently outperform conversational ones, with up to a 72% improvement in human preference.
arXiv Detail & Related papers (2025-08-26T17:43:20Z)
Multi-User MultiWOZ: Task-Oriented Dialogues among Multiple Users [51.34484827552774]
We release the Multi-User MultiWOZ dataset: task-oriented dialogues among two users and one agent. These dialogues reflect interesting dynamics of collaborative decision-making in task-oriented scenarios. We propose a novel task of multi-user contextual query rewriting: to rewrite a task-oriented chat between two users as a concise task-oriented query.
arXiv Detail & Related papers (2023-10-31T14:12:07Z)
Leveraging Explicit Procedural Instructions for Data-Efficient Action Prediction [5.448684866061922]
Task-oriented dialogues often require agents to enact complex, multi-step procedures in order to meet user requests. Large language models have found success automating these dialogues in constrained environments, but their widespread deployment is limited by the substantial quantities of task-specific data required for training. This paper presents a data-efficient solution to constructing dialogue systems, leveraging explicit instructions derived from agent guidelines.
arXiv Detail & Related papers (2023-06-06T18:42:08Z)
Using Textual Interface to Align External Knowledge for End-to-End Task-Oriented Dialogue Systems [53.38517204698343]
We propose a novel paradigm that uses a textual interface to align external knowledge and eliminate redundant processes. We demonstrate our paradigm in practice through MultiWOZ-Remake, including an interactive textual interface built for the MultiWOZ database.
arXiv Detail & Related papers (2023-05-23T05:48:21Z)
Dialog2API: Task-Oriented Dialogue with API Description and Example Programs [57.336201096903466]
We introduce a new paradigm for task-oriented dialogue - Dialog2API - to greatly expand the functionality and provide a seamless dialogue experience. The model also manages the dialogue policy and interacts with the user by generating appropriate natural language responses. Dialog2API can work with many application scenarios such as software automation and customer service.
arXiv Detail & Related papers (2022-12-20T01:52:46Z)
Navigating Connected Memories with a Task-oriented Dialog System [13.117491508194242]
We propose dialogs for connected memories as a powerful tool to empower users to search their media collection through a multi-turn, interactive conversation. We use a new task-oriented dialog dataset COMET, which contains 11.5k user->assistant dialogs (totaling 103k utterances) grounded in simulated personal memory graphs. We analyze COMET, formulate four main tasks to benchmark meaningful progress, and adopt state-of-the-art language models as strong baselines.
arXiv Detail & Related papers (2022-11-15T19:31:57Z)
Manual-Guided Dialogue for Flexible Conversational Agents [84.46598430403886]
How to build and use dialogue data efficiently, and how to deploy models in different domains at scale can be critical issues in building a task-oriented dialogue system. We propose a novel manual-guided dialogue scheme, where the agent learns the tasks from both dialogue and manuals. Our proposed scheme reduces the dependence of dialogue models on fine-grained domain ontology, and makes them more flexible to adapt to various domains.
arXiv Detail & Related papers (2022-08-16T08:21:12Z)
Situated and Interactive Multimodal Conversations [21.391260370502224]
We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents. We provide two SIMMC datasets totalling 13K human-human dialogs (169K utterances) using a multimodal Wizard-of-Oz (WoZ) setup. We present several tasks within SIMMC as objective evaluation protocols, such as Structural API Prediction and Response Generation.
arXiv Detail & Related papers (2020-06-02T09:02:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.