TWIZ-v2: The Wizard of Multimodal Conversational-Stimulus
- URL: http://arxiv.org/abs/2310.02118v2
- Date: Mon, 22 Jan 2024 14:41:43 GMT
- Title: TWIZ-v2: The Wizard of Multimodal Conversational-Stimulus
- Authors: Rafael Ferreira, Diogo Tavares, Diogo Silva, Rodrigo Valério, João Bordalo, Inês Simões, Vasco Ramos, David Semedo, João Magalhães
- Abstract summary: We describe the vision, challenges, and scientific contributions of the Task Wizard team, TWIZ, in the Alexa Prize TaskBot Challenge 2022.
Our vision is to build the TWIZ bot as a helpful, multimodal, knowledgeable, and engaging assistant that can guide users towards the successful completion of complex manual tasks.
- Score: 8.010354166991991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we describe the vision, challenges, and scientific
contributions of the Task Wizard team, TWIZ, in the Alexa Prize TaskBot
Challenge 2022. Our vision is to build the TWIZ bot as a helpful, multimodal,
knowledgeable, and engaging assistant that can guide users towards the
successful completion of complex manual tasks. To achieve this, we focus our
efforts on three main research questions: (1) Humanly-Shaped Conversations, by
providing information in a knowledgeable way; (2) Multimodal Stimulus, making
use of various modalities including voice, images, and videos; and (3)
Zero-shot Conversational Flows, to improve the robustness of the interaction to
unseen scenarios. TWIZ is an assistant capable of supporting a wide range of
tasks, with several innovative features such as creative cooking, video
navigation through voice, and the robust TWIZ-LLM, a Large Language Model
trained for dialoguing about complex manual tasks. Given ratings and feedback
provided by users, we observed that the TWIZ bot is an effective and robust system,
capable of guiding users through tasks while providing several multimodal
stimuli.
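As an illustration of the multimodal stimulus idea described in the abstract, the following is a minimal, hypothetical sketch of how a single task step combining spoken text, screen text, an image, and a video segment might be represented. All class and field names and the example URL are assumptions made for illustration; they are not taken from the TWIZ implementation.

```python
# Hypothetical sketch of a multimodal task-step payload; names are illustrative only,
# not the TWIZ system's actual data model.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskStep:
    spoken_text: str                                 # what the assistant says aloud
    screen_text: str                                 # shorter text shown on a screen device
    image_url: Optional[str] = None                  # supporting image for the step
    video_url: Optional[str] = None                  # video segment navigable by voice
    hints: List[str] = field(default_factory=list)   # follow-up suggestions offered to the user

step = TaskStep(
    spoken_text="Whisk the eggs and sugar until the mixture turns pale.",
    screen_text="Step 2: Whisk eggs + sugar until pale.",
    image_url="https://example.com/whisked-eggs.jpg",
    hints=["How long should I whisk?", "Show me a video"],
)
print(step.screen_text)
```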
Related papers
- WavLLM: Towards Robust and Adaptive Speech Large Language Model [93.0773293897888]
We introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter.
We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set.
arXiv Detail & Related papers (2024-03-31T12:01:32Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Multitask Multimodal Prompted Training for Interactive Embodied Task Completion [48.69347134411864]
Embodied MultiModal Agent (EMMA) is a unified encoder-decoder model that reasons over images and trajectories.
By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks.
arXiv Detail & Related papers (2023-11-07T15:27:52Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- Roll Up Your Sleeves: Working with a Collaborative and Engaging Task-Oriented Dialogue System [28.75059053433368]
TacoBot is a user-centered task-oriented digital assistant.
We aim to deliver a collaborative and engaging dialogue experience.
To enhance the dialogue experience, we explore a series of data augmentation strategies.
arXiv Detail & Related papers (2023-07-29T21:37:24Z)
- Few-shot Multimodal Multitask Multilingual Learning [0.0]
We propose few-shot learning for a multimodal multitask multilingual (FM3) setting by adapting pre-trained vision and language models.
FM3 learns the most prominent tasks in the vision and language domains along with their intersections.
arXiv Detail & Related papers (2023-02-19T03:48:46Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
- Duplex Conversation: Towards Human-like Interaction in Spoken Dialogue System [120.70726465994781]
A multimodal spoken dialogue system enables telephone-based agents to interact with customers like humans.
We deploy Duplex Conversation in Alibaba intelligent customer service and share lessons learned in production.
Online A/B experiments show that the proposed system can significantly reduce response latency by 50%.
arXiv Detail & Related papers (2022-05-30T12:41:23Z)
- On Task-Level Dialogue Composition of Generative Transformer Model [9.751234480029765]
We study the effect of training on human-human task-oriented dialogues towards improving the ability of Transformer generative models to compose multiple tasks.
To that end, we propose and explore two solutions: (1) creating synthetic multi-task dialogue data for training from human-human single-task dialogues, and (2) forcing the encoder representation to be invariant to single- and multi-task dialogues using an auxiliary loss (a minimal sketch of such a loss follows this list).
arXiv Detail & Related papers (2020-10-09T22:10:03Z)
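To make the second solution of the last entry more concrete, here is a minimal, hypothetical sketch of an auxiliary loss that pushes the encoder representations of single-task and composed multi-task dialogues towards each other. It assumes PyTorch and a toy GRU encoder; neither the architecture nor the names are taken from the paper, and in practice this term would be weighted and added to the usual generation loss.

```python
# Hypothetical sketch of an auxiliary invariance loss between single-task and
# multi-task dialogue representations; illustrative only, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDialogueEncoder(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> pooled dialogue representation (batch, dim)
        hidden, _ = self.rnn(self.embed(token_ids))
        return hidden.mean(dim=1)

encoder = ToyDialogueEncoder()
single_task = torch.randint(0, 1000, (4, 20))   # placeholder single-task dialogues
multi_task = torch.randint(0, 1000, (4, 40))    # placeholder composed multi-task dialogues

z_single = encoder(single_task)
z_multi = encoder(multi_task)

# Auxiliary loss encouraging the two representations to be close; a real setup would
# add this (scaled by a weight) to the generation cross-entropy loss before backprop.
aux_loss = F.mse_loss(z_multi, z_single)
aux_loss.backward()
```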
This list is automatically generated from the titles and abstracts of the papers in this site.