SimpleMTOD: A Simple Language Model for Multimodal Task-Oriented
Dialogue with Symbolic Scene Representation
- URL: http://arxiv.org/abs/2307.04907v1
- Date: Mon, 10 Jul 2023 21:16:46 GMT
- Title: SimpleMTOD: A Simple Language Model for Multimodal Task-Oriented
Dialogue with Symbolic Scene Representation
- Authors: Bhathiya Hemanthage, Christian Dondrup, Phil Bartie, Oliver Lemon
- Abstract summary: SimpleMTOD recasts several sub-tasks in multimodal task-oriented dialogues as sequence prediction tasks.
We introduce both local and de-localized tokens for objects within a scene.
The model does not rely on task-specific architectural changes such as classification heads.
- Score: 2.4469484645516837
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: SimpleMTOD is a simple language model which recasts several sub-tasks in
multimodal task-oriented dialogues as sequence prediction tasks. SimpleMTOD is
built on a large-scale transformer-based auto-regressive architecture, which
has already proven to be successful in uni-modal task-oriented dialogues, and
effectively leverages transfer learning from pre-trained GPT-2. In order to
capture the semantics of visual scenes, we introduce both local and
de-localized tokens for objects within a scene. De-localized tokens represent
the type of an object rather than the specific object itself and so possess a
consistent meaning across the dataset. SimpleMTOD achieves a state-of-the-art
BLEU score (0.327) in the Response Generation sub-task of the SIMMC 2.0
test-std dataset while performing on par in other multimodal sub-tasks:
Disambiguation, Coreference Resolution, and Dialog State Tracking. This is
despite taking a minimalist approach for extracting visual (and non-visual)
information. In addition, the model does not rely on task-specific architectural
changes such as classification heads.
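To make the serialization idea concrete, the sketch below shows one plausible way a multimodal dialogue turn could be flattened into a single token sequence with de-localized object tokens. The delimiter tokens (<context>, <scene>, <belief>, <response>) and the object encoding are illustrative assumptions for this example, not the exact format used by SimpleMTOD.

```python
# Illustrative sketch only: the special tokens and the object encoding below are
# assumptions made for this example, not SimpleMTOD's exact serialization.

def delocalize(scene_objects):
    """Map scene-specific object indices to type-level ("de-localized") tokens.

    A de-localized token describes what an object is (e.g. the first hoodie seen
    in the scene) rather than which slot it occupies in one particular scene, so
    the same token keeps a consistent meaning across the whole dataset.
    """
    delocalized, type_counts = {}, {}
    for obj in scene_objects:                     # e.g. {"index": 57, "type": "hoodie"}
        count = type_counts.get(obj["type"], 0)
        delocalized[obj["index"]] = f"<{obj['type']}_{count}>"
        type_counts[obj["type"]] = count + 1
    return delocalized                            # {57: "<hoodie_0>", 12: "<jacket_0>"}


def serialize_turn(history, scene_objects, belief_state, response):
    """Flatten one multimodal dialogue turn into a single training string, so that
    state tracking, coreference, and response generation all reduce to
    next-token prediction for a GPT-2 style causal language model."""
    scene = " ".join(delocalize(scene_objects).values())
    return (
        f"<context> {history} "
        f"<scene> {scene} "
        f"<belief> {belief_state} "
        f"<response> {response}"
    )


print(serialize_turn(
    history="USER: do you have that hoodie in grey?",
    scene_objects=[{"index": 57, "type": "hoodie"}, {"index": 12, "type": "jacket"}],
    belief_state="REQUEST:GET type=hoodie color=grey",
    response="Yes, the grey hoodie is also available in size M.",
))
```

Because every sub-task target appears somewhere in the flattened string, a GPT-2-style causal language model can be trained on all of them with plain next-token prediction, without task-specific classification heads.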
Related papers
- Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations [1.1650821883155187]
We propose Contrastive $\lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences.
Our method integrates the following three key types of features into a multi-level aligned representation.
We evaluate Contrastive $\lambda$-Repformer on a dataset built from the large-scale standard RT-1 dataset and on a physical robot platform.
arXiv Detail & Related papers (2024-10-01T06:35:34Z)
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Multitask Multimodal Prompted Training for Interactive Embodied Task Completion [48.69347134411864]
Embodied MultiModal Agent (EMMA) is a unified encoder-decoder model that reasons over images and trajectories.
By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks.
arXiv Detail & Related papers (2023-11-07T15:27:52Z)
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Referring Transformer: A One-step Approach to Multi-task Visual Grounding [45.42959940733406]
We propose a simple one-stage multi-task framework for visual grounding tasks.
Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder.
We show that our model benefits greatly from contextualized information and multi-task training.
arXiv Detail & Related papers (2021-06-06T10:53:39Z)
- A Simple Language Model for Task-Oriented Dialogue [61.84084939472287]
SimpleTOD is a simple approach to task-oriented dialogue that uses a single, causal language model trained on all sub-tasks recast as a single sequence prediction problem.
This allows SimpleTOD to fully leverage transfer learning from pre-trained, open domain, causal language models such as GPT-2.
arXiv Detail & Related papers (2020-05-02T11:09:27Z)
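For comparison with this unimodal precursor, the following minimal sketch shows the SimpleTOD-style idea of recasting a whole dialogue turn as one sequence; the delimiter tokens are assumed here for illustration rather than taken from SimpleTOD's actual special-token vocabulary.

```python
# Minimal sketch of the single-sequence recasting (delimiters are assumed for
# illustration, not SimpleTOD's exact special-token vocabulary).

def simpletod_sequence(history, belief_state, db_result, system_action, response):
    """Flatten every sub-task target of a task-oriented dialogue turn into one
    string, so a single causal LM such as GPT-2 learns them all via
    next-token prediction."""
    return " ".join([
        "<|context|>", history,
        "<|belief|>", belief_state,      # dialogue state tracking target
        "<|dbresult|>", db_result,       # database lookup outcome
        "<|action|>", system_action,     # system action / policy target
        "<|response|>", response,        # response generation target
    ])


print(simpletod_sequence(
    history="USER: I need a cheap hotel in the centre.",
    belief_state="hotel price=cheap area=centre",
    db_result="3 matches",
    system_action="hotel inform choice, request stars",
    response="There are 3 cheap hotels in the centre. Any star-rating preference?",
))
```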