Which One Are You Referring To? Multimodal Object Identification in
Situated Dialogue
- URL: http://arxiv.org/abs/2302.14680v1
- Date: Tue, 28 Feb 2023 15:45:20 GMT
- Title: Which One Are You Referring To? Multimodal Object Identification in
Situated Dialogue
- Authors: Holy Lovenia, Samuel Cahyawijaya, Pascale Fung
- Abstract summary: We explore three methods to tackle the problem of interpreting multimodal inputs from conversational and situational contexts.
Our best method, scene-dialogue alignment, improves performance by ~20% F1-score compared to the SIMMC 2.1 baselines.
- Score: 50.279206765971125
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The demand for multimodal dialogue systems has been rising in various
domains, emphasizing the importance of interpreting multimodal inputs from
conversational and situational contexts. We explore three methods to tackle
this problem and evaluate them on the largest situated dialogue dataset, SIMMC
2.1. Our best method, scene-dialogue alignment, improves performance by
~20% F1-score compared to the SIMMC 2.1 baselines. We provide analysis and
discussion of the limitations of our methods and potential directions
for future work. Our code is publicly available at
https://github.com/holylovenia/multimodal-object-identification.
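The abstract names the task (deciding which objects in a scene a dialogue refers to) and the metric (F1-score), but not the implementation. As a loose illustration only, and not the paper's scene-dialogue alignment method, the Python sketch below frames object identification as scoring each candidate scene object against the dialogue context and thresholding the similarity; encode_dialogue, encode_object, the 256-dimensional embeddings, and the cosine threshold are hypothetical placeholders.

import numpy as np

def encode_dialogue(dialogue_turns):
    # Placeholder: a real system would use a pretrained text encoder here.
    seed = abs(hash(" ".join(dialogue_turns))) % 2**32
    return np.random.default_rng(seed).normal(size=256)

def encode_object(obj):
    # Placeholder: a real system would encode visual features and attributes.
    seed = abs(hash(str(sorted(obj.items())))) % 2**32
    return np.random.default_rng(seed).normal(size=256)

def identify_objects(dialogue_turns, scene_objects, threshold=0.0):
    # Return indices of scene objects whose cosine similarity with the
    # dialogue embedding exceeds the threshold.
    d = encode_dialogue(dialogue_turns)
    d = d / np.linalg.norm(d)
    picked = []
    for i, obj in enumerate(scene_objects):
        o = encode_object(obj)
        o = o / np.linalg.norm(o)
        if float(d @ o) > threshold:
            picked.append(i)
    return picked

def f1_score(predicted, gold):
    # F1 over predicted vs. gold object indices for a single turn.
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if not predicted or not gold or tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

In the actual work, the encoders and the decision rule would be learned from SIMMC 2.1 rather than the random stand-ins used here; the sketch only shows how per-object predictions map onto the F1 metric the abstract reports.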
Related papers
- ChatterBox: Multi-round Multimodal Referring and Grounding [108.9673313949746]
We present a new benchmark and an efficient vision-language model for multi-round multimodal referring and grounding (MRG).
The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks.
Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively.
arXiv Detail & Related papers (2024-01-24T09:02:00Z)
- Multi-turn Dialogue Comprehension from a Topic-aware Perspective [70.37126956655985]
This paper proposes to model multi-turn dialogues from a topic-aware perspective.
We use a dialogue segmentation algorithm to split a dialogue passage into topic-concentrated fragments in an unsupervised way.
We also present a novel model, Topic-Aware Dual-Attention Matching (TADAM) Network, which takes topic segments as processing elements.
arXiv Detail & Related papers (2023-09-18T11:03:55Z)
- MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation [68.53133207668856]
We introduce the MMDialog dataset to better facilitate multi-modal conversation.
MMDialog is composed of a curated set of 1.08 million real-world dialogues with 1.53 million unique images across 4,184 topics.
To build an engaging dialogue system with this dataset, we propose and normalize two response-producing tasks.
arXiv Detail & Related papers (2022-11-10T17:37:04Z)
- Beyond the Granularity: Multi-Perspective Dialogue Collaborative Selection for Dialogue State Tracking [18.172993687706708]
In dialogue state tracking, dialogue history is a crucial material, and its utilization varies between different models.
We propose DiCoS-DST to dynamically select the relevant dialogue contents corresponding to each slot for state updating.
Our approach achieves new state-of-the-art performance on MultiWOZ 2.1 and MultiWOZ 2.2, and achieves superior performance on multiple mainstream benchmark datasets.
arXiv Detail & Related papers (2022-05-20T10:08:45Z)
- M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database [139.08528216461502]
We propose a Multi-modal Multi-scene Multi-label Emotional Dialogue dataset, M3ED.
M3ED contains 990 dyadic emotional dialogues from 56 different TV series, a total of 9,082 turns and 24,449 utterances.
To the best of our knowledge, M3ED is the first multimodal emotional dialogue dataset in Chinese.
arXiv Detail & Related papers (2022-05-09T06:52:51Z)
- UNITER-Based Situated Coreference Resolution with Rich Multimodal Input [9.227651071050339]
We present our work on the multimodal coreference resolution task of the Situated and Interactive Multimodal Conversation 2.0 dataset.
We propose a UNITER-based model utilizing rich multimodal context to determine whether each object in the current scene is mentioned in the current dialog turn.
Our model ranks second in the official evaluation on the object coreference resolution task with an F1 score of 73.3% after model ensembling.
arXiv Detail & Related papers (2021-12-07T06:31:18Z)
- SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations [9.626560177660634]
We present a new corpus for the Situated and Interactive Multimodal Conversations, SIMMC 2.0, aimed at building a successful multimodal assistant agent.
The dataset features 11K task-oriented dialogs (117K utterances) between a user and a virtual assistant in the shopping domain.
arXiv Detail & Related papers (2021-04-18T00:14:29Z)
- Rethinking Dialogue State Tracking with Reasoning [76.0991910623001]
This paper proposes to track dialogue states gradually by reasoning over dialogue turns with the help of back-end data.
Empirical results demonstrate that our method significantly outperforms the state-of-the-art methods by 38.6% in terms of joint belief accuracy for MultiWOZ 2.1.
arXiv Detail & Related papers (2020-05-27T02:05:33Z)
- Multi-View Attention Network for Visual Dialog [5.731758300670842]
It is necessary for an agent to 1) determine the semantic intent of the question and 2) align question-relevant textual and visual content.
We propose Multi-View Attention Network (MVAN), which leverages multiple views about heterogeneous inputs.
MVAN effectively captures the question-relevant information from the dialog history with two complementary modules.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)