UNITER-Based Situated Coreference Resolution with Rich Multimodal Input
- URL: http://arxiv.org/abs/2112.03521v1
- Date: Tue, 7 Dec 2021 06:31:18 GMT
- Title: UNITER-Based Situated Coreference Resolution with Rich Multimodal Input
- Authors: Yichen Huang, Yuchen Wang, Yik-Cheung Tam
- Abstract summary: We present our work on the multimodal coreference resolution task of the Situated and Interactive Multimodal Conversation 2.0 dataset.
We propose a UNITER-based model utilizing rich multimodal context to determine whether each object in the current scene is mentioned in the current dialog turn.
Our model ranks second in the official evaluation on the object coreference resolution task with an F1 score of 73.3% after model ensembling.
- Score: 9.227651071050339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present our work on the multimodal coreference resolution task of the
Situated and Interactive Multimodal Conversation 2.0 (SIMMC 2.0) dataset as a
part of the tenth Dialog System Technology Challenge (DSTC10). We propose a
UNITER-based model utilizing rich multimodal context such as textual dialog
history, object knowledge base and visual dialog scenes to determine whether
each object in the current scene is mentioned in the current dialog turn.
Results show that the proposed approach outperforms the official DSTC10
baseline substantially, with the object F1 score boosted from 36.6% to 77.3% on
the development set, demonstrating the effectiveness of the proposed object
representations from rich multimodal input. Our model ranks second in the
official evaluation on the object coreference resolution task with an F1 score
of 73.3% after model ensembling.
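The abstract frames object coreference as a per-object binary decision: given the dialog history, each candidate object's knowledge-base attributes, and its visual scene features, predict whether that object is mentioned in the current turn. The sketch below is a minimal illustration of that framing, with a generic transformer encoder standing in for UNITER; the class name, feature dimensions, and probability-averaging ensemble are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch: per-object "is this object mentioned?" classification
# conditioned on dialog text and object features, in the spirit of a
# UNITER-style joint encoder. Names and dimensions are illustrative only.
import torch
import torch.nn as nn


class ObjectMentionClassifier(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, obj_feat_dim=2048,
                 n_layers=4, n_heads=12):
        super().__init__()
        self.txt_emb = nn.Embedding(vocab_size, hidden)        # dialog-history tokens
        self.obj_proj = nn.Linear(obj_feat_dim, hidden)        # visual + knowledge-base object features
        layer = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # joint text-object encoder
        self.head = nn.Linear(hidden, 1)                       # binary "mentioned" logit per object

    def forward(self, txt_ids, obj_feats):
        # txt_ids:   (B, T) token ids of the dialog history + current turn
        # obj_feats: (B, O, obj_feat_dim), one row per object in the scene
        x = torch.cat([self.txt_emb(txt_ids), self.obj_proj(obj_feats)], dim=1)
        h = self.encoder(x)
        obj_h = h[:, txt_ids.size(1):]       # keep only the object positions
        return self.head(obj_h).squeeze(-1)  # (B, O) mention logits


# Toy usage: 2 dialogs, 16 text tokens, 5 candidate objects each.
model = ObjectMentionClassifier()
logits = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 5, 2048))
probs = torch.sigmoid(logits)

# One common realization of "model ensembling": average per-object
# probabilities over independently trained models, then threshold.
mentioned = torch.stack([probs, probs]).mean(dim=0) > 0.5
```

Probability averaging across runs is only one plausible reading of the "model ensembling" the abstract mentions; the abstract does not specify the exact scheme.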
Related papers
- S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results.
The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector.
arXiv Detail & Related papers (2024-06-26T12:45:43Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted with our AIMDiT framework on the public benchmark dataset MELD reveal improvements of 2.34% and 2.87% in the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- ChatterBox: Multi-round Multimodal Referring and Grounding [108.9673313949746]
We present a new benchmark and an efficient vision-language model for multi-round multimodal referring and grounding (MRG).
The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks.
Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively.
arXiv Detail & Related papers (2024-01-24T09:02:00Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features, which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to the multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Application of frozen large-scale models to multimodal task-oriented dialogue [0.0]
We use the existing Large Language Models ENhanced to See (LENS) Framework to test the feasibility of multimodal task-oriented dialogues.
The LENS Framework has been proposed as a method to solve computer vision tasks without additional training and with fixed parameters of pre-trained models.
arXiv Detail & Related papers (2023-10-02T01:42:28Z)
- Which One Are You Referring To? Multimodal Object Identification in Situated Dialogue [50.279206765971125]
We explore three methods to tackle the problem of interpreting multimodal inputs from conversational and situational contexts.
Our best method, scene-dialogue alignment, improves performance by 20% in F1 score compared to the SIMMC 2.1 baselines.
arXiv Detail & Related papers (2023-02-28T15:45:20Z)
- Modeling Long Context for Task-Oriented Dialogue State Generation [51.044300192906995]
We propose a multi-task learning model with a simple yet effective utterance tagging technique and a bidirectional language model.
Our approach addresses the problem that the baseline's performance drops significantly when the input dialogue context sequence is long.
In our experiments, the proposed model achieves a 7.03% relative improvement over the baseline, establishing a new state-of-the-art joint goal accuracy of 52.04% on the MultiWOZ 2.0 dataset.
arXiv Detail & Related papers (2020-04-29T11:02:25Z)
- Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System [13.687071779732285]
We propose a multi-step joint-modality attention network (JMAN), based on a recurrent neural network (RNN), to reason over videos.
Our model achieves relative improvements of 12.1% and 22.4% over the baseline in ROUGE-L and CIDEr scores.
arXiv Detail & Related papers (2020-01-17T09:18:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.