Modeling Explicit Concerning States for Reinforcement Learning in Visual
Dialogue
- URL: http://arxiv.org/abs/2107.05250v1
- Date: Mon, 12 Jul 2021 08:15:35 GMT
- Title: Modeling Explicit Concerning States for Reinforcement Learning in Visual
Dialogue
- Authors: Zipeng Xu, Fandong Meng, Xiaojie Wang, Duo Zheng, Chenxu Lv and Jie
Zhou
- Abstract summary: We propose Explicit Concerning States (ECS) to represent which visual contents are of concern at each round and which have been of concern throughout the Visual Dialogue.
ECS is modeled from multimodal information and is represented explicitly.
Based on ECS, we formulate two intuitive and interpretable rewards to encourage the Visual Dialogue agents to converse on diverse and informative visual information.
- Score: 43.42833961578857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To encourage AI agents to conduct meaningful Visual Dialogue (VD), the use of
Reinforcement Learning has proven promising. In Reinforcement Learning, it
is crucial to represent states and assign rewards based on the action-caused
transitions of states. However, the state representation in previous Visual
Dialogue works uses only textual information, and its transitions are implicit.
In this paper, we propose Explicit Concerning States (ECS) to represent which
visual contents are of concern at each round and which have been of concern
throughout the Visual Dialogue. ECS is modeled from multimodal
information and is represented explicitly. Based on ECS, we formulate two
intuitive and interpretable rewards to encourage the Visual Dialogue agents to
converse on diverse and informative visual information. Experimental results on
the VisDial v1.0 dataset show that our method enables the Visual Dialogue agents
to generate more visually coherent, less repetitive, and more visually informative
dialogues than previous methods, according to multiple automatic metrics, a human
study, and qualitative analysis.
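The abstract describes ECS as a record of which visual contents are of concern at each round and which have been of concern so far, with two rewards built on top of it. The sketch below is a minimal, hypothetical illustration of that idea only: it approximates ECS as the set of image regions whose attention weight passes a threshold, and derives a diversity reward (new regions versus already-concerned ones) and an informativeness reward (how much of the image a round grounds on). The function names, threshold, and reward formulas are assumptions for illustration, not the paper's exact definitions.
```python
# Minimal, hypothetical sketch of ECS-style bookkeeping and rewards.
# ECS is approximated here as the set of image-region indices whose attention
# weight at a round exceeds a threshold; the two rewards below are illustrative
# stand-ins for the paper's ECS-based rewards, not its exact formulation.
from typing import List, Set

import numpy as np


def concerned_regions(attn_weights: np.ndarray, threshold: float = 0.05) -> Set[int]:
    """Regions this round 'concerns': attention weight above the threshold."""
    return {i for i, w in enumerate(attn_weights) if w > threshold}


def diversity_reward(current: Set[int], history: Set[int]) -> float:
    """Fraction of currently concerned regions not concerned in earlier rounds."""
    if not current:
        return 0.0
    return len(current - history) / len(current)


def informativeness_reward(current: Set[int], num_regions: int) -> float:
    """Fraction of all image regions that this round grounds on."""
    return len(current) / num_regions


def ecs_rewards(per_round_attn: List[np.ndarray]) -> List[float]:
    """Accumulate the concerned-region history across rounds and score each round."""
    history: Set[int] = set()
    rewards: List[float] = []
    for attn in per_round_attn:
        current = concerned_regions(attn)
        rewards.append(diversity_reward(current, history)
                       + informativeness_reward(current, len(attn)))
        history |= current  # "what has been concerned" so far in the dialogue
    return rewards


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for per-round attention over 36 detected image regions.
    attn_per_round = [rng.dirichlet(np.ones(36)) for _ in range(5)]
    print([round(r, 3) for r in ecs_rewards(attn_per_round)])
```
In a full system these per-round rewards would feed a policy-gradient update for the dialogue agents; here they are computed on random attention maps only to show the bookkeeping.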
Related papers
- Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models [25.070424546200293]
We present a novel approach leveraging the robust reasoning capabilities of large language models (LLMs) to generate precise dialogue-associated visual descriptors.
Experiments conducted on benchmark data validate the effectiveness of our proposed approach in deriving concise and accurate visual descriptors.
Our findings demonstrate the method's generalizability across diverse visual cues, various LLMs, and different datasets.
arXiv Detail & Related papers (2024-07-04T03:50:30Z)
- ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue [34.223466503256766]
We provide a new paradigm of constructing multimodal dialogues by splitting visual knowledge into finer granularity.
To boost the accuracy and diversity of the augmented visual information, we retrieve it from the Internet or a large image dataset.
By leveraging text and vision knowledge, ReSee can produce informative responses with real-world visual concepts.
arXiv Detail & Related papers (2023-05-23T02:08:56Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- VD-PCR: Improving Visual Dialog with Pronoun Coreference Resolution [79.05412803762528]
The visual dialog task requires an AI agent to interact with humans in multi-round dialogs based on a visual environment.
We propose VD-PCR, a novel framework to improve Visual Dialog understanding with Pronoun Coreference Resolution.
With the proposed implicit and explicit methods, VD-PCR achieves state-of-the-art experimental results on the VisDial dataset.
arXiv Detail & Related papers (2022-05-29T15:29:50Z)
- Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning [24.673262969986993]
We analyze the cross-modal understanding in visual dialog based on the vision-language pre-training model VD-BERT.
We propose a novel approach to improve the cross-modal understanding for visual dialog, named ICMU.
arXiv Detail & Related papers (2022-04-15T02:36:52Z)
- Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog [12.034554338597067]
We propose a novel model, Reasoning with Multi-structure Commonsense Knowledge (RMK).
In our model, the external knowledge is represented with sentence-level facts and graph-level facts.
On top of these multi-structure representations, our model can capture relevant knowledge and incorporate it into the vision and semantic features.
arXiv Detail & Related papers (2022-04-10T13:12:10Z)
- Modeling Coreference Relations in Visual Dialog [18.926582410644375]
The occurrence of coreference relations in the dialog makes it a more challenging task than visual question answering.
We propose two soft constraints that can improve the model's ability to resolve coreferences in dialog in an unsupervised way.
arXiv Detail & Related papers (2022-03-06T15:22:24Z)
- $C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues [97.25466640240619]
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses relevant to both the dialogue and video context.
Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available.
We propose a novel approach of Compositional Counterfactual Contrastive Learning to develop contrastive training between factual and counterfactual samples in video-grounded dialogues.
arXiv Detail & Related papers (2021-06-16T16:05:27Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)