Reference Resolution and Context Change in Multimodal Situated Dialogue
for Exploring Data Visualizations
- URL: http://arxiv.org/abs/2209.02215v1
- Date: Tue, 6 Sep 2022 04:43:28 GMT
- Title: Reference Resolution and Context Change in Multimodal Situated Dialogue
for Exploring Data Visualizations
- Authors: Abhinav Kumar, Barbara Di Eugenio, Abari Bhattacharya, Jillian
Aurisano, Andrew Johnson
- Abstract summary: We focus on resolving references to visualizations on a large screen display in multimodal dialogue.
We describe our annotations for user references to visualizations appearing on a large screen via language and hand gesture.
We report results on detecting and resolving references, effectiveness of contextual information on the model, and under-specified requests for creating visualizations.
- Score: 3.5813777917429515
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reference resolution, which aims to identify entities being referred to by a
speaker, is more complex in real world settings: new referents may be created
by processes the agents engage in and/or be salient only because they belong to
the shared physical setting. Our focus is on resolving references to
visualizations on a large screen display in multimodal dialogue; crucially,
reference resolution is directly involved in the process of creating new
visualizations. We describe our annotations for user references to
visualizations appearing on a large screen via language and hand gesture and
also new entity establishment, which results from executing the user request to
create a new visualization. We also describe our reference resolution pipeline
which relies on an information-state architecture to maintain dialogue context.
We report results on detecting and resolving references, effectiveness of
contextual information on the model, and under-specified requests for creating
visualizations. We also experiment with conventional CRF and deep learning /
transformer models (BiLSTM-CRF and BERT-CRF) for tagging references in user
utterance text. Our results show that transfer learning significantly boosts
the performance of the deep learning methods, although CRF still outperforms them,
suggesting that conventional methods may generalize better for low-resource
data.
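The abstract mentions an information-state architecture that maintains dialogue context, references made via language and hand gesture, and new entities established when a visualization is created. The following is a minimal Python sketch of that general idea, not the authors' pipeline: the class names, the word-overlap scoring, and the recency tie-break are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Visualization:
    vis_id: int
    plot_type: str          # e.g. "bar chart" (hypothetical label)
    attributes: frozenset   # hypothetical data attributes shown in the plot


@dataclass
class InformationState:
    """Toy dialogue context: what is on screen and how salient each item is."""
    on_screen: Dict[int, Visualization] = field(default_factory=dict)
    salience: List[int] = field(default_factory=list)  # most salient id is last
    next_id: int = 1

    def _touch(self, vis_id: int) -> None:
        # Move a visualization to the most-salient position.
        if vis_id in self.salience:
            self.salience.remove(vis_id)
        self.salience.append(vis_id)

    def establish(self, plot_type: str, attributes: Iterable[str]) -> Visualization:
        # Executing a "create visualization" request adds a new referent.
        vis = Visualization(self.next_id, plot_type, frozenset(attributes))
        self.on_screen[vis.vis_id] = vis
        self._touch(vis.vis_id)
        self.next_id += 1
        return vis

    def resolve(self, words: Iterable[str],
                gesture_target: Optional[int] = None) -> Optional[Visualization]:
        # Assumption: a pointing gesture at an on-screen item wins outright.
        if gesture_target in self.on_screen:
            self._touch(gesture_target)
            return self.on_screen[gesture_target]
        # Otherwise score by word overlap; ties go to the more salient item,
        # so a bare pronoun ("it") falls back to the most recent referent.
        word_set = set(words)
        best, best_key = None, (0, -1)
        for rank, vis_id in enumerate(self.salience):
            vis = self.on_screen[vis_id]
            vocab = vis.attributes | set(vis.plot_type.split())
            key = (len(word_set & vocab), rank)
            if key > best_key:
                best, best_key = vis, key
        if best is not None:
            self._touch(best.vis_id)
        return best


if __name__ == "__main__":
    state = InformationState()
    state.establish("bar chart", {"theft", "month"})
    state.establish("line chart", {"assault", "year"})
    print(state.resolve(["the", "theft", "chart"]))
    print(state.resolve(["move", "it"]))
```

In this sketch the salience list stands in for dialogue context: resolving or creating a visualization updates it, which in turn changes how later under-specified references are interpreted.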
Related papers
- Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval [26.585985828583304]
We propose an end-to-end multimodal retrieval system, Ret-XKnow, to endow a text retriever with the ability to understand multimodal queries.
To effectively learn multimodal interaction, we also introduce the Visual Dialogue-to-Retrieval dataset automatically constructed from visual dialogue datasets.
We demonstrate that our approach not only significantly improves retrieval performance in zero-shot settings but also achieves substantial improvements in fine-tuning scenarios.
(2024-11-13T04:32:58Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
(2024-03-19T17:59:52Z)
- Knowledge Graphs and Pre-trained Language Models enhanced Representation Learning for Conversational Recommender Systems [58.561904356651276]
We introduce the Knowledge-Enhanced Entity Representation Learning (KERL) framework to improve the semantic understanding of entities for conversational recommender systems.
KERL uses a knowledge graph and a pre-trained language model to improve the semantic understanding of entities.
KERL achieves state-of-the-art results in both recommendation and response generation tasks.
(2023-12-18T06:41:23Z)
- Resolving References in Visually-Grounded Dialogue via Text Generation [3.8673630752805446]
Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge.
We propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references.
We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot.
(2023-09-23T17:07:54Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
(2023-07-03T13:21:58Z)
- ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue [34.223466503256766]
We provide a new paradigm of constructing multimodal dialogues by splitting visual knowledge into finer granularity.
To boost the accuracy and diversity of the augmented visual information, we retrieve it from the Internet or a large image dataset.
By leveraging text and vision knowledge, ReSee can produce informative responses with real-world visual concepts.
(2023-05-23T02:08:56Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
(2021-10-09T02:53:39Z)
- Learning from Context or Names? An Empirical Study on Neural Relation Extraction [112.06614505580501]
We study the effect of two main information sources in text: textual context and entity mentions (names).
We propose an entity-masked contrastive pre-training framework for relation extraction (RE).
Our framework can improve the effectiveness and robustness of neural models in different RE scenarios.
(2020-10-05T11:21:59Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally and then refine the object-object connections globally.
Experiments show that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
(2020-06-15T12:25:40Z)