Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog
- URL: http://arxiv.org/abs/2310.07259v2
- Date: Wed, 22 May 2024 11:58:12 GMT
- Title: Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog
- Authors: Haoyu Zhang, Meng Liu, Yaowei Wang, Da Cao, Weili Guan, Liqiang Nie
- Abstract summary: Video-grounded dialog requires profound understanding of both dialog history and video content for accurate response generation.
We present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator.
- Score: 83.63849872250651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In contrast to conventional visual question answering, video-grounded dialog necessitates a profound understanding of both dialog history and video content for accurate response generation. Despite commendable progress made by existing approaches, they still face the challenges of incrementally understanding complex dialog history and assimilating video information. In response to these challenges, we present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator. Specifically, we devise a path search and aggregation strategy in the textual encoder, mining core cues from dialog history that are pivotal to understanding the posed questions. Concurrently, our visual encoder harnesses an iterative reasoning network to extract and emphasize critical visual markers from videos, enhancing the depth of visual comprehension. Finally, we utilize the pre-trained GPT-2 model as our answer generator to decode the mined hidden clues into coherent and contextualized answers. Extensive experiments on three public datasets demonstrate the effectiveness and generalizability of our proposed framework.
Related papers
- SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge [60.76719375410635]
We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos.
The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving.
We generate associated question-answer pairs and reasoning processes, finally followed by manual reviews for quality assurance.
arXiv Detail & Related papers (2024-05-15T21:55:31Z)
- MSG-BART: Multi-granularity Scene Graph-Enhanced Encoder-Decoder Language Model for Video-grounded Dialogue Generation [25.273719615694958]
We propose a novel approach named MSG-BART, which enhances the integration of video information.
Specifically, we integrate global and local scene graphs into the encoder and decoder, respectively.
Extensive experiments are conducted on three video-grounded dialogue benchmarks, which show the significant superiority of MSG-BART.
arXiv Detail & Related papers (2023-09-26T04:23:23Z)
- VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions [47.94531693056304]
Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse, and reason over situated semantics.
We present the Video-grounded Scene&Topic AwaRe dialogue dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series.
arXiv Detail & Related papers (2023-05-30T05:40:37Z)
- A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System [25.17100881568308]
Natural Language Understanding (NLU) and Natural Language Generation (NLG) are the two critical components of every conversational system.
We propose an end-to-end framework with the capability to extract necessary slot values from the utterance.
We employ a multimodal hierarchical encoder using pre-trained DialoGPT to provide a stronger context for both tasks.
arXiv Detail & Related papers (2023-05-27T10:06:03Z)
- ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue [34.223466503256766]
We provide a new paradigm of constructing multimodal dialogues by splitting visual knowledge into finer granularity.
To boost the accuracy and diversity of augmented visual information, we retrieve them from the Internet or a large image dataset.
By leveraging text and vision knowledge, ReSee can produce informative responses with real-world visual concepts.
arXiv Detail & Related papers (2023-05-23T02:08:56Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues [73.04906599884868]
We propose a novel framework of Reasoning Paths in Dialogue Context (PDC).
The PDC model discovers information flows among dialogue turns through a semantic graph constructed based on lexical components in each question and answer.
Our model sequentially processes both visual and textual information through this reasoning path and the propagated features are used to generate the answer.
arXiv Detail & Related papers (2021-03-01T07:39:26Z)
- Reasoning in Dialog: Improving Response Generation by Context Reading Comprehension [49.92173751203827]
In multi-turn dialog, utterances do not always take the full form of sentences.
We propose to improve the response generation performance by examining the model's ability to answer a reading comprehension question.
arXiv Detail & Related papers (2020-12-14T10:58:01Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally.
Experiments have proved that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.