Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents
- URL: http://arxiv.org/abs/2008.07935v2
- Date: Sat, 22 Aug 2020 12:58:39 GMT
- Title: Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents
- Authors: Ye Zhu, Yu Wu, Yi Yang, and Yan Yan
- Abstract summary: We introduce a new task called video description via two multi-modal cooperative dialog agents.
Q-BOT is given two static frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions.
A-BOT, the other agent who has already seen the entire video, assists Q-BOT to accomplish the goal by providing answers to those questions.
- Score: 37.120459786999724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With rising concerns about AI systems being given direct access to
abundant sensitive information, researchers seek to develop more reliable AI
that relies on implicit information sources. To this end, in this paper, we introduce a
new task called video description via two multi-modal cooperative dialog
agents, whose ultimate goal is for one conversational agent to describe an
unseen video based on the dialog and two static frames. Specifically, one of
the intelligent agents - Q-BOT - is given two static frames from the beginning
and the end of the video, as well as a finite number of opportunities to ask
relevant natural language questions before describing the unseen video. A-BOT,
the other agent who has already seen the entire video, assists Q-BOT to
accomplish the goal by providing answers to those questions. We propose a
QA-Cooperative Network with a dynamic dialog history update learning mechanism
to transfer knowledge from A-BOT to Q-BOT, thus helping Q-BOT to better
describe the video. Extensive experiments demonstrate that Q-BOT can
effectively learn to describe an unseen video with the proposed model and the
cooperative learning method, achieving promising performance comparable to the
setting where Q-BOT is given the full ground-truth dialog history.
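As a rough illustration of the task setup, the minimal Python sketch below makes the cooperative dialog protocol explicit. All names here (QBot, ABot, cooperative_dialog, and the placeholder ask/answer/describe bodies) are hypothetical stand-ins for learned models, not the authors' QA-Cooperative Network; the sketch only shows the information flow between the two agents.

```python
# Hypothetical sketch of the two-agent cooperative dialog loop (not the
# authors' QA-Cooperative Network): Q-BOT sees only two frames, asks a
# finite number of questions, A-BOT (which has seen the full video)
# answers, and Q-BOT finally describes the unseen video from the dialog.

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DialogHistory:
    """Question-answer pairs accumulated over the dialog rounds."""
    turns: List[Tuple[str, str]] = field(default_factory=list)

    def update(self, question: str, answer: str) -> None:
        # Stand-in for the paper's dynamic dialog history update mechanism.
        self.turns.append((question, answer))


class QBot:
    """Questioner: only sees the first and last frame of the video."""

    def __init__(self, first_frame: str, last_frame: str) -> None:
        self.frames = (first_frame, last_frame)

    def ask(self, history: DialogHistory) -> str:
        # A learned model would condition on the two frames and the history;
        # here we return a fixed placeholder question.
        return f"What happens between the two frames? (round {len(history.turns) + 1})"

    def describe(self, history: DialogHistory) -> str:
        # A learned decoder would generate a caption from frames + history.
        return f"Description generated after {len(history.turns)} QA rounds."


class ABot:
    """Answerer: has access to the entire video."""

    def __init__(self, video: str) -> None:
        self.video = video

    def answer(self, question: str) -> str:
        return f"Answer (from the full video) to: {question}"


def cooperative_dialog(q_bot: QBot, a_bot: ABot, num_rounds: int = 5) -> str:
    """Run the finite QA dialog, then let Q-BOT describe the unseen video."""
    history = DialogHistory()
    for _ in range(num_rounds):
        question = q_bot.ask(history)
        answer = a_bot.answer(question)
        history.update(question, answer)
    return q_bot.describe(history)


if __name__ == "__main__":
    q_bot = QBot(first_frame="frame_first.jpg", last_frame="frame_last.jpg")
    a_bot = ABot(video="full_video.mp4")
    print(cooperative_dialog(q_bot, a_bot, num_rounds=3))
```

In the actual model, the history update step is where knowledge is transferred from A-BOT to Q-BOT; here it is reduced to a list append.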
Related papers
- Collaborative Reasoning on Multi-Modal Semantic Graphs for
Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
A primary challenge of this task is the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z) - Learning to Retrieve Videos by Asking Questions [29.046045230398708]
We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems (a minimal sketch of such an interactive retrieval loop appears after this list).
arXiv Detail & Related papers (2022-05-11T19:14:39Z) - End-to-end Spoken Conversational Question Answering: Task, Dataset and
Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build a system that handles conversational questions based on audio recordings, and to explore whether providing additional cues from other modalities helps the system gather information.
arXiv Detail & Related papers (2022-04-29T17:56:59Z) - Few-Shot Bot: Prompt-Based Learning for Dialogue Systems [58.27337673451943]
Learning to converse using only a few examples is a great challenge in conversational AI.
The current best conversational models are either good chit-chatters (e.g., BlenderBot) or goal-oriented systems (e.g., MinTL).
We propose prompt-based few-shot learning which does not require gradient-based fine-tuning but instead uses a few examples as the only source of learning.
arXiv Detail & Related papers (2021-10-15T14:36:45Z) - Saying the Unseen: Video Descriptions via Dialog Agents [37.16726118481626]
We introduce a novel task that aims to describe a video using the natural language dialog between two agents.
Q-BOT is given two semantically segmented frames from the beginning and the end of the video.
A-BOT, the other agent who has access to the entire video, assists Q-BOT to accomplish the goal by answering the asked questions.
arXiv Detail & Related papers (2021-06-26T17:36:31Z) - Towards Data Distillation for End-to-end Spoken Conversational Question
Answering [65.124088336738]
We propose a new Spoken Conversational Question Answering task (SCQA).
SCQA aims at enabling QA systems to model complex dialogue flows given the speech utterances and text corpora.
Our main objective is to build a QA system to deal with conversational questions in both spoken and text forms.
arXiv Detail & Related papers (2020-10-18T05:53:39Z) - Multimodal Dialogue State Tracking By QA Approach with Data Augmentation [16.436557991074068]
This paper interprets the Audio-Video Scene-Aware Dialogue (AVSD) task from an open-domain Question Answering (QA) point of view.
The proposed QA system uses a common encoder-decoder framework with multimodal fusion and attention.
Our experiments show that our model and techniques bring significant improvements over the baseline model on the DSTC7-AVSD dataset.
arXiv Detail & Related papers (2020-07-20T06:23:18Z) - VD-BERT: A Unified Vision and Dialog Transformer with BERT [161.0016161052714]
We propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer.
We adapt BERT for the effective fusion of vision and dialog contents via visually grounded training.
Our model yields a new state of the art, achieving the top position in both single-model and ensemble settings.
arXiv Detail & Related papers (2020-04-28T04:08:46Z)
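For the interactive retrieval idea in "Learning to Retrieve Videos by Asking Questions" above, the following is a minimal sketch of a multi-round dialog retrieval loop. The function names (generate_question, answer_question, retrieve) and the keyword-overlap scoring are hypothetical placeholders, not the ViReD implementation; each round simply appends a question-answer pair to the query context before the final retrieval.

```python
# Hypothetical interactive video-retrieval loop in the spirit of ViReD
# (not the authors' implementation): ask clarifying questions for a few
# rounds, then rank videos against the query plus the dialog context.

from typing import List, Tuple


def generate_question(query: str, history: List[Tuple[str, str]]) -> str:
    # A learned question generator would pick the question expected to most
    # improve retrieval; here we return a placeholder.
    return f"Clarifying question {len(history) + 1} about: {query}"


def answer_question(question: str) -> str:
    # In the real setting a user (or an answerer model) replies.
    return f"User answer to: {question}"


def retrieve(query: str, history: List[Tuple[str, str]], videos: List[str]) -> List[str]:
    # Stand-in for a multimodal retriever: rank by naive keyword overlap
    # between each video name and the query plus dialog context.
    context = (query + " " + " ".join(q + " " + a for q, a in history)).lower().split()
    return sorted(videos, key=lambda v: -sum(word in v.lower() for word in context))


def interactive_retrieval(query: str, videos: List[str], rounds: int = 3) -> List[str]:
    history: List[Tuple[str, str]] = []
    for _ in range(rounds):
        question = generate_question(query, history)
        history.append((question, answer_question(question)))
    return retrieve(query, history, videos)


if __name__ == "__main__":
    ranked = interactive_retrieval("dog catching a frisbee", ["dog_park.mp4", "cooking.mp4"])
    print(ranked)
```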
This list is automatically generated from the titles and abstracts of the papers on this site.