Saying the Unseen: Video Descriptions via Dialog Agents
- URL: http://arxiv.org/abs/2106.14069v1
- Date: Sat, 26 Jun 2021 17:36:31 GMT
- Title: Saying the Unseen: Video Descriptions via Dialog Agents
- Authors: Ye Zhu, Yu Wu, Yi Yang, Yan Yan
- Abstract summary: We introduce a novel task that aims to describe a video using the natural language dialog between two agents.
Q-BOT is given two semantically segmented frames from the beginning and the end of the video.
A-BOT, the other agent who has access to the entire video, assists Q-BOT in accomplishing the goal by answering its questions.
- Score: 37.16726118481626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current vision and language tasks usually take complete visual data (e.g.,
raw images or videos) as input; however, practical scenarios often involve
situations where part of the visual information becomes inaccessible for
various reasons, e.g., a restricted view from a fixed camera or intentional
blocking of vision for security concerns. As a step towards such practical
application scenarios, we introduce a novel task that aims to describe a video using the
natural language dialog between two agents as a supplementary information
source given incomplete visual data. Different from most existing
vision-language tasks where AI systems have full access to images or video
clips, which may reveal sensitive information such as recognizable human faces
or voices, we intentionally limit the visual input for AI systems and seek a
more secure and transparent information medium, i.e., the natural language
dialog, to supplement the missing visual information. Specifically, one of the
intelligent agents - Q-BOT - is given two semantically segmented frames from
the beginning and the end of the video, as well as a finite number of
opportunities to ask relevant natural language questions before describing the
unseen video. A-BOT, the other agent who has access to the entire video,
assists Q-BOT in accomplishing the goal by answering those questions. We introduce two
different experimental settings with either a generative (i.e., agents generate
questions and answers freely) or a discriminative (i.e., agents select the
questions and answers from candidates) internal dialog generation process. With
the proposed unified QA-Cooperative networks, we experimentally demonstrate the
knowledge transfer process between the two dialog agents and the effectiveness
of using the natural language dialog as a supplement for the incomplete visual
information.
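For a concrete picture of the setup, the sketch below is a minimal, hypothetical Python rendering of the dialog protocol: Q-BOT, which only sees the two boundary frames, exchanges a fixed number of question-answer rounds with A-BOT, which sees the full video, under either the generative or the discriminative setting. The agent interfaces and method names are illustrative assumptions, not the paper's QA-Cooperative network implementation.

```python
# Hypothetical sketch of the Q-BOT / A-BOT protocol from the abstract above.
# The agent objects and their methods are placeholders, not the authors' code.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DialogState:
    """Everything Q-BOT is allowed to see: two boundary frames plus the dialog so far."""
    first_frame: object   # semantically segmented first frame
    last_frame: object    # semantically segmented last frame
    history: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) pairs


def run_dialog(q_bot, a_bot, video, state: DialogState,
               num_rounds: int = 10,
               candidates: Optional[List[str]] = None) -> str:
    """Run a fixed number of QA rounds, then let Q-BOT describe the unseen video.

    candidates=None corresponds to the generative setting (free-form questions
    and answers); passing a candidate pool corresponds to the discriminative
    setting (agents select from candidates).
    """
    for _ in range(num_rounds):
        if candidates is None:
            question = q_bot.generate_question(state)             # generative setting
            answer = a_bot.generate_answer(video, question)
        else:
            question = q_bot.select_question(state, candidates)   # discriminative setting
            answer = a_bot.select_answer(video, question, candidates)
        state.history.append((question, answer))
    # Q-BOT never accesses `video`: the final description is produced from the
    # two segmented frames and the accumulated dialog only.
    return q_bot.describe(state)
```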
Related papers
- VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.
VideoDistill generates answers only from question-related visual embeddings.
We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-01T07:44:24Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task include the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
- Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents [37.120459786999724]
We introduce a new task called video description via two multi-modal cooperative dialog agents.
Q-BOT is given two static frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions.
A-BOT, the other agent who has already seen the entire video, assists Q-BOT in accomplishing the goal by providing answers to those questions.
arXiv Detail & Related papers (2020-08-18T14:01:09Z)
- Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions [27.63022376316052]
We design ROLL, a model for knowledge-based video story question answering.
In ROLL, three cognitively inspired tasks extract rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion.
To answer a given question correctly, the information generated by each of these tasks is encoded via Transformers and fused through a modality weighting mechanism.
arXiv Detail & Related papers (2020-07-17T04:26:38Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
- Multi-View Attention Network for Visual Dialog [5.731758300670842]
It is necessary for an agent to 1) determine the semantic intent of the question and 2) align question-relevant textual and visual contents.
We propose Multi-View Attention Network (MVAN), which leverages multiple views about heterogeneous inputs.
MVAN effectively captures the question-relevant information from the dialog history with two complementary modules.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
- Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge [48.905496060794114]
We describe our submission to the AVSD track of the 8th Dialogue System Technology Challenge.
We adopt dot-product attention to combine text and non-text features of the input video; a generic sketch of this mechanism appears after this list.
Our systems achieve high performance in automatic metrics and obtain 5th and 6th place in human evaluation.
arXiv Detail & Related papers (2020-02-25T06:41:07Z)
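As a rough, generic illustration of the dot-product attention mentioned in the DSTC8 AVSD entry above, the NumPy sketch below lets text features attend over video features with scaled dot-product attention. The function name, shapes, and toy inputs are assumptions for illustration, not the submission's actual architecture.

```python
# Generic scaled dot-product cross-attention: text queries attend over video
# features. Illustrative only; not the DSTC8 submission's code.
import numpy as np


def cross_modal_attention(text_feats: np.ndarray, video_feats: np.ndarray) -> np.ndarray:
    """Combine text (T, d) and video (S, d) features; returns (T, d)."""
    d = text_feats.shape[-1]
    scores = text_feats @ video_feats.T / np.sqrt(d)        # (T, S) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over video steps
    return weights @ video_feats                            # video-conditioned text features


# Toy usage: 4 text tokens, 6 video segments, 8-dim features.
text = np.random.randn(4, 8)
video = np.random.randn(6, 8)
assert cross_modal_attention(text, video).shape == (4, 8)
```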