Learning to Discretely Compose Reasoning Module Networks for Video
Captioning
- URL: http://arxiv.org/abs/2007.09049v1
- Date: Fri, 17 Jul 2020 15:27:37 GMT
- Title: Learning to Discretely Compose Reasoning Module Networks for Video
Captioning
- Authors: Ganchao Tan, Daqing Liu, Meng Wang, Zheng-Jun Zha
- Abstract summary: We propose a novel visual reasoning approach for video captioning, named Reasoning Module Networks (RMN).
RMN employs 1) three sophisticated spatio-temporal reasoning modules, and 2) a dynamic and discrete module selector trained by a linguistic loss with a Gumbel approximation.
- Score: 81.81394228898591
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating natural language descriptions for videos, i.e., video captioning,
essentially requires step-by-step reasoning along the generation process. For
example, to generate the sentence "a man is shooting a basketball", we need to
first locate and describe the subject "man", next reason out the man is
"shooting", then describe the object "basketball" of shooting. However,
existing visual reasoning methods designed for visual question answering are
not appropriate to video captioning, for it requires more complex visual
reasoning on videos over both space and time, and dynamic module composition
along the generation process. In this paper, we propose a novel visual
reasoning approach for video captioning, named Reasoning Module Networks (RMN),
to equip the existing encoder-decoder framework with the above reasoning
capacity. Specifically, our RMN employs 1) three sophisticated spatio-temporal
reasoning modules, and 2) a dynamic and discrete module selector trained by a
linguistic loss with a Gumbel approximation. Extensive experiments on MSVD and
MSR-VTT datasets demonstrate the proposed RMN outperforms the state-of-the-art
methods while providing an explicit and explainable generation process. Our
code is available at https://github.com/tgc1997/RMN.
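As a rough illustration of the selector idea described in the abstract, the sketch below picks exactly one of several module outputs per decoding step with a straight-through Gumbel-Softmax. The module count, feature sizes, and wiring are assumptions made for illustration, not the authors' implementation; the linked repository has the official code.

```python
# Minimal sketch of a discrete module selector with a Gumbel-Softmax
# approximation, in the spirit of RMN's selector. The module count, sizes,
# and wiring are illustrative assumptions, not the authors' implementation
# (see https://github.com/tgc1997/RMN for the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteModuleSelector(nn.Module):
    def __init__(self, hidden_size: int = 512, num_modules: int = 3):
        super().__init__()
        # Scores each reasoning module from the decoder's current hidden state.
        self.scorer = nn.Linear(hidden_size, num_modules)

    def forward(self, decoder_state, module_outputs, tau: float = 1.0):
        # decoder_state: (batch, hidden); module_outputs: (batch, num_modules, hidden).
        logits = self.scorer(decoder_state)
        # hard=True samples a one-hot choice in the forward pass while gradients
        # flow through the soft Gumbel-Softmax relaxation (straight-through).
        weights = F.gumbel_softmax(logits, tau=tau, hard=True)
        # Keep exactly one module's output for this decoding step.
        return torch.einsum("bm,bmh->bh", weights, module_outputs)

# Usage: discretely fuse three hypothetical spatio-temporal module outputs.
selector = DiscreteModuleSelector()
state = torch.randn(8, 512)
module_feats = torch.randn(8, 3, 512)
fused = selector(state, module_feats)  # (8, 512)
```

The abstract's linguistic loss would add extra supervision on the selector logits during training; that term is omitted here.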
Related papers
- Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches.
To address the deficiencies this evaluation reveals, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR).
arXiv Detail & Related papers (2024-06-16T12:58:31Z) - STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results
for Video Question Answering [42.173245795917026]
We propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering.
STAIR is a neural module network, which contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks.
We conduct extensive experiments on several video question answering datasets to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available.
arXiv Detail & Related papers (2024-01-08T14:01:59Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
Its InsOVER algorithm locates the corresponding video events through an efficient Hungarian matching between decompositions of linguistic instructions and video events (see the matching sketch after this list).
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - METEOR Guided Divergence for Video Captioning [4.601294270277376]
We propose a reward-guided KL divergence to train a video captioning model that is resilient to token permutations.
We show the suitability of the HRL (hierarchical reinforcement learning) agent in generating content-complete and grammatically sound sentences, achieving 4.91, 2.23, and 10.80 in BLEU3, BLEU4, and METEOR scores, respectively.
arXiv Detail & Related papers (2022-12-20T23:30:47Z) - Learning to Collocate Visual-Linguistic Neural Modules for Image
Captioning [80.59607794927363]
We propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM).
Unlike the widely used neural module networks in VQA, the task of collocating visual-linguistic modules is more challenging.
Experiments on the MS-COCO dataset show that our CVLNM is more effective, achieving a new state-of-the-art 129.5 CIDEr-D, and more robust.
arXiv Detail & Related papers (2022-10-04T03:09:50Z) - LGDN: Language-Guided Denoising Network for Video-Language Modeling [30.99646752913056]
We propose an efficient and effective model, termed Language-Guided Denoising Network (LGDN) for video-language modeling.
Our LGDN dynamically filters out the misaligned or redundant frames under language supervision and obtains only 2-4 salient frames per video for cross-modal token-level alignment (see the frame-selection sketch after this list).
arXiv Detail & Related papers (2022-09-23T03:35:59Z) - Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
arXiv Detail & Related papers (2021-08-14T04:00:42Z) - Video Generation from Text Employing Latent Path Construction for
Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in Machine Learning and Computer Vision fields of study.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
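As referenced in the VidCoM entry above, the following is a generic sketch of Hungarian matching between decomposed linguistic instructions and candidate video events: pairwise similarities form a cost matrix and a linear assignment picks the pairing. The embeddings, similarity measure, and function name are placeholders, not the InsOVER implementation.

```python
# Generic Hungarian-matching sketch for pairing decomposed linguistic
# instructions with candidate video events (placeholder data and similarity;
# not the InsOVER implementation from the VidCoM paper).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instructions_to_events(instr_embs: np.ndarray, event_embs: np.ndarray):
    """instr_embs: (num_instructions, dim); event_embs: (num_events, dim)."""
    # Cosine similarities; negate so that higher similarity means lower cost.
    instr = instr_embs / np.linalg.norm(instr_embs, axis=1, keepdims=True)
    events = event_embs / np.linalg.norm(event_embs, axis=1, keepdims=True)
    cost = -(instr @ events.T)
    rows, cols = linear_sum_assignment(cost)  # Hungarian / Kuhn-Munkres
    return list(zip(rows.tolist(), cols.tolist()))

# Usage with random placeholder embeddings: 4 instruction pieces, 6 events.
pairs = match_instructions_to_events(np.random.randn(4, 256), np.random.randn(6, 256))
```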
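Similarly, the frame-selection sketch referenced in the LGDN entry: score each frame against the sentence embedding and keep only the top-k, with plain cosine similarity here as a stand-in for LGDN's actual salient-frame proposal mechanism.

```python
# Stand-in sketch of language-guided frame filtering: keep the k frames whose
# features are most similar to the sentence embedding (not LGDN's actual
# denoising network).
import torch
import torch.nn.functional as F

def select_salient_frames(frame_feats: torch.Tensor, text_feat: torch.Tensor, k: int = 4):
    """frame_feats: (num_frames, dim); text_feat: (dim,). Returns indices and features."""
    sims = F.cosine_similarity(frame_feats, text_feat.unsqueeze(0), dim=-1)  # (num_frames,)
    topk = torch.topk(sims, k=min(k, frame_feats.size(0)))
    idx = torch.sort(topk.indices).values  # preserve temporal order of kept frames
    return idx, frame_feats[idx]

# Usage: keep 4 of 32 frames for cross-modal token-level alignment.
idx, kept = select_salient_frames(torch.randn(32, 512), torch.randn(512), k=4)
```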