Learning to Collocate Visual-Linguistic Neural Modules for Image
Captioning
- URL: http://arxiv.org/abs/2210.01338v2
- Date: Mon, 24 Apr 2023 02:27:07 GMT
- Title: Learning to Collocate Visual-Linguistic Neural Modules for Image
Captioning
- Authors: Xu Yang and Hanwang Zhang and Chongyang Gao and Jianfei Cai
- Abstract summary: We propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM).
Unlike the widely used neural module networks in VQA, the task of collocating visual-linguistic modules is more challenging.
Experiments on the MS-COCO dataset show that our CVLNM is more effective, achieving a new state-of-the-art 129.5 CIDEr-D, and more robust.
- Score: 80.59607794927363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans tend to decompose a sentence into different parts like "sth do
sth at someplace" and then fill each part with certain content. Inspired by
this, we follow the principle of modular design to propose a novel image
captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM).
Unlike the widely used neural module networks in VQA, where the language (i.e.,
the question) is fully observable, the task of collocating visual-linguistic
modules is more challenging. This is because the language is only partially
observable, for which we need to dynamically collocate the modules during the
process of image captioning. To sum up, we make the following technical
contributions to design and train our CVLNM: 1) distinguishable module design --
four modules in the encoder, including one linguistic module for function words
and three visual modules for different content words (i.e., noun, adjective,
and verb), and another linguistic module in the decoder for commonsense
reasoning; 2) a self-attention based module controller for robustifying the
visual reasoning; 3) a part-of-speech based syntax loss imposed on the module
controller for further regularizing the training of our CVLNM. Extensive
experiments on the MS-COCO dataset show that our CVLNM is more effective, e.g.,
achieving a new state-of-the-art 129.5 CIDEr-D, and more robust, e.g., being
less likely to overfit to dataset bias and suffering less when fewer training
samples are available. Codes are available at https://github.com/GCYZSL/CVLMN
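A minimal PyTorch sketch of the idea described above, not the authors' released implementation: a self-attention based module controller produces soft collocation weights over four modules (one linguistic module for function words and three visual modules for nouns, adjectives, and verbs), and a part-of-speech based syntax loss supervises those weights. All dimensions, module bodies, and the POS-to-module mapping are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModuleController(nn.Module):
    """Self-attention controller that predicts soft weights over 4 modules."""
    def __init__(self, d_model=512, n_modules=4, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_weights = nn.Linear(d_model, n_modules)

    def forward(self, dec_state, visual_feats):
        # dec_state: (B, 1, d) partial-caption state; visual_feats: (B, N, d)
        ctx, _ = self.attn(dec_state, visual_feats, visual_feats)
        logits = self.to_weights(ctx.squeeze(1))             # (B, n_modules)
        return F.softmax(logits, dim=-1), logits

class CVLNMSketch(nn.Module):
    """One linguistic + three visual modules, fused by the controller."""
    def __init__(self, d_model=512):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(4)])
        self.controller = ModuleController(d_model)

    def forward(self, dec_state, visual_feats):
        weights, logits = self.controller(dec_state, visual_feats)
        pooled = visual_feats.mean(dim=1)                    # (B, d)
        outs = torch.stack([blk(pooled) for blk in self.blocks], dim=1)
        fused = (weights.unsqueeze(-1) * outs).sum(dim=1)    # weighted collocation
        return fused, logits

def syntax_loss(controller_logits, pos_targets):
    # pos_targets maps each ground-truth word's POS to a module index,
    # e.g. 0=function word, 1=noun, 2=adjective, 3=verb (assumed mapping).
    return F.cross_entropy(controller_logits, pos_targets)

# Usage for one decoding step (random tensors stand in for real features):
fused, logits = CVLNMSketch()(torch.randn(2, 1, 512), torch.randn(2, 36, 512))
loss = syntax_loss(logits, torch.tensor([1, 3]))  # e.g. noun, verb
```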
Related papers
- MedUnA: Language guided Unsupervised Adaptation of Vision-Language Models for Medical Image Classification [14.725941791069852]
We propose Medical Unsupervised Adaptation (MedUnA), consisting of two-stage training: Adapter Pre-training and Unsupervised Learning.
We evaluate the performance of MedUnA on three different kinds of data modalities - chest X-rays, eye fundus and skin lesion images.
arXiv Detail & Related papers (2024-09-03T09:25:51Z) - GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and
reusing ModulEs [64.49176353858792]
We propose generative neuro-symbolic visual reasoning by growing and reusing modules.
The proposed model performs competitively on standard tasks like visual question answering and referring expression comprehension.
It is able to adapt to new visual reasoning tasks by observing a few training examples and reusing modules.
arXiv Detail & Related papers (2023-11-08T18:59:05Z) - Explaining black box text modules in natural language with language
models [86.14329261605]
"Black box" indicates that we only have access to the module's inputs/outputs.
"SASC" is a method that takes in a text module and returns a natural language explanation of the module's selectivity along with a score for how reliable the explanation is.
We show that SASC can generate explanations for the response of individual fMRI voxels to language stimuli, with potential applications to fine-grained brain mapping.
arXiv Detail & Related papers (2023-05-17T00:29:18Z) - Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - Learning to Discretely Compose Reasoning Module Networks for Video
Captioning [81.81394228898591]
We propose a novel visual reasoning approach for video captioning, named Reasoning Module Networks (RMN).
RMN employs 1) three sophisticated spatio-temporal reasoning modules (RMs), and 2) a dynamic and discrete module selector trained by a linguistic loss with a Gumbel approximation.
arXiv Detail & Related papers (2020-07-17T15:27:37Z)
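The RMN summary above mentions a discrete module selector trained with a Gumbel approximation; below is a hedged, generic sketch of that technique (Gumbel-softmax selection over candidate modules), with placeholder module bodies and shapes rather than the RMN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelModuleSelector(nn.Module):
    """Picks one of several reasoning modules per step, differentiably."""
    def __init__(self, d_model=512, n_modules=3, tau=1.0):
        super().__init__()
        self.score = nn.Linear(d_model, n_modules)
        self.blocks = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_modules)])
        self.tau = tau

    def forward(self, state):
        # hard=True gives a (near-)one-hot selection in the forward pass while
        # keeping gradients via the straight-through Gumbel-softmax estimator.
        sel = F.gumbel_softmax(self.score(state), tau=self.tau, hard=True)  # (B, n_modules)
        outs = torch.stack([blk(state) for blk in self.blocks], dim=1)      # (B, n_modules, d)
        return (sel.unsqueeze(-1) * outs).sum(dim=1)                        # (B, d)

out = GumbelModuleSelector()(torch.randn(4, 512))  # one module chosen per sample
```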