Neural Twins Talk
- URL: http://arxiv.org/abs/2009.12524v1
- Date: Sat, 26 Sep 2020 06:58:58 GMT
- Title: Neural Twins Talk
- Authors: Zanyar Zohourianshahzadi (UCCS) and Jugal Kumar Kalita (UCCS)
- Abstract summary: We introduce a novel twin cascaded attention model that outperforms a state-of-the-art image captioning model.
Visual grounding ensures that the words in the caption sentence are grounded in particular regions of the input image.
We report the results of our experiments on three image captioning tasks on the COCO dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by how the human brain employs more neural pathways when increasing its focus on a subject, we introduce a novel twin cascaded attention model that outperforms a state-of-the-art image captioning model originally implemented with a single channel of attention for the visual grounding task. Visual grounding ensures that the words in the caption sentence are grounded in particular regions of the input image. After a deep learning model is trained on the visual grounding task, it employs the learned patterns of visual grounding and the order of objects in caption sentences when generating captions. We report the results of our experiments on three image captioning tasks on the COCO dataset, using standard image captioning metrics to show the improvements our model achieves over the previous image captioning model. The results suggest that employing more parallel attention pathways in a deep neural network leads to higher performance. Our implementation of NTT is publicly available at: https://github.com/zanyarz/NeuralTwinsTalk.
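To make the core idea concrete, here is a minimal PyTorch sketch of a decoder step with two parallel ("twin") attention pathways over image region features, fused before word prediction. This is an illustration of parallel attention pathways, not the authors' actual NTT architecture; the additive-attention form, layer sizes, and fusion layer are assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """One attention pathway over image region features (illustrative)."""
    def __init__(self, feat_dim, hid_dim):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hid_dim)
        self.proj_hid = nn.Linear(hid_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, regions, h):
        # regions: (B, R, feat_dim) region features; h: (B, hid_dim) decoder state
        e = self.score(torch.tanh(self.proj_feat(regions) +
                                  self.proj_hid(h).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)       # (B, R, 1) weights over regions
        return (alpha * regions).sum(dim=1)   # (B, feat_dim) attended context

class TwinAttentionDecoderStep(nn.Module):
    """Two parallel pathways whose contexts are fused before word prediction."""
    def __init__(self, feat_dim=2048, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.pathway_a = AdditiveAttention(feat_dim, hid_dim)
        self.pathway_b = AdditiveAttention(feat_dim, hid_dim)
        self.fuse = nn.Linear(2 * feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, regions, h):
        # Run both pathways on the same regions, then fuse their contexts.
        ctx = torch.cat([self.pathway_a(regions, h),
                         self.pathway_b(regions, h)], dim=-1)
        return self.out(torch.tanh(self.fuse(ctx)))  # next-word logits
```

As a usage sketch, `TwinAttentionDecoderStep()(torch.randn(4, 36, 2048), torch.randn(4, 512))` returns next-word logits for a batch of 4 images, each with 36 region features (e.g., Faster R-CNN detections).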
Related papers
- Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experiments conducted on the COCO dataset demonstrate that employing an explicit external memory can aid the generation process and increase caption quality (a minimal sketch of a kNN-augmented attention step appears after this list).
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods (a related CLIP-based reranking sketch appears after this list).
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- Connecting What to Say With Where to Look by Modeling Human Attention Traces [30.8226861256742]
We introduce a unified framework to jointly model images, text, and human attention traces.
We propose two novel tasks: (1) predict a trace given an image and caption (i.e., visual grounding), and (2) predict a caption and a trace given only an image.
arXiv Detail & Related papers (2021-05-12T20:53:30Z)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
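For the Retrieval-Augmented Transformer entry above, the following is a minimal, hypothetical sketch of a kNN-augmented attention step, not the paper's exact layer; the function name, shapes, and dot-product retrieval are assumptions. It retrieves the top-k most similar entries from an external key/value memory and attends over local and retrieved entries jointly.

```python
import torch
import torch.nn.functional as F

def knn_augmented_attention(q, k, v, mem_k, mem_v, topk=8):
    # q: (B, D) query; k, v: (B, T, D) local keys/values;
    # mem_k, mem_v: (N, D) external memory of keys and values.
    sims = q @ mem_k.t()                      # (B, N) dot-product similarity
    idx = sims.topk(topk, dim=-1).indices     # (B, topk) nearest memory entries
    keys = torch.cat([k, mem_k[idx]], dim=1)  # (B, T + topk, D)
    vals = torch.cat([v, mem_v[idx]], dim=1)
    att = F.softmax((keys @ q.unsqueeze(-1)).squeeze(-1)
                    / keys.size(-1) ** 0.5, dim=-1)  # (B, T + topk) weights
    return (att.unsqueeze(-1) * vals).sum(dim=1)     # (B, D) attended context
```

The design point is simply that retrieved memory entries enter the same softmax as local keys, so the decoder can weigh external knowledge against the current context.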
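For the Zero-Shot Image-to-Text entry, that paper generates text at inference time; as a simpler illustration of repurposing a contrastive image-text matcher, the sketch below ranks placeholder candidate captions against an image using OpenAI's CLIP package (the image path and candidate strings are hypothetical).

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: any image file and a pool of candidate captions.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
candidates = ["a dog playing in a park", "a plate of food on a table"]
tokens = clip.tokenize(candidates).to(device)

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(tokens)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)  # unit-normalize
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    scores = (img_f @ txt_f.t()).squeeze(0)  # cosine similarity per caption

print(candidates[scores.argmax().item()])  # best-matching caption
```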