Neural Twins Talk & Alternative Calculations
- URL: http://arxiv.org/abs/2108.02807v1
- Date: Thu, 5 Aug 2021 18:41:34 GMT
- Title: Neural Twins Talk & Alternative Calculations
- Authors: Zanyar Zohourianshahzadi, Jugal K. Kalita
- Abstract summary: Inspired by how the human brain employs a higher number of neural pathways when describing a highly focused subject, we show that deep attentive models could be extended to achieve better performance.
Image captioning bridges a gap between computer vision and natural language processing.
- Score: 3.198144010381572
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Inspired by how the human brain employs a higher number of neural pathways
when describing a highly focused subject, we show that deep attentive models
used for the main vision-language task of image captioning could be extended
to achieve better performance. Image captioning bridges a gap between computer
vision and natural language processing. Automated image captioning is used as a
tool to eliminate the need for a human agent to create descriptive captions
for unseen images. Automated image captioning is challenging yet
interesting. One reason is that AI-based systems capable of generating
sentences that describe an input image could be used in a wide variety of tasks
beyond generating captions for unseen images found on the web or uploaded to social
media. For example, in biology and medical sciences, these systems could
provide researchers and physicians with a brief linguistic description of
relevant images, potentially expediting their work.
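A minimal sketch of the general idea, widening an attentive image-captioning decoder with a second, parallel attention pathway over image regions, is given below. It is written in PyTorch for illustration only: the layer sizes, the use of two multi-head attention blocks, and the additive fusion of the two pathways are assumptions made for this sketch, not the exact Neural Twins Talk architecture.

```python
import torch
import torch.nn as nn

class TwinAttentionDecoderStep(nn.Module):
    """One decoding step with two parallel ("twin") attention pathways over
    image region features. Sizes and the additive fusion are illustrative
    assumptions, not the paper's exact architecture."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)        # project CNN/detector region features
        self.attn_a = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.attn_b = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.lstm = nn.LSTMCell(hidden_dim * 2, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, regions, state):
        # word_emb: (batch, hidden_dim) embedding of the previously generated word
        # regions:  (batch, num_regions, feat_dim) visual features of detected regions
        r = self.proj(regions)
        h, c = state
        query = h.unsqueeze(1)                             # previous hidden state attends to regions
        ctx_a, _ = self.attn_a(query, r, r)                # pathway A
        ctx_b, _ = self.attn_b(query, r, r)                # pathway B (the "twin")
        ctx = (ctx_a + ctx_b).squeeze(1)                   # fuse the two pathways
        h, c = self.lstm(torch.cat([word_emb, ctx], dim=-1), (h, c))
        return self.out(h), (h, c)                         # next-word logits and new state

# Hypothetical usage with random stand-ins for real features:
step = TwinAttentionDecoderStep()
regions = torch.randn(4, 36, 2048)                         # e.g. 36 detected regions per image
h = c = torch.zeros(4, 512)
prev_word = torch.randn(4, 512)
logits, (h, c) = step(prev_word, regions, (h, c))
```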
Related papers
- Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
- Representing visual classification as a linear combination of words [0.0]
We present an explainability strategy that uses a vision-language model to identify language-based descriptors of a visual classification task.
By leveraging a pre-trained joint embedding space between images and text, our approach estimates a new classification task as a linear combination of words (a rough sketch of this idea appears after this list).
We find that the resulting descriptors largely align with clinical knowledge despite a lack of domain-specific language training.
arXiv Detail & Related papers (2023-11-18T02:00:20Z)
- Multimodal Neurons in Pretrained Text-Only Transformers [52.20828443544296]
We identify "multimodal neurons" that convert visual representations into corresponding text.
We show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
arXiv Detail & Related papers (2023-08-03T05:27:12Z)
- Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z)
- Vision-Language Models in Remote Sensing: Current Progress and Future Trends [25.017685538386548]
Vision-language models enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics.
Vision-language models can go beyond visual recognition of RS images, model semantic relationships, and generate natural language descriptions of the image.
This paper provides a comprehensive review of the research on vision-language models in remote sensing.
arXiv Detail & Related papers (2023-05-09T19:17:07Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Describing image focused in cognitive and visual details for visually impaired people: An approach to generating inclusive paragraphs [2.362412515574206]
There is a lack of services that support specific tasks, such as understanding the image context presented in online content, e.g., webinars.
We propose an approach for generating the context of webinar images by combining a dense captioning technique with a set of filters, to fit the captions to our domain, and a language model for the abstractive summarization task.
arXiv Detail & Related papers (2022-02-10T21:20:53Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- Goal-driven text descriptions for images [7.059848512713061]
This thesis focuses on generating textual output given visual input.
We use a comprehension machine to guide the generated referring expressions to be more discriminative.
In Chapter 5, we study how training objectives and sampling methods affect the models' ability to generate diverse captions.
arXiv Detail & Related papers (2021-08-28T05:10:38Z)
- Controlled Caption Generation for Images Through Adversarial Attacks [85.66266989600572]
We study adversarial examples for vision-and-language models, which typically adopt a Convolutional Neural Network (CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation.
In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network.
We propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN.
arXiv Detail & Related papers (2021-07-07T07:22:41Z)
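As a companion to the "Representing visual classification as a linear combination of words" entry above, here is a rough sketch of how such an idea could be set up: approximate a class weight vector in a joint image-text embedding space as a sparse weighted sum of word embeddings. The word list, the lasso regression, and the random stand-in embeddings are hypothetical choices for this sketch, not the paper's actual method.

```python
# Sketch: explain a visual class vector as a sparse linear combination of words.
# All names, dimensions, and the use of lasso regression are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso

def describe_class_as_words(class_weight, word_embeddings, words, top_k=5):
    """class_weight: (d,) weight vector of one visual class in the joint space.
    word_embeddings: (n_words, d) text embeddings of candidate descriptor words."""
    reg = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
    reg.fit(word_embeddings.T, class_weight)        # solve  W^T a ≈ class_weight
    coefs = reg.coef_
    top = np.argsort(-np.abs(coefs))[:top_k]        # keep the strongest descriptors
    return [(words[i], float(coefs[i])) for i in top]

# Hypothetical usage with random stand-ins for real joint-embedding vectors:
rng = np.random.default_rng(0)
d, n_words = 512, 1000
words = [f"word_{i}" for i in range(n_words)]
W = rng.normal(size=(n_words, d))
class_vec = 0.7 * W[3] + 0.3 * W[42]                # a class built from two "words"
print(describe_class_as_words(class_vec, W, words))
```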
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.