Exploring Affordance and Situated Meaning in Image Captions: A
Multimodal Analysis
- URL: http://arxiv.org/abs/2305.14616v2
- Date: Tue, 24 Oct 2023 11:30:07 GMT
- Title: Exploring Affordance and Situated Meaning in Image Captions: A
Multimodal Analysis
- Authors: Pin-Er Chen, Po-Ya Angela Wang, Hsin-Yu Chou, Yu-Hsiang Tseng, Shu-Kai
Hsieh
- Abstract summary: We annotate images from the Flickr30k dataset with five perceptual properties: Affordance, Perceptual Salience, Object Number, Gaze Cueing, and Ecological Niche Association (ENA).
Our findings reveal that images with Gibsonian affordance show a higher frequency of captions containing 'holding-verbs' and 'container-nouns' compared to images displaying telic affordance.
- Score: 1.124958340749622
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper explores the grounding issue regarding multimodal semantic
representation from a computational cognitive-linguistic view. We annotate
images from the Flickr30k dataset with five perceptual properties: Affordance,
Perceptual Salience, Object Number, Gaze Cueing, and Ecological Niche
Association (ENA), and examine their association with textual elements in the
image captions. Our findings reveal that images with Gibsonian affordance show
a higher frequency of captions containing 'holding-verbs' and 'container-nouns'
compared to images displaying telic affordance. Perceptual Salience, Object
Number, and ENA are also associated with the choice of linguistic expressions.
Our study demonstrates that comprehensive understanding of objects or events
requires cognitive attention, semantic nuances in language, and integration
across multiple modalities. We highlight the vital importance of situated
meaning and affordance grounding in natural language understanding, with the
potential to advance human-like interpretation in various scenarios.
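The frequency comparison described in the abstract can be illustrated with a minimal sketch. It computes, per affordance label, the share of captions containing at least one holding-verb. Everything here is an assumption made for illustration: the `HOLDING_VERBS` word list, the `toy_rows` captions and labels, and the function names are hypothetical and are not taken from the paper or from Flickr30k, and the resulting numbers are not the paper's results.

```python
# Hypothetical word list: the paper's actual holding-verb inventory is not given here.
HOLDING_VERBS = {"hold", "holds", "holding", "carry", "carries", "carrying"}


def holding_verb_rate(captions):
    """Return the fraction of captions containing at least one holding-verb."""
    if not captions:
        return 0.0
    hits = sum(1 for c in captions if HOLDING_VERBS & set(c.lower().split()))
    return hits / len(captions)


def rate_by_affordance(rows):
    """Group (label, caption) pairs by label and compute the rate per group."""
    grouped = {}
    for label, caption in rows:
        grouped.setdefault(label, []).append(caption)
    return {label: holding_verb_rate(caps) for label, caps in grouped.items()}


# Made-up toy annotations, only to exercise the functions above.
toy_rows = [
    ("gibsonian", "A man holding a cup of coffee"),
    ("gibsonian", "A girl carrying a basket"),
    ("telic", "A woman drinks coffee at a table"),
    ("telic", "A boy holds a pen and writes"),
]

rates = rate_by_affordance(toy_rows)
for label, rate in sorted(rates.items()):
    print(f"{label}: {rate:.2f}")
```

A real analysis would replace the whitespace tokenizer with lemmatization and test the difference between groups statistically; this sketch only shows the shape of the comparison.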
Related papers
- StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images [5.529078451095096]
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision.
Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics.
Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks.
arXiv Detail & Related papers (2024-06-19T17:59:40Z)
- Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining? [34.609984453754656]
We aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment.
Specifically, we design and release the SNARE, the first large-scale multimodal alignment probing benchmark.
arXiv Detail & Related papers (2023-08-24T16:17:40Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Visual Semantic Parsing: From Images to Abstract Meaning Representation [20.60579156219413]
We propose to leverage a widely-used meaning representation in the field of natural language processing, the Abstract Meaning Representation (AMR).
Our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from visual input.
Our findings point to important future research directions for improved scene understanding.
arXiv Detail & Related papers (2022-10-26T17:06:42Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not reached a conclusive answer yet.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.