Intentonomy: a Dataset and Study towards Human Intent Understanding
- URL: http://arxiv.org/abs/2011.05558v2
- Date: Sun, 28 Mar 2021 02:24:46 GMT
- Title: Intentonomy: a Dataset and Study towards Human Intent Understanding
- Authors: Menglin Jia and Zuxuan Wu and Austin Reiter and Claire Cardie and
Serge Belongie and Ser-Nam Lim
- Abstract summary: We study the intent behind social media images with an aim to analyze how visual information can help the recognition of human intent.
We introduce an intent dataset, Intentonomy, comprising 14K images covering a wide range of everyday scenes.
We then systematically study whether, and to what extent, commonly used visual information, i.e., object and context, contribute to human motive understanding.
- Score: 65.49299806821791
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An image is worth a thousand words, conveying information that goes beyond
the physical visual content therein. In this paper, we study the intent behind
social media images with an aim to analyze how visual information can help the
recognition of human intent. Towards this goal, we introduce an intent dataset,
Intentonomy, comprising 14K images covering a wide range of everyday scenes.
These images are manually annotated with 28 intent categories that are derived
from a social psychology taxonomy. We then systematically study whether, and to
what extent, commonly used visual information, i.e., object and context,
contribute to human motive understanding. Based on our findings, we conduct
further study to quantify the effect of attending to object and context classes
as well as textual information in the form of hashtags when training an intent
classifier. Our results quantitatively and qualitatively shed light on how
visual and textual information can produce observable effects when predicting
intent.
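The abstract describes training an intent classifier that attends to object and context cues as well as hashtag text. As a rough, hedged illustration (not the authors' released code), the sketch below fuses image features from a ResNet-50 backbone with mean-pooled hashtag embeddings and feeds them to a 28-way head matching the 28 intent categories; the multi-label (one-sigmoid-per-category) setup, the backbone choice, the embedding size, and the hashtag vocabulary size are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted above), PyTorch + torchvision.
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_INTENTS = 28        # intent categories in Intentonomy (from the abstract)
HASHTAG_VOCAB = 10_000  # hypothetical hashtag vocabulary size

class IntentClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)           # visual encoder
        backbone.fc = nn.Identity()                 # expose 2048-d pooled features
        self.visual = backbone
        # Hashtags as a bag of embeddings, mean-pooled over the tags of a post.
        self.tags = nn.EmbeddingBag(HASHTAG_VOCAB, 256, mode="mean")
        self.head = nn.Linear(2048 + 256, NUM_INTENTS)

    def forward(self, images, tag_ids, tag_offsets):
        v = self.visual(images)                     # (B, 2048) image features
        t = self.tags(tag_ids, tag_offsets)         # (B, 256) hashtag features
        return self.head(torch.cat([v, t], dim=1))  # multi-label intent logits

model = IntentClassifier()
criterion = nn.BCEWithLogitsLoss()                  # one-vs-all over 28 intents

# Toy batch: 2 images, flattened hashtag ids with per-sample offsets.
images = torch.randn(2, 3, 224, 224)
tag_ids = torch.tensor([3, 17, 42, 5])
tag_offsets = torch.tensor([0, 3])                  # sample 0 has 3 tags, sample 1 has 1
targets = torch.zeros(2, NUM_INTENTS)
targets[0, 4] = 1.0                                 # e.g. one active intent for image 0
loss = criterion(model(images, tag_ids, tag_offsets), targets)
loss.backward()
```

Late fusion by concatenation is just one simple way to let the classifier attend to both modalities; the paper's actual study of object, context, and hashtag contributions may use different architectures and attention mechanisms.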
Related papers
- Structuring Quantitative Image Analysis with Object Prominence [0.0]
We suggest carefully considering objects' prominence as an essential step in analyzing images as data.
Our approach combines qualitative analyses with the scalability of quantitative approaches.
arXiv Detail & Related papers (2024-08-30T19:05:28Z)
- An Image-based Typology for Visualization [23.716718517642878]
We present and discuss the results of a qualitative analysis of visual representations from images.
We derive a typology of 10 visualization types of clearly defined groups.
We provide a dataset of 6,833 tagged images and an online tool that can be used to explore and analyze the large set of labeled images.
arXiv Detail & Related papers (2024-03-07T04:33:42Z)
- CLiC: Concept Learning in Context [54.81654147248919]
This paper builds upon recent advancements in visual concept learning.
It involves acquiring a visual concept from a source image and subsequently applying it to an object in a target image.
To localize the concept learning, we employ soft masks that contain both the concept within the mask and the surrounding image area.
arXiv Detail & Related papers (2023-11-28T01:33:18Z)
- Impressions: Understanding Visual Semiotics and Aesthetic Impact [66.40617566253404]
We present Impressions, a novel dataset through which to investigate the semiotics of images.
We show that existing multimodal image captioning and conditional generation models struggle to simulate plausible human responses to images.
This dataset significantly improves their ability to model impressions and aesthetic evaluations of images through fine-tuning and few-shot adaptation.
arXiv Detail & Related papers (2023-10-27T04:30:18Z)
- Exploring Affordance and Situated Meaning in Image Captions: A Multimodal Analysis [1.124958340749622]
We annotate images from the Flickr30k dataset with five perceptual properties: Affordance, Perceptual Salience, Object Number, Gaze Cueing, and Ecological Niche Association (ENA).
Our findings reveal that images with Gibsonian affordance show a higher frequency of captions containing 'holding-verbs' and 'container-nouns' compared to images displaying telic affordance.
arXiv Detail & Related papers (2023-05-24T01:30:50Z)
- Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning [78.07495777674747]
We argue that by using visual clues to bridge large pretrained vision foundation models and language models, we can generate image paragraph captions without any extra cross-modal training.
Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image.
We use a large language model to produce a series of comprehensive descriptions of the visual content, which are then verified by the vision model to select the candidate that aligns best with the image (a minimal sketch of this selection step follows the list below).
arXiv Detail & Related papers (2022-06-03T22:33:09Z)
- Enhancing Social Relation Inference with Concise Interaction Graph and Discriminative Scene Representation [56.25878966006678]
We propose an approach of PRactical Inference in Social rElation (PRISE).
It concisely learns interactive features of persons and discriminative features of holistic scenes.
PRISE achieves a 6.8% improvement for domain classification on the PIPA dataset.
arXiv Detail & Related papers (2021-07-30T04:20:13Z)
- ArtEmis: Affective Language for Visual Art [46.643106054408285]
We focus on the affective experience triggered by visual artworks.
We ask the annotators to indicate the dominant emotion they feel for a given image.
This leads to a rich set of signals for both the objective content and the affective impact of an image.
arXiv Detail & Related papers (2021-01-19T01:03:40Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms a visual-only state-of-the-art method MoCo.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
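The Visual Clues entry above outlines a zero-shot pipeline: a vision foundation model supplies semantic clues, a large language model drafts candidate descriptions, and the vision model re-scores them to keep the best-aligned one. The sketch below illustrates only that final selection step, using CLIP via the Hugging Face transformers library as a stand-in for whatever foundation model the paper actually uses; the candidate captions and the placeholder image are invented.

```python
# Sketch of the candidate-selection step: score LLM-generated descriptions
# against the image with a vision-language model and keep the best-aligned one.
# CLIP is an illustrative stand-in, not necessarily the paper's model.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder for a real photo
candidates = [  # hypothetical LLM-generated description candidates (truncated)
    "A family shares a picnic in a sunny park.",
    "A crowded subway platform during rush hour.",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
scores = model(**inputs).logits_per_image[0]   # image-text alignment per candidate
best = candidates[scores.argmax().item()]      # keep the best-aligned description
print(best)
```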