Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition
- URL: http://arxiv.org/abs/2504.18201v1
- Date: Fri, 25 Apr 2025 09:31:03 GMT
- Title: Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition
- Authors: Yin Tang, Jiankai Li, Hongyu Yang, Xuan Dong, Lifeng Fan, Weixin Li
- Abstract summary: We introduce a novel approach named Multi-grained Compositional visual Clue Learning (MCCL) to address these challenges for image intent recognition. Our method leverages the systematic compositionality of human cognition by breaking down intent recognition into visual clue composition and integrating multi-grained features. We treat intent recognition as a multi-label classification problem, using a graph convolutional network to infuse prior knowledge through label embedding correlations.
- Score: 14.435031094457026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In an era where social media platforms abound, individuals frequently share images that offer insights into their intents and interests, impacting individual life quality and societal stability. Traditional computer vision tasks, such as object detection and semantic segmentation, focus on concrete visual representations, while intent recognition relies more on implicit visual clues. This poses challenges due to the wide variation and subjectivity of such clues, compounded by the problem of intra-class variety in conveying abstract concepts, e.g. "enjoy life". Existing methods seek to solve the problem by manually designing representative features or by building prototypes for each class from global features. However, these methods still struggle with the large visual diversity of each intent category. In this paper, we introduce a novel approach named Multi-grained Compositional visual Clue Learning (MCCL) to address these challenges for image intent recognition. Our method leverages the systematic compositionality of human cognition by breaking intent recognition down into visual clue composition and integrating multi-grained features. We adopt class-specific prototypes to alleviate data imbalance, and we treat intent recognition as a multi-label classification problem, using a graph convolutional network to infuse prior knowledge through label embedding correlations. As demonstrated by state-of-the-art performance on the Intentonomy and MDID datasets, our approach improves on the accuracy of existing methods while also offering good interpretability. Our work is a step toward future explorations in understanding complex and varied forms of human expression.
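The label-correlation component described in the abstract lends itself to a compact illustration. The PyTorch sketch below shows one common way to realize a GCN over label embeddings whose outputs act as per-class classifiers for multi-label prediction; the class count, dimensions, and adjacency construction are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelGCNHead(nn.Module):
    """Multi-label head in the spirit of MCCL's label-correlation GCN:
    label embeddings are propagated over a correlation graph, and the
    propagated embeddings act as per-class classifiers on top of pooled
    image features. Names and dimensions are illustrative assumptions."""

    def __init__(self, num_labels: int, embed_dim: int, feat_dim: int, adj: torch.Tensor):
        super().__init__()
        # Prior knowledge enters via a row-normalized label co-occurrence
        # adjacency matrix of shape (L, L), estimated from training data.
        self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6))
        self.label_embed = nn.Parameter(torch.randn(num_labels, embed_dim))
        self.gcn1 = nn.Linear(embed_dim, feat_dim)
        self.gcn2 = nn.Linear(feat_dim, feat_dim)

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        # Two propagation steps along the label graph.
        h = F.leaky_relu(self.adj @ self.gcn1(self.label_embed))  # (L, D)
        w = self.adj @ self.gcn2(h)                                # (L, D)
        # Each propagated label embedding scores the image feature.
        return img_feat @ w.t()                                    # (B, L) logits

# Usage: logits feed a standard multi-label objective.
adj = torch.rand(28, 28)                 # placeholder co-occurrence statistics
head = LabelGCNHead(num_labels=28, embed_dim=300, feat_dim=2048, adj=adj)
logits = head(torch.randn(4, 2048))      # e.g. pooled backbone features
loss = F.binary_cross_entropy_with_logits(
    logits, torch.randint(0, 2, (4, 28)).float())
```

In this formulation, prior knowledge enters only through the adjacency matrix, so label statistics estimated from the training set can steer the classifiers without extra annotation.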
Related papers
- Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning [58.73625654718187]
Generalized zero-shot learning aims to recognize both seen and unseen classes with the help of semantic information that is shared among different classes. Existing approaches fine-tune the visual backbone on seen-class data to obtain semantic-related visual features. This paper proposes a novel visual and semantic prompt collaboration framework, which utilizes prompt tuning techniques for efficient feature adaptation.
arXiv Detail & Related papers (2025-03-29T10:17:57Z)
- Exploiting Contextual Uncertainty of Visual Data for Efficient Training of Deep Models [0.65268245109828]
We introduce the notion of contextual diversity for active learning (CDAL).
We propose a data repair algorithm to curate contextually fair data to reduce model bias.
We are developing an image retrieval system for wildlife camera-trap images and a reliable warning system for poor-quality rural roads.
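Taken at face value, "contextual diversity" suggests an acquisition function that prefers samples whose predictions diverge from what has already been selected. The sketch below implements a generic divergence-based greedy selector as a stand-in; it is an assumption-laden illustration of the summarized idea, not the CDAL criterion itself.

```python
import torch

def diversity_select(probs: torch.Tensor, budget: int) -> list[int]:
    """Greedy selection of unlabeled samples whose predictive distributions
    diverge most from those already chosen -- a generic diversity proxy
    sketched from the summary, not the exact CDAL criterion.

    probs: (N, C) softmax outputs of the current model on the unlabeled pool.
    """
    # Symmetrized KL divergence between every pair of predictions.
    logp = probs.clamp(min=1e-8).log()
    kl = (probs[:, None] * (logp[:, None] - logp[None])).sum(-1)
    dist = kl + kl.t()                        # (N, N)

    selected = [int(dist.sum(1).argmax())]    # seed with the most atypical sample
    for _ in range(budget - 1):
        # Farthest-first: pick the sample farthest from its nearest selection.
        d = dist[:, selected].min(dim=1).values
        d[selected] = -1.0                    # never re-pick
        selected.append(int(d.argmax()))
    return selected

# Usage: query annotations for `idx`, retrain the model, and repeat.
pool_probs = torch.softmax(torch.randn(1000, 10), dim=-1)
idx = diversity_select(pool_probs, budget=32)
```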
arXiv Detail & Related papers (2024-11-04T09:43:33Z)
- Self-Explainable Affordance Learning with Embodied Caption [63.88435741872204]
We introduce Self-Explainable Affordance learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning.
We propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z)
- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process.
We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
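A minimal way to picture "bounding the information flow" is to mask attention so that image queries inside each subject's layout region attend only to that subject's prompt tokens. The sketch below implements that simplification; the paper's actual masking rules, applied inside a diffusion model's sampling loop, are more involved.

```python
import torch

def bounded_cross_attention(q, k, v, token_groups, query_masks):
    """Training-free attention bounding (simplified): image queries inside a
    subject's layout region may only attend to that subject's prompt tokens.
    Assumes non-overlapping regions; the paper's actual masking is richer.

    q: (B, Nq, D) image queries; k, v: (B, Nk, D) prompt keys/values
    token_groups: one LongTensor of prompt-token indices per subject
    query_masks:  one BoolTensor (Nq,) per subject marking queries in its box
    """
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5   # (B, Nq, Nk)
    block = torch.zeros_like(scores, dtype=torch.bool)
    for toks, qmask in zip(token_groups, query_masks):
        # This subject's queries are blocked from all tokens outside its group.
        outside = torch.ones(scores.shape[-1], dtype=torch.bool)
        outside[toks] = False
        block[:, qmask] |= outside
    scores = scores.masked_fill(block, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```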
arXiv Detail & Related papers (2024-03-25T17:52:07Z)
- CLiC: Concept Learning in Context [54.81654147248919]
This paper builds upon recent advancements in visual concept learning.
It involves acquiring a visual concept from a source image and subsequently applying it to an object in a target image.
To localize the concept learning, we employ soft masks that contain both the concept within the mask and the surrounding image area.
arXiv Detail & Related papers (2023-11-28T01:33:18Z)
- Semantic-Aware Fine-Grained Correspondence [8.29030327276322]
We propose to learn semantic-aware fine-grained correspondence using image-level self-supervised methods.
We design a pixel-level self-supervised learning objective which specifically targets fine-grained correspondence.
Our method surpasses previous state-of-the-art self-supervised methods using convolutional networks on a variety of visual correspondence tasks.
arXiv Detail & Related papers (2022-07-21T12:51:41Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
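One way to read "contrastive learning from data-driven semantic slots" is: softly assign pixels to shared learnable prototypes, pool per-slot features in each augmented view, and contrast matching slots across views. The sketch below illustrates that reading with simplified dimensions; it is not the authors' code.

```python
import torch
import torch.nn.functional as F

def slot_contrastive_loss(feat_a, feat_b, prototypes, tau=0.1):
    """Joint semantic grouping and representation learning, simplified:
    pixels are softly assigned to shared prototypes ("slots"), slot features
    are pooled per view, and matching slots across two augmented views are
    contrasted. Dimensions and details are illustrative assumptions.

    feat_a, feat_b: (N, D) pixel features from two views of the same image
    prototypes:     (K, D) learnable slot prototypes
    """
    def pool_slots(feat):
        # Soft pixel-to-slot assignment, then assignment-weighted pooling.
        assign = torch.softmax(feat @ prototypes.t() / tau, dim=-1)  # (N, K)
        return F.normalize(assign.t() @ feat, dim=-1)                # (K, D)

    sa, sb = pool_slots(feat_a), pool_slots(feat_b)
    # InfoNCE: slot k in view A should match slot k in view B.
    logits = sa @ sb.t() / tau                                       # (K, K)
    return F.cross_entropy(logits, torch.arange(len(prototypes)))
```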
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Building a visual semantics aware object hierarchy [0.0]
We propose a novel unsupervised method to build a visual-semantics-aware object hierarchy.
Our intuition in this paper comes from real-world knowledge representation where concepts are hierarchically organized.
The evaluation consists of two parts: first, we apply the constructed hierarchy to the object recognition task; then, we compare our visual hierarchy with existing lexical hierarchies to show the validity of our method.
arXiv Detail & Related papers (2022-02-26T00:10:21Z)
- Exploring Visual Engagement Signals for Representation Learning [56.962033268934015]
We present VisE, a weakly supervised learning approach that maps social images to pseudo labels derived from clustered engagement signals.
We then study how models trained in this way benefit subjective downstream computer vision tasks such as emotion recognition or political bias detection.
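The pipeline implied here is straightforward to sketch: cluster per-image engagement statistics and use the cluster ids as pseudo labels for ordinary supervised pre-training. The snippet below, using scikit-learn's KMeans, is an illustrative stand-in with an assumed feature layout and cluster count.

```python
import numpy as np
from sklearn.cluster import KMeans

def engagement_pseudo_labels(engagement: np.ndarray, k: int = 256) -> np.ndarray:
    """Cluster per-image engagement statistics (e.g. normalized reaction
    counts) and return cluster ids as pseudo labels for supervised
    pre-training. `k` and the feature layout are illustrative assumptions.

    engagement: (N, S) matrix, one row of engagement signals per image.
    """
    x = np.log1p(engagement)                  # tame heavy-tailed counts
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)

# The resulting ids train an ordinary classifier whose features transfer to
# subjective downstream tasks such as emotion recognition.
labels = engagement_pseudo_labels(np.random.rand(10000, 6) * 100, k=64)
```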
arXiv Detail & Related papers (2021-04-15T20:50:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.