Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport
- URL: http://arxiv.org/abs/2503.15337v1
- Date: Wed, 19 Mar 2025 15:33:44 GMT
- Title: Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport
- Authors: Hao Tan, Zichang Tan, Jun Li, Ajian Liu, Jun Wan, Zhen Lei
- Abstract summary: We present RAM (Recover And Match), a novel framework that effectively addresses the above issues. RAM achieves state-of-the-art performance on various datasets from three distinct domains.
- Score: 45.866011150937425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce refocusing on local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains, and shows great potential to boost the existing methods. Code: https://github.com/EricTan7/RAM.
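To make the contrast between average pooling and transport-based matching concrete, here is a minimal sketch in pure Python. It is not RAM's KCOT: it runs plain entropic optimal transport (Sinkhorn iterations) with uniform marginals, and the knowledge constraints described in the abstract are not modeled. The names `cosine`, `sinkhorn`, and `label_scores`, the toy 2-D features, and the values of `eps` and `n_iters` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sinkhorn(cost, row_mass, col_mass, eps=0.1, n_iters=200):
    """Entropic OT via Sinkhorn scaling (generic, not the paper's KCOT).

    Returns a plan whose column sums equal col_mass and whose row sums
    approach row_mass as the iterations converge.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: cheap region-label pairs receive large entries.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [row_mass[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def label_scores(regions, labels, eps=0.1):
    """Score each label by OT-weighted similarity and by average pooling."""
    n, m = len(regions), len(labels)
    sim = [[cosine(r, l) for l in labels] for r in regions]
    cost = [[1.0 - s for s in row] for row in sim]
    # Assumption: uniform mass on regions and on labels.
    col_mass = [1.0 / m] * m
    plan = sinkhorn(cost, [1.0 / n] * n, col_mass, eps)
    # Weight each region's similarity by the share of the label's mass it carries.
    ot = [sum(plan[i][j] * sim[i][j] for i in range(n)) / col_mass[j] for j in range(m)]
    # Baseline: every region contributes equally, relevant or not.
    avg = [sum(sim[i][j] for i in range(n)) / n for j in range(m)]
    return ot, avg, plan

# Toy input: two regions resemble label 0, one region resembles label 1.
regions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = [[1.0, 0.0], [0.0, 1.0]]
ot_scores, avg_scores, plan = label_scores(regions, labels)
```

In this toy setup the region that resembles label 1 carries almost none of label 0's transported mass, so it barely affects label 0's OT score, whereas it pulls the average-pooled score down. Smaller values of `eps` sharpen the plan but may need more iterations for the row sums to converge.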
Related papers
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- Improving Weakly-Supervised Object Localization Using Adversarial Erasing and Pseudo Label [7.400926717561454]
This paper investigates a framework for weakly-supervised object localization.
It aims to train a neural network capable of predicting both the object class and its location using only images and their image-level class labels.
arXiv Detail & Related papers (2024-04-15T06:02:09Z)
- Unsupervised Adaptation of Polyp Segmentation Models via Coarse-to-Fine Self-Supervision [16.027843524655516]
We study a practical problem of Source-Free Domain Adaptation (SFDA), which eliminates the reliance on annotated source data.
Current SFDA methods focus on extracting domain knowledge from the source-trained model but neglect the intrinsic structure of the target domain.
We propose a new SFDA framework, called Region-to-Pixel Adaptation Network (RPANet), which learns the region-level and pixel-level discriminative representations through coarse-to-fine self-supervision.
arXiv Detail & Related papers (2023-08-13T02:37:08Z)
- Adaptive Face Recognition Using Adversarial Information Network [57.29464116557734]
Face recognition models often degenerate when training data are different from testing data.
We propose a novel adversarial information network (AIN) to address it.
arXiv Detail & Related papers (2023-05-23T02:14:11Z)
- CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding [86.79903269137971]
Unsupervised visual grounding has been developed to locate regions using pseudo-labels.
We propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels.
Our method outperforms the current state-of-the-art unsupervised method by a significant margin on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-05-15T14:42:02Z)
- Semantic-diversity transfer network for generalized zero-shot learning via inner disagreement based OOD detector [26.89763840782029]
Zero-shot learning (ZSL) aims to recognize objects from unseen classes, where the core problem is to transfer knowledge from seen classes to unseen classes.
The knowledge transfer in many existing works is limited mainly due to the facts that 1) the widely used visual features are global ones but not totally consistent with semantic attributes.
We propose a Semantic-diversity transfer Network (SetNet) addressing the first two limitations, where 1) a multiple-attention architecture and a diversity regularizer are proposed to learn multiple local visual features that are more consistent with semantic attributes and 2) a projector ensemble that geometrically takes diverse local features as inputs
arXiv Detail & Related papers (2022-03-17T01:31:27Z)
- Coarse to Fine: Domain Adaptive Crowd Counting via Adversarial Scoring Network [58.05473757538834]
This paper proposes a novel adversarial scoring network (ASNet) to bridge the gap across domains from coarse to fine granularity.
Three sets of migration experiments show that the proposed methods achieve state-of-the-art counting performance.
arXiv Detail & Related papers (2021-07-27T14:47:24Z)
- Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association [94.7030305679589]
We propose a novel framework to jointly address the above-mentioned issues.
We introduce a global loss into the modality alignment process.
The proposed method outperforms the previous methods in multiple settings.
arXiv Detail & Related papers (2021-03-12T14:10:48Z)
- Find it if You Can: End-to-End Adversarial Erasing for Weakly-Supervised Semantic Segmentation [6.326017213490535]
We propose a novel formulation of adversarial erasing of the attention maps.
The proposed solution does not require saliency masks; instead, it uses a regularization loss to prevent the attention maps from spreading to less discriminative object regions.
Our experiments on the Pascal VOC dataset demonstrate that our adversarial approach increases segmentation performance by 2.1 mIoU compared to our baseline and by 1.0 mIoU compared to previous adversarial erasing approaches.
arXiv Detail & Related papers (2020-11-09T18:35:35Z)
- Contextual-Relation Consistent Domain Adaptation for Semantic Segmentation [44.19436340246248]
This paper presents an innovative local contextual-relation consistent domain adaptation technique.
It aims to achieve local-level consistencies during the global-level alignment.
Experiments demonstrate its superior segmentation performance as compared with state-of-the-art methods.
arXiv Detail & Related papers (2020-07-05T19:00:46Z)