Aligning and Prompting Everything All at Once for Universal Visual
Perception
- URL: http://arxiv.org/abs/2312.02153v1
- Date: Mon, 4 Dec 2023 18:59:50 GMT
- Title: Aligning and Prompting Everything All at Once for Universal Visual
Perception
- Authors: Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing
Sun, Yunsheng Wu, Shaohui Lin, Rongrong Ji
- Abstract summary: APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
- Score: 79.96124061108728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision foundation models have been explored recently to build general-purpose
vision systems. However, predominant paradigms, driven by casting
instance-level tasks as an object-word alignment, bring heavy cross-modality
interaction, which is not effective in prompting object detection and visual
grounding. Another line of work that focuses on pixel-level tasks often
encounters a large annotation gap of things and stuff, and suffers from mutual
interference between foreground-object and background-class segmentation. In
stark contrast to the prevailing methods, we present APE, a universal visual
perception model for aligning and prompting everything all at once in an image
to perform diverse tasks, i.e., detection, segmentation, and grounding, as an
instance-level sentence-object matching paradigm. Specifically, APE advances
the convergence of detection and grounding by reformulating language-guided
grounding as open-vocabulary detection, which efficiently scales up model
prompting to thousands of category vocabularies and region descriptions while
maintaining the effectiveness of cross-modality fusion. To bridge the
granularity gap of different pixel-level tasks, APE equalizes semantic and
panoptic segmentation to proxy instance learning by considering any isolated
regions as individual instances. APE aligns vision and language representation
on broad data with natural and challenging characteristics all at once without
task-specific fine-tuning. The extensive experiments on over 160 datasets
demonstrate that, with only one-suit of weights, APE outperforms (or is on par
with) the state-of-the-art models, proving that an effective yet universal
perception for anything aligning and prompting is indeed feasible. Codes and
trained models are released at https://github.com/shenyunhang/APE.
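To make the two core ideas in the abstract concrete, the following is a minimal, hypothetical sketch (not the released APE code): scoring region embeddings against embeddings of category names or region descriptions as an instance-level sentence-object matching step, and splitting a semantic "stuff" mask into connected components so each isolated region is treated as a proxy instance. All function names, shapes, and thresholds here are illustrative assumptions.

```python
# Hypothetical sketch of (1) instance-level sentence-object matching and
# (2) treating isolated mask regions as individual proxy instances.
# Not the released APE implementation; shapes and names are assumptions.
import numpy as np
from scipy import ndimage


def match_regions_to_sentences(region_embeds: np.ndarray,
                               text_embeds: np.ndarray) -> np.ndarray:
    """Cosine-similarity scores between N region and M sentence embeddings."""
    r = region_embeds / np.linalg.norm(region_embeds, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return r @ t.T  # (N, M); argmax over M picks the matched prompt per region


def stuff_mask_to_instances(binary_mask: np.ndarray) -> list[np.ndarray]:
    """Split a semantic ('stuff') mask into connected components, each of
    which is treated as its own proxy instance."""
    labeled, num = ndimage.label(binary_mask)
    return [(labeled == i) for i in range(1, num + 1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 5 region embeddings scored against 1000 category/description prompts.
    scores = match_regions_to_sentences(rng.normal(size=(5, 256)),
                                        rng.normal(size=(1000, 256)))
    print("best prompt per region:", scores.argmax(axis=1))

    mask = np.zeros((8, 8), dtype=bool)
    mask[0:3, 0:3] = True   # one isolated "stuff" blob
    mask[5:8, 5:8] = True   # another isolated blob -> second proxy instance
    print("proxy instances:", len(stuff_mask_to_instances(mask)))
```

The point of the second helper is only to show how background "stuff" classes can be recast as instance-level targets; in practice the paper scales the matching side to thousands of vocabularies and descriptions.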
Related papers
- More Pictures Say More: Visual Intersection Network for Open Set Object Detection [4.206612461069489]
We introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO)
VINO constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps.
Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands.
arXiv Detail & Related papers (2024-08-26T05:52:35Z) - Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP.
Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction.
Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming the previous state-of-the-art on unseen classes across various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z) - ZoomNeXt: A Unified Collaborative Pyramid Network for Camouflaged Object Detection [70.11264880907652]
Recent camouflaged object detection (COD) methods attempt to segment objects visually blended into their surroundings, which is extremely complex and difficult in real-world scenarios.
We propose an effective unified collaborative pyramid network that mimics human behavior when observing vague images and camouflaged objects, zooming in and out.
Our framework consistently outperforms existing state-of-the-art methods in image and video COD benchmarks.
arXiv Detail & Related papers (2023-10-31T06:11:23Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - Phrase-Based Affordance Detection via Cyclic Bilateral Interaction [17.022853987801877]
We explore perceiving affordance from a vision-language perspective and consider the challenging phrase-based affordance detection problem.
We propose a cyclic bilateral consistency enhancement network (CBCE-Net) to align language and vision features progressively.
Specifically, the presented CBCE-Net consists of a mutual guided vision-language module that updates the common features of vision and language in a progressive manner, and a cyclic interaction module (CIM) that facilitates the perception of possible interaction with objects in a cyclic manner.
arXiv Detail & Related papers (2022-02-24T13:02:27Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Distribution Alignment: A Unified Framework for Long-tail Visual
Recognition [52.36728157779307]
We propose a unified distribution alignment strategy for long-tail visual recognition.
We then introduce a generalized re-weighting method in the two-stage learning to balance the class prior.
Our approach achieves the state-of-the-art results across all four recognition tasks with a simple and unified framework.
arXiv Detail & Related papers (2021-03-30T14:09:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.