UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
- URL: http://arxiv.org/abs/2512.21675v1
- Date: Thu, 25 Dec 2025 13:35:52 GMT
- Title: UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
- Authors: Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, Yihao Liu
- Abstract summary: We present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, and Structure & Texture. We develop a strong baseline, UniPercept, trained via Domain-Adaptive Pre-Training and Task-Aligned RL, enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks.
- Score: 46.88622050135756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their ability to perceive perceptual-level image features remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, and Structure & Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. Based on this foundation, we develop a strong baseline, UniPercept, trained via Domain-Adaptive Pre-Training and Task-Aligned RL, enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines Perceptual-Level Image Understanding in the era of MLLMs and, through the introduction of a comprehensive benchmark together with a strong baseline, provides a solid foundation for advancing perceptual-level multimodal image understanding.
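To make the "plug-and-play reward model" use concrete, here is a minimal sketch of best-of-n reranking driven by a perceptual rater. The `rate_image` interface, the domain keys, and the averaging rule are all assumptions for illustration; the abstract does not specify UniPercept's actual API.

```python
# Minimal sketch: a perceptual rater as a plug-and-play reward for
# text-to-image generation via best-of-n reranking.
from typing import Callable, Dict, List

Image = object  # stand-in for whatever image type the T2I pipeline returns

def perceptual_reward(ratings: Dict[str, float]) -> float:
    """Collapse per-domain Visual Rating scores (e.g. aesthetics, quality,
    structure/texture) into a single scalar reward by simple averaging.
    The domain keys and aggregation are illustrative assumptions."""
    return sum(ratings.values()) / len(ratings)

def best_of_n(
    prompt: str,
    generate: Callable[[str], Image],                 # any T2I sampler
    rate_image: Callable[[Image], Dict[str, float]],  # assumed rater interface
    n: int = 4,
) -> Image:
    """Sample n candidates and keep the one the rater scores highest."""
    candidates: List[Image] = [generate(prompt) for _ in range(n)]
    rewards = [perceptual_reward(rate_image(img)) for img in candidates]
    return candidates[max(range(n), key=rewards.__getitem__)]
```

In an RL fine-tuning setup, the same scalar could serve as the per-sample reward rather than a reranking criterion.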
Related papers
- Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models [9.24989979549793]
Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. These models typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. We introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency.
arXiv Detail & Related papers (2025-09-23T16:07:18Z)
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ-30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
- Ranking-aware adapter for text-driven image ordering with CLIP [76.80965830448781]
We propose an effective yet efficient approach that reframes the CLIP model into a learning-to-rank task. Our approach incorporates learnable prompts to adapt to new instructions for ranking purposes. Our ranking-aware adapter consistently outperforms fine-tuned CLIPs on various tasks.
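As a rough illustration of the learning-to-rank reframing, the sketch below shows a generic pairwise hinge objective over CLIP-style text-image similarities. The paper's learnable prompts and adapter design are not reproduced here; the batch layout and margin value are assumptions.

```python
# Minimal sketch: pairwise ranking objective on CLIP-style embeddings.
# Given a ranking instruction and two images where the first should rank
# higher, push its text-image similarity above the other's by a margin.
import torch
import torch.nn.functional as F

def pairwise_rank_loss(
    text_emb: torch.Tensor,  # (B, d) embeddings of the ranking instruction
    emb_hi: torch.Tensor,    # (B, d) images that should rank higher
    emb_lo: torch.Tensor,    # (B, d) images that should rank lower
    margin: float = 0.1,     # illustrative value, not from the paper
) -> torch.Tensor:
    sim_hi = F.cosine_similarity(text_emb, emb_hi, dim=-1)  # (B,)
    sim_lo = F.cosine_similarity(text_emb, emb_lo, dim=-1)  # (B,)
    # hinge: zero loss once the preferred image leads by at least the margin
    return F.relu(margin - (sim_hi - sim_lo)).mean()
```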
arXiv Detail & Related papers (2024-12-09T18:51:05Z)
- Emergent Visual-Semantic Hierarchies in Image-Text Representations [13.300199242824934]
We study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies.
We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding.
arXiv Detail & Related papers (2024-07-11T14:09:42Z)
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
- Cross-Modal Concept Learning and Inference for Vision-Language Models [31.463771883036607]
In existing fine-tuning methods, the class-specific text description is matched against the whole image.
We develop a new method called cross-modal concept learning and inference (CCLI).
Our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts.
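A minimal sketch of the concept-matching step follows, assuming CLIP-style encoders with a shared embedding space. The concept bank and the downstream inference procedure are the paper's own; only the generic image-to-concept similarity computation is shown here.

```python
# Minimal sketch: score an image against a bank of text-derived concepts
# instead of a single whole-image class description.
import torch
import torch.nn.functional as F

def concept_activations(
    image_feat: torch.Tensor,     # (d,) image embedding from a CLIP-style encoder
    concept_feats: torch.Tensor,  # (k, d) embeddings of k text concepts
) -> torch.Tensor:
    """Return a (k,) vector of cosine similarities between the image and
    each concept; downstream inference can combine these per-concept
    scores rather than relying on one image-to-description match."""
    image_feat = F.normalize(image_feat, dim=-1)
    concept_feats = F.normalize(concept_feats, dim=-1)
    return concept_feats @ image_feat
```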
arXiv Detail & Related papers (2023-07-28T10:26:28Z)