Describe-to-Score: Text-Guided Efficient Image Complexity Assessment
- URL: http://arxiv.org/abs/2509.16609v1
- Date: Sat, 20 Sep 2025 10:17:25 GMT
- Title: Describe-to-Score: Text-Guided Efficient Image Complexity Assessment
- Authors: Shipeng Liu, Zhonglin Zhang, Dengfeng Chen, Liang Zhao
- Abstract summary: Accurately assessing image complexity (IC) is critical for computer vision. We introduce vision-text fusion for IC modeling. We propose the D2S (Describe-to-Score) framework, which generates image captions with a pre-trained vision-language model.
- Score: 5.744778242421451
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurately assessing image complexity (IC) is critical for computer vision, yet most existing methods rely solely on visual features and often neglect high-level semantic information, limiting their accuracy and generalization. We introduce vision-text fusion for IC modeling. This approach integrates visual and textual semantic features, increasing representational diversity. It also reduces the complexity of the hypothesis space, which enhances both accuracy and generalization in complexity assessment. We propose the D2S (Describe-to-Score) framework, which generates image captions with a pre-trained vision-language model. Through the proposed feature alignment and entropy distribution alignment mechanisms, D2S guides semantic information to inform complexity assessment while bridging the gap between the vision and text modalities. D2S utilizes multi-modal information during training but requires only the vision branch during inference, thereby avoiding multi-modal computational overhead and enabling efficient assessment. Experimental results demonstrate that D2S outperforms existing methods on the IC9600 dataset and remains competitive on no-reference image quality assessment (NR-IQA) benchmarks, validating the effectiveness and efficiency of multi-modal fusion in complexity-related tasks. Code is available at: https://github.com/xauat-liushipeng/D2S
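The train-with-text, infer-with-vision-only design is straightforward to prototype. Below is a minimal sketch of such a two-branch setup; the module names, dimensions, cosine alignment loss, and loss weighting are illustrative assumptions rather than the paper's actual implementation (the entropy distribution alignment mechanism is omitted for brevity).

```python
# Minimal sketch of a D2S-style two-branch model (hypothetical names/dims).
import torch
import torch.nn as nn
import torch.nn.functional as F

class D2SSketch(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=512, embed_dim=256):
        super().__init__()
        # Stand-ins for projections on top of a vision backbone and a
        # caption-text encoder (backbones themselves omitted).
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        self.score_head = nn.Linear(embed_dim, 1)  # complexity score

    def forward(self, vis_feat, txt_feat=None):
        v = self.vis_proj(vis_feat)
        score = self.score_head(v).squeeze(-1)
        if txt_feat is None:            # inference: vision branch only
            return score, None
        t = self.txt_proj(txt_feat)     # training: align vision to captions
        align = 1 - F.cosine_similarity(v, t, dim=-1).mean()
        return score, align

# Toy training step: regression loss plus feature-alignment loss.
model = D2SSketch()
vis = torch.randn(8, 768)    # features from a vision encoder
txt = torch.randn(8, 512)    # features from captions of the same images
target = torch.rand(8)       # ground-truth complexity scores
score, align = model(vis, txt)
loss = F.mse_loss(score, target) + 0.1 * align  # 0.1 is an assumed weight
loss.backward()
```

At inference time only `model(vis)` is called, so the text encoder and caption generation are never executed, which is what makes the assessment efficient.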
Related papers
- Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment [51.40989269202702]
The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. We propose ArtQuant, an aesthetics assessment framework for artistic images which couples isolated aesthetic dimensions through description generation. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of the conventional training epochs.
arXiv Detail & Related papers (2025-12-29T12:18:26Z) - Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation [0.0]
This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures. The proposed model incorporates a rectified flow mechanism that connects noise and data with linear paths, enabling efficient and high-quality generation. The framework achieves unparalleled fidelity in synthesized images and coherent multimodal representations.
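The rectified flow mechanism mentioned above has a simple canonical training form: a network is taught to predict the constant velocity along the straight line between a data sample and Gaussian noise. The following generic sketch illustrates that objective; the toy velocity network and dimensions are assumptions, not this paper's model.

```python
import torch

class ToyVelocityNet(torch.nn.Module):
    """Placeholder velocity predictor taking (x_t, t) as input."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Linear(dim + 1, dim)

    def forward(self, xt, t):
        return self.net(torch.cat([xt, t], dim=-1))

def rectified_flow_loss(model, x0):
    """One step of a generic rectified-flow objective.

    x0 is clean data; the model regresses the constant velocity x1 - x0
    along the straight path x_t = (1 - t) * x0 + t * x1.
    """
    x1 = torch.randn_like(x0)           # noise endpoint
    t = torch.rand(x0.shape[0], 1)      # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1          # linear interpolation
    v_target = x1 - x0                  # straight-line velocity
    return ((model(xt, t) - v_target) ** 2).mean()

model = ToyVelocityNet(dim=16)
loss = rectified_flow_loss(model, torch.randn(8, 16))
loss.backward()
```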
arXiv Detail & Related papers (2025-12-14T08:28:50Z) - DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction [1.771934382051849]
We propose a vision-only model that fuses self-supervised and convolutional representations to predict image complexity. DReX achieves state-of-the-art performance on the IC9600 benchmark. Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction.
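The abstract does not specify DReX's backbones or fusion design, but the general pattern of fusing two feature streams into a complexity regressor can be sketched as follows; the concatenation fusion, dimensions, and head architecture are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    """Toy fusion head: concatenate a self-supervised feature stream with a
    convolutional one and regress a scalar complexity score. A stand-in for
    DReX's actual architecture."""
    def __init__(self, ssl_dim=768, cnn_dim=2048):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(ssl_dim + cnn_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, ssl_feat, cnn_feat):
        return self.head(torch.cat([ssl_feat, cnn_feat], dim=-1)).squeeze(-1)

# e.g. ssl_feat from a ViT-style self-supervised encoder, cnn_feat from a CNN
scores = FusionRegressor()(torch.randn(4, 768), torch.randn(4, 2048))
```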
arXiv Detail & Related papers (2025-11-21T06:57:33Z) - Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents [99.62178668680578]
We propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images. To capture complex cross-modal relationships in web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments.
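A snippet-level objective of this kind can be sketched as an InfoNCE loss in which consecutive segments from the same document form positive pairs; VC2L's actual loss, encoder, and pairing scheme may differ from this generic version.

```python
import torch
import torch.nn.functional as F

def consecutive_segment_infonce(z, temperature=0.07):
    """Generic InfoNCE over consecutive segment pairs.

    z: (N, D) embeddings of N consecutive rendered segments from one
    document; segment i is treated as the positive of segment i + 1,
    with all other segments serving as in-batch negatives.
    """
    a = F.normalize(z[:-1], dim=-1)     # anchors: segments 0 .. N-2
    b = F.normalize(z[1:], dim=-1)      # positives: segments 1 .. N-1
    logits = a @ b.t() / temperature    # all-pairs cosine similarities
    labels = torch.arange(a.size(0))    # diagonal = true consecutive pair
    return F.cross_entropy(logits, labels)

loss = consecutive_segment_infonce(torch.randn(16, 256))
```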
arXiv Detail & Related papers (2025-10-21T14:59:29Z) - Text-Visual Semantic Constrained AI-Generated Image Quality Assessment [47.575342788480505]
We propose a unified framework, SC-AGIQA, to enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the challenges of consistency and distortion assessment by introducing two core modules. Tests conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2025-07-14T16:21:05Z) - Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring [26.174094671736686]
We propose a novel quality-driven data selection pipeline for visual instruction tuning datasets. It integrates a cross-modality assessment framework that first assigns each data entry to its appropriate vision-language task. It then generates general and task-specific captions and evaluates the alignment, clarity, task rarity, text coherence, and image clarity of each entry.
arXiv Detail & Related papers (2025-06-10T04:04:58Z) - Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying-semantics learning, computational efficiency, and in-context learning capabilities. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high noise levels.
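Sampling training timesteps more densely at high noise levels can be done by skewing the timestep distribution; the sketch below is one generic way to do so, not the paper's ESS schedule, and the bias parameter is an assumption.

```python
import torch

def sample_timesteps(batch_size, T=1000, bias=2.0):
    """Draw diffusion timesteps with density skewed toward high noise.

    A generic sketch of 'denser sampling at high-noise time steps';
    bias > 1 concentrates samples near t = T (high noise in the usual
    DDPM indexing), bias = 1 recovers uniform sampling.
    """
    u = torch.rand(batch_size)
    return (u ** (1.0 / bias) * T).long().clamp(max=T - 1)

t = sample_timesteps(64)  # mostly large (noisy) timesteps
```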
arXiv Detail & Related papers (2025-03-06T03:06:22Z) - Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Iris introduces Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). It achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations. These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address the mismatch between CLIP's generic image-text pre-training and the IQA task using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
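At inference, text-based person search reduces to cross-modal retrieval: embed the text query and the gallery images in a shared space and rank by similarity. The following minimal dual-encoder retrieval sketch illustrates that step only; it is not the paper's dual Transformer, and the embeddings are placeholders.

```python
import torch
import torch.nn.functional as F

# Hypothetical pre-computed embeddings from separate image/text encoders,
# already projected into a shared space.
gallery = F.normalize(torch.randn(1000, 256), dim=-1)  # image gallery
query = F.normalize(torch.randn(1, 256), dim=-1)       # descriptive text query

sims = (query @ gallery.t()).squeeze(0)  # cosine similarity to every image
topk = sims.topk(5).indices              # indices of the best-matched persons
```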
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking [0.5242869847419834]
We propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy.
To encourage the generated candidate embeddings to capture various semantic variations, we construct a mixed distribution.
We compare the performance with existing set-based methods using four image feature encoders and two text feature encoders on three benchmark datasets.
arXiv Detail & Related papers (2023-09-15T04:39:11Z)