TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs
- URL: http://arxiv.org/abs/2504.07556v1
- Date: Thu, 10 Apr 2025 08:37:13 GMT
- Title: TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs
- Authors: Zijian Zhang, Xuhui Zheng, Xuecheng Wu, Chong Peng, Xuezhi Cao,
- Abstract summary: TokenFocus-VQA is a novel evaluation framework that leverages Large Vision-Language Models (LVLMs) through a visual question answering (VQA) paradigm with position-specific probability optimization. Our key innovation lies in designing a token-aware loss function that selectively focuses on probability distributions at pre-defined vocabulary positions corresponding to crucial semantic elements.
- Score: 12.08567890278249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While text-to-image (T2I) generation models have achieved remarkable progress in recent years, existing evaluation methodologies for vision-language alignment still struggle with fine-grained semantic matching. Current approaches based on global similarity metrics often overlook critical token-level correspondences between textual descriptions and visual content. To this end, we present TokenFocus-VQA, a novel evaluation framework that leverages Large Vision-Language Models (LVLMs) through a visual question answering (VQA) paradigm with position-specific probability optimization. Our key innovation lies in designing a token-aware loss function that selectively focuses on probability distributions at pre-defined vocabulary positions corresponding to crucial semantic elements, enabling precise measurement of fine-grained semantic alignment. The proposed framework further integrates ensemble learning techniques to aggregate multi-perspective assessments from diverse LVLM architectures, thereby achieving further performance gains. Evaluated on the NTIRE 2025 T2I Quality Assessment Challenge Track 1, TokenFocus-VQA ranks 2nd (0.8445, only 0.0001 below the 1st-place method) on the public evaluation and 2nd (0.8426) on the official private test set, demonstrating its superiority in capturing nuanced text-image correspondences compared to conventional evaluation methods.
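To make the scoring idea concrete, the following minimal sketch (not the authors' released code) illustrates position-specific probability readout and a simple multi-model aggregation: each LVLM is prompted with a yes/no alignment question, the next-token logits at the answer slot are renormalized over the candidate "yes"/"no" tokens, and the per-model scores are averaged. The torch calls are standard; the token IDs, the two-way renormalization, and the uniform weighting are illustrative assumptions, not details taken from the paper.

```python
# Minimal, hypothetical sketch of TokenFocus-style scoring: read the
# probability at a pre-defined answer-token position and aggregate
# scores from several LVLMs. Token IDs and weights are placeholders.
import torch
import torch.nn.functional as F


def token_focus_score(answer_logits: torch.Tensor, yes_id: int, no_id: int) -> torch.Tensor:
    """P('yes') renormalized over the {yes, no} candidate tokens.

    answer_logits: [vocab_size] next-token logits at the answer slot of a
    prompt such as "Does this image match the caption? Answer yes or no."
    """
    pair = torch.stack([answer_logits[yes_id], answer_logits[no_id]])
    return F.softmax(pair, dim=0)[0]


def ensemble_score(per_model_logits, yes_ids, no_ids, weights=None) -> torch.Tensor:
    """Weighted mean of per-model alignment scores (uniform by default)."""
    scores = torch.stack([
        token_focus_score(logits, y, n)
        for logits, y, n in zip(per_model_logits, yes_ids, no_ids)
    ])
    if weights is None:
        weights = torch.full_like(scores, 1.0 / len(scores))
    return (weights * scores).sum()


if __name__ == "__main__":
    # Random logits stand in for two LVLMs; 32000 is a placeholder vocab size,
    # and the yes/no IDs below are illustrative, not real tokenizer IDs.
    fake_logits = [torch.randn(32000), torch.randn(32000)]
    score = ensemble_score(fake_logits, yes_ids=[100, 200], no_ids=[300, 400])
    print(f"aggregated alignment score: {score.item():.4f}")
```

A training-time variant could replace the readout with a cross-entropy over the same candidate answer tokens, which is one plausible reading of the paper's token-aware loss; the exact formulation is defined in the paper itself.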
Related papers
- T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting [20.21019748095159]
Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions.
We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models.
arXiv Detail & Related papers (2025-02-28T01:09:18Z)
- What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness [30.44039177018447]
CAPability is a comprehensive benchmark for evaluating visual captioning across 12 dimensions spanning six critical views.
We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions.
arXiv Detail & Related papers (2025-02-19T07:55:51Z)
- Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition [61.956088652094515]
Vision and language models (VLMs) have showcased remarkable zero-shot recognition abilities.
But they face challenges in visio-linguistic compositionality, particularly in linguistic comprehension and fine-grained image-text alignment.
This paper explores the intricate relationship between compositionality and recognition.
arXiv Detail & Related papers (2024-06-13T17:58:39Z)
- DualFocus: Integrating Plausible Descriptions in Text-based Person Re-identification [6.381155145404096]
We introduce DualFocus, a unified framework that integrates plausible descriptions to enhance the interpretative accuracy of vision-language models in Person Re-identification tasks.
To achieve a balance between coarse and fine-grained alignment of visual and textual embeddings, we propose the Dynamic Tokenwise Similarity (DTS) loss.
In comprehensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid, DualFocus demonstrates superior performance over state-of-the-art methods.
arXiv Detail & Related papers (2024-05-13T04:21:00Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt-based techniques, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model [3.3772986620114387]
We introduce ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features.
Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with much fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
arXiv Detail & Related papers (2024-04-19T07:24:32Z)
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)