Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
- URL: http://arxiv.org/abs/2603.01696v1
- Date: Mon, 02 Mar 2026 10:24:41 GMT
- Title: Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
- Authors: Haonan Jia, Shichao Dong, Xin Dong, Zenghui Sun, Jin Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang
- Abstract summary: Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss forces LVLMs to focus on image details to generate precise descriptions. We propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations.
- Score: 20.275550783343107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss forces LVLMs to attend to image details and generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity among the images retrieved via text search using that caption. Based on this insight, we propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, the LVLM minimizes information loss and aims to achieve an identity mapping from images to captions. Experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. In particular, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B. The code will be released when the paper is accepted.
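The abstract does not spell out how the two reward terms are computed; as a rough, hypothetical sketch (CLIP-style normalized embeddings as a stand-in encoder, with the retrieval depth and equal weighting as assumptions), a retrieval-based caption reward in this spirit could look like:

```python
# Hypothetical retrieval-based caption reward in the spirit of CIM; the paper's
# exact formulation of both metrics may differ.
import torch

def caption_reward(caption_emb, query_img_emb, gallery_embs, top_k=5):
    """caption_emb: (d,), query_img_emb: (d,), gallery_embs: (n, d); all L2-normalized."""
    # Retrieve the top-k gallery images by caption-to-image similarity.
    sims = gallery_embs @ caption_emb                    # (n,)
    top = gallery_embs[sims.topk(top_k).indices]         # (k, d)

    # Gallery Representation Consistency: the retrieved images should agree
    # with one another (mean pairwise cosine similarity, diagonal excluded).
    pairwise = top @ top.T                               # (k, k)
    k = top.size(0)
    consistency = (pairwise.sum() - k) / (k * (k - 1))

    # Query-gallery Image Relevance: the retrieved images should match the
    # query image the caption was written for.
    relevance = (top @ query_img_emb).mean()

    return 0.5 * consistency + 0.5 * relevance           # equal weights are an assumption
```

A reward of this shape could then drive a standard policy-gradient update on the captioner; the weighting and retrieval depth above are guesses, not the paper's settings.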
Related papers
- Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [51.93737995405164]
Large Vision-Language Models (LVLMs) are susceptible to hallucinations. We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy. We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
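The summary gives only the idea of C-PMI calibration; a minimal sketch of PMI-style rescoring (the function name and the weight lam are assumptions, and the paper's conditional formulation is likely more involved) would contrast image-conditioned and text-only next-token distributions:

```python
# Illustrative PMI-style calibration of next-token scores, not the paper's exact C-PMI.
import numpy as np

def pmi_calibrated_scores(logits_with_image, logits_text_only, lam=0.5):
    """Both inputs: (vocab,) next-token logits; lam weights the calibration term."""
    logp_cond = logits_with_image - np.logaddexp.reduce(logits_with_image)  # log p(y | image, ctx)
    logp_text = logits_text_only - np.logaddexp.reduce(logits_text_only)    # log p(y | ctx)
    # Tokens the model would emit regardless of the image are down-weighted,
    # which is the usual mutual-information intuition for curbing hallucination.
    return logp_cond + lam * (logp_cond - logp_text)
```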
arXiv Detail & Related papers (2025-05-26T08:36:10Z) - Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
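As a hypothetical illustration of the LLM-MLLM collaboration described above (all three agents are placeholder callables, not the paper's prompts or models):

```python
# Sketch of a caption-correction loop: an LLM splits the caption into atomic
# claims, an MLLM verifies each against the image, and the LLM rewrites.
from typing import Callable, List

def correct_caption(
    caption: str,
    image,                                             # opaque handle passed to the verifier
    split_into_claims: Callable[[str], List[str]],     # LLM: caption -> atomic claims
    verify_claim: Callable[[object, str], bool],       # MLLM: (image, claim) -> supported?
    rewrite: Callable[[str, List[str]], str],          # LLM: repair unsupported claims
) -> str:
    claims = split_into_claims(caption)
    unsupported = [c for c in claims if not verify_claim(image, c)]
    return caption if not unsupported else rewrite(caption, unsupported)
```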
arXiv Detail & Related papers (2024-12-20T01:37:22Z) - Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning [77.2852342808769]
In this paper, we introduce a detailed caption benchmark, termed CompreCap, to evaluate the visual context from a directed scene graph view. We first manually segment the image into semantically meaningful regions according to a common-object vocabulary, while also distinguishing the attributes of objects within all those regions. Then directional relation labels of these objects are annotated to compose a directed scene graph that encodes the rich compositional information of the image.
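A toy encoding of such a directed scene graph, with a naive coverage check against a caption (CompreCap's real annotation schema and scoring are richer than this sketch):

```python
# Minimal directed scene graph: objects carry attributes, relations are
# directed subject --predicate--> object edges.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SceneObject:
    name: str
    attributes: List[str] = field(default_factory=list)

@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str

def caption_coverage(caption: str, objects: List[SceneObject], relations: List[Relation]) -> float:
    """Fraction of graph units (objects, attributes, relations) a caption mentions."""
    text = caption.lower()
    units = [o.name for o in objects]
    units += [a for o in objects for a in o.attributes]
    units += [f"{r.subject} {r.predicate} {r.obj}" for r in relations]
    hit = sum(all(tok in text for tok in u.lower().split()) for u in units)
    return hit / max(len(units), 1)
```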
arXiv Detail & Related papers (2024-12-11T18:37:42Z) - Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions [31.637204677787576]
We introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data to the model's existing knowledge and visual understanding. KnowAda minimizes hallucinations while preserving high descriptiveness. Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations.
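A hypothetical dataset-level sketch of this adaptation, reusing the claim-splitting idea from the correction loop above (all helpers are stand-ins; KnowAda's actual probing of model knowledge is not reproduced here):

```python
# Keep only the caption details the target model can already ground, so
# fine-tuning does not teach it to assert things it cannot verify.
from typing import Callable, Iterable, List, Tuple

def adapt_dataset(
    pairs: Iterable[Tuple[object, str]],               # (image, caption) training pairs
    split_into_claims: Callable[[str], List[str]],
    model_can_ground: Callable[[object, str], bool],   # VQA-style probe of the model
    join_claims: Callable[[List[str]], str],
) -> List[Tuple[object, str]]:
    adapted = []
    for image, caption in pairs:
        kept = [c for c in split_into_claims(caption) if model_can_ground(image, c)]
        if kept:                                       # drop pairs with nothing groundable
            adapted.append((image, join_claims(kept)))
    return adapted
```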
arXiv Detail & Related papers (2024-11-13T20:50:04Z) - The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge [19.339645217996235]
We propose a solution for improving the quality of captions generated for figures in papers.
Our approach ranked first in the final test with a score of 4.49.
arXiv Detail & Related papers (2024-03-26T03:03:50Z) - Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
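One way to read "subregion feature aggregation" is local pooling over a grid of patch features before text matching; the sketch below makes that concrete under assumed pooling and window choices:

```python
# Rough sketch of subregion feature aggregation over a grid of local (e.g.,
# CLIP patch) features; the pooling operator and window size are assumptions.
import torch
import torch.nn.functional as F

def aggregate_subregions(patch_feats: torch.Tensor, window: int = 3) -> torch.Tensor:
    """patch_feats: (H, W, d) -> (H*W, d); each row pools a window x window neighborhood."""
    h, w, d = patch_feats.shape
    grid = patch_feats.permute(2, 0, 1).unsqueeze(0)           # (1, d, H, W)
    pooled = F.avg_pool2d(grid, window, stride=1, padding=window // 2)
    regions = pooled.squeeze(0).permute(1, 2, 0).reshape(h * w, d)
    return F.normalize(regions, dim=-1)                        # ready for text matching
```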
arXiv Detail & Related papers (2024-01-04T16:43:46Z) - Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
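A hypothetical sketch of how a sentence-level prompt could be composed with the relative caption into a single text query (the template and helper names are guesses; BLIP-2-style components are stand-ins):

```python
# Compose a CIR text query from a generated sentence-level prompt plus the
# user's relative caption, then encode it for retrieval.
from typing import Callable

def compose_cir_query(
    reference_image,
    relative_caption: str,
    captioner: Callable[[object], str],       # image -> sentence-level prompt
    text_encoder: Callable[[str], object],    # text -> retrieval embedding
):
    prompt = captioner(reference_image)       # e.g. "a red dress with short sleeves"
    query_text = f"{prompt}, but {relative_caption}"
    return text_encoder(query_text)
```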
arXiv Detail & Related papers (2023-10-09T07:31:44Z) - Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [8.526212812623202]
State-of-the-art (SoTA) image captioning models are often trained on the Microsoft Common Objects in Context (MS COCO) dataset. We present a novel approach to generate richer and more informative image captions by combining the captions generated from different SoTA captioning models.
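As a hedged sketch of the rank-then-fuse idea (scoring function, prompt wording, and top_k are illustrative only):

```python
# Rank candidate captions by image-text similarity, then ask an LLM to merge
# the best ones into a single richer caption.
from typing import Callable, List

def fuse_captions(
    image,
    candidates: List[str],
    clip_score: Callable[[object, str], float],   # image-text similarity, e.g. CLIPScore
    llm: Callable[[str], str],
    top_k: int = 3,
) -> str:
    ranked = sorted(candidates, key=lambda c: clip_score(image, c), reverse=True)
    bullets = "\n".join(f"- {c}" for c in ranked[:top_k])
    prompt = (
        "Merge the following captions of one image into a single caption that "
        f"keeps every distinct detail and adds nothing new:\n{bullets}"
    )
    return llm(prompt)
```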
arXiv Detail & Related papers (2023-06-20T15:13:02Z) - FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts.
Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model.
We release this large-scale dataset of enriched image-caption pairs for the community.
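A toy illustration of assembling the fusion input from frozen vision experts (field names and prompt wording are assumptions, not FuseCap's actual format):

```python
# Serialize vision-expert outputs alongside the original caption so an LLM
# can fuse them into an enriched caption.
from typing import Dict, List

def build_fusion_prompt(original_caption: str, expert_outputs: Dict[str, List[str]]) -> str:
    lines = [f"Original caption: {original_caption}"]
    for expert, findings in expert_outputs.items():    # e.g. "detector", "ocr", "attributes"
        lines.append(f"{expert}: {', '.join(findings)}")
    lines.append("Rewrite the caption to include all consistent visual details above.")
    return "\n".join(lines)
```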
arXiv Detail & Related papers (2023-05-28T13:16:03Z) - Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching between the natural language expression and the image.
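A toy two-stage enhancement block in this spirit (TV-Net's actual architecture differs in detail; this only illustrates refining visual features twice against the expression):

```python
# Two sequential cross-attention stages: visual features attend to the
# language expression coarsely, then again for fine-grained matching.
import torch
import torch.nn as nn

class TwoStageEnhancer(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.stage1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.stage2 = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        """visual: (B, HW, d) patch features; text: (B, T, d) word features."""
        coarse, _ = self.stage1(visual, text, text)    # stage 1: coarse grounding
        v = visual + coarse
        fine, _ = self.stage2(v, text, text)           # stage 2: fine-grained matching
        return v + fine
```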
arXiv Detail & Related papers (2021-10-09T02:53:39Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learned metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
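I2CE's training objective is not reproduced here, but the flavor of a learned, embedding-based caption metric can be sketched with an off-the-shelf sentence encoder as a stand-in:

```python
# Score a candidate caption by its best embedding similarity to the references;
# a stand-in for a learned metric, not I2CE itself.
from sentence_transformers import SentenceTransformer, util

def semantic_caption_score(candidate: str, references: list) -> float:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    cand = model.encode(candidate, convert_to_tensor=True, normalize_embeddings=True)
    refs = model.encode(references, convert_to_tensor=True, normalize_embeddings=True)
    return util.cos_sim(cand, refs).max().item()       # best match over references
```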
arXiv Detail & Related papers (2020-12-14T08:36:05Z)