Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data
- URL: http://arxiv.org/abs/2507.08610v1
- Date: Fri, 11 Jul 2025 14:08:36 GMT
- Title: Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data
- Authors: Parag Dutta, Ambedkar Dukkipati
- Abstract summary: We propose LoGIC, a Multi-agent Reinforcement Learning game. We train agents in the cooperative common-reward setting using the GRPO algorithm. We show that using pre-trained VLMs as the 'speaker' and Large Language Model (LLM) for language understanding in the 'listener', we achieved a $46$ BLEU score.
- Score: 10.91762734823246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning is an important problem in developing various AI systems, and these tasks require large volumes of annotated images to train the models. Since all existing labelled datasets are already used for training the large Vision Language Models (VLMs), it becomes challenging to improve their performance further. It is therefore essential to consider unsupervised image captioning performance, which remains relatively under-explored. To that end, we propose LoGIC (Lewis Communication Game for Image Captioning), a Multi-agent Reinforcement Learning game. The proposed method consists of two agents, a 'speaker' and a 'listener', with the objective of learning a strategy for communicating in natural language. We train agents in the cooperative common-reward setting using the GRPO algorithm and show that improvement in image captioning performance emerges as a consequence of the agents learning to play the game. We show that using pre-trained VLMs as the 'speaker' and a Large Language Model (LLM) for language understanding in the 'listener', we achieved a $46$ BLEU score after fine-tuning using LoGIC without additional labels, a $2$ units advantage in absolute metrics compared to the $44$ BLEU score of the vanilla VLM. Additionally, we replace the VLM in the 'speaker' with lightweight components: (i) a ViT for image perception and (ii) a GPT2 for language generation, and train them from scratch using LoGIC, obtaining a $31$ BLEU score in the unsupervised setting, a $10$ points advantage over existing unsupervised image-captioning methods.
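The cooperative common-reward setup and the GRPO-style update mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not describe the listener's task or the reward, so the "pick the target image among candidates" rule, the function names `common_reward` and `group_relative_advantages`, and the use of the population standard deviation are all assumptions made for illustration.

```python
import statistics


def common_reward(target_index: int, listener_choice: int) -> float:
    """Assumed shared reward: 1.0 if the listener picked the target image, else 0.0."""
    return 1.0 if listener_choice == target_index else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: centre and scale each reward within its sampled group.

    Population standard deviation is an assumption; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical round: the speaker samples four captions for one image, and the
# listener picks a candidate image for each caption (indices are made up).
target_index = 2
listener_choices = [2, 0, 2, 1]

rewards = [common_reward(target_index, c) for c in listener_choices]
advantages = group_relative_advantages(rewards)

for caption_id, (r, a) in enumerate(zip(rewards, advantages)):
    print(f"caption {caption_id}: reward={r:.1f} advantage={a:+.3f}")
```

In this sketch, captions that let the listener identify the target get a positive advantage and the others a negative one, so a policy-gradient step would push the speaker toward more discriminative captions without needing any labelled captions.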
Related papers
- TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z)
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
- Beyond Text: Frozen Large Language Models in Visual Signal Comprehension [34.398976855955404]
Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model.
We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration.
arXiv Detail & Related papers (2024-03-12T17:59:51Z)
- Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [31.281236193979165]
We study the task of extending the large language model (LLM) into a vision-language instruction-following model.
Existing methods typically train a visual adapter to align the representation between a pre-trained vision transformer (ViT) and the LLM by a generative image captioning loss.
We propose CG-VLM that applies Contrastive and Generative alignment objectives to effectively align the representation of ViT and LLM.
arXiv Detail & Related papers (2023-11-29T03:29:46Z)
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation.
Our method demonstrates advanced performance over the state-of-the-arts with various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z)
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs [101.50522135049198]
BuboGPT is a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language.
Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds corresponding masks in the image.
Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during interaction with humans.
arXiv Detail & Related papers (2023-07-17T15:51:47Z)
- Bootstrapping Vision-Language Learning with Decoupled Language Pre-training [46.570154746311935]
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training.
Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features.
Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
arXiv Detail & Related papers (2023-07-13T21:08:15Z)
- mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs [50.17767479660832]
Vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to 'understand' the image input.
We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware.
arXiv Detail & Related papers (2023-07-13T17:51:58Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [86.4572981982407]
We propose BLIP, a new vision-language framework which transfers flexibly to both vision-language understanding and generation tasks.
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones.
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
arXiv Detail & Related papers (2022-01-28T12:49:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.