VLRM: Vision-Language Models act as Reward Models for Image Captioning
- URL: http://arxiv.org/abs/2404.01911v1
- Date: Tue, 2 Apr 2024 12:57:22 GMT
- Title: VLRM: Vision-Language Models act as Reward Models for Image Captioning
- Authors: Maksim Dzabraev, Alexander Kunitsyn, Andrei Ivaniuta
- Abstract summary: We present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning and vision-language models like CLIP and BLIP2-ITM as reward models.
Our model reaches an impressive 0.90 R@1 CLIP Recall score on the MS-COCO Karpathy Test Split.
- Score: 45.59831141171801
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning and vision-language models like CLIP and BLIP2-ITM as reward models. The RL-tuned model is able to generate longer and more comprehensive descriptions. Our model reaches an impressive 0.90 R@1 CLIP Recall score on the MS-COCO Karpathy Test Split. Weights are available at https://huggingface.co/sashakunitsyn/vlrm-blip2-opt-2.7b.
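Below is a minimal sketch of the reward idea, assuming CLIP image-text similarity scored through Hugging Face transformers; the checkpoint and reward shaping are illustrative, not the authors' exact pipeline (BLIP2-ITM can stand in as the scorer in the same way).

```python
# Minimal sketch: score candidate captions with CLIP image-text
# similarity, the kind of signal usable as an RL reward for a captioner.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, captions: list[str]) -> torch.Tensor:
    """Return one cosine-similarity reward per candidate caption."""
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (txt @ img.T).squeeze(-1)  # shape: (num_captions,)

# Usage: rewards = clip_reward(Image.open("photo.jpg"),
#                              ["a dog on grass", "a cat indoors"])
```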
Related papers
- Self-Rewarding Language Models [105.6830788170348]
We study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training.
We show that during Iterative DPO training, not only does instruction-following ability improve, but so does the model's ability to provide high-quality rewards to itself.
arXiv Detail & Related papers (2024-01-18T14:43:47Z)
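A minimal sketch of the LLM-as-a-Judge self-reward loop summarized above, assuming a generic `generate` callable; the judging prompt is illustrative, not the paper's template.

```python
# Sketch: the model grades its own responses via a judging prompt.
JUDGE_PROMPT = """Review the response below and rate it from 0 to 5 for
helpfulness, correctness, and clarity. Reply with the number only.

Instruction: {instruction}
Response: {response}
Score:"""

def self_reward(generate, instruction: str, response: str) -> float:
    """Ask the model to grade its own response and parse the score."""
    raw = generate(JUDGE_PROMPT.format(instruction=instruction,
                                       response=response))
    try:
        return float(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # unparseable judgments get the minimum score

# Highest- vs. lowest-scored responses form preference pairs for DPO,
# and the loop is repeated for iterative training.
```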
- Linear Alignment of Vision-language Models for Image Captioning [9.746397419479447]
We propose a more efficient training protocol that fits a linear mapping between image and text embeddings of CLIP.
This bypasses the need for gradient computation and results in a lightweight captioning method called ReCap.
We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT.
arXiv Detail & Related papers (2023-07-10T17:59:21Z)
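Because the mapping above is linear, it can be fit in closed form, which is why no gradient computation is needed. A sketch assuming paired CLIP image/text embeddings as NumPy arrays (names illustrative):

```python
# Sketch: closed-form least-squares fit of a linear map from CLIP image
# embeddings to CLIP text embeddings, as in ReCap's training protocol.
import numpy as np

def fit_linear_map(image_embs: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Solve W = argmin_W ||image_embs @ W - text_embs||^2."""
    W, *_ = np.linalg.lstsq(image_embs, text_embs, rcond=None)
    return W  # shape: (d_image, d_text)

# Usage with paired embeddings of shape (N, d):
# W = fit_linear_map(img_embs, txt_embs)
# query_in_text_space = query_img_emb @ W
```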
- Zero-shot Visual Question Answering with Language Model Feedback [83.65140324876536]
We propose a language model guided captioning approach, LAMOC, for knowledge-based visual question answering (VQA).
Our approach uses the captions generated by a captioning model as the context for an answer prediction model, which is a pre-trained language model (PLM).
arXiv Detail & Related papers (2023-05-26T15:04:20Z)
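A minimal sketch of the caption-as-context pattern behind LAMOC; both model callables and the prompt format are assumptions for illustration.

```python
# Sketch: captions from a captioning model become the textual context
# for a PLM that answers the question (no VQA training pairs needed).
def answer_with_captions(caption_model, plm, image, question: str) -> str:
    captions = caption_model(image, num_captions=3)  # hypothetical signature
    context = " ".join(captions)
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return plm(prompt)
```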
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
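For intuition, the simplest image-to-video transfer encodes frames independently and pools over time; the paper's dedicated temporal modeling mechanism is more elaborate, so mean pooling below is only a baseline stand-in (the encoder callable is illustrative).

```python
# Sketch: per-frame CLIP features pooled over time into one video embedding.
import torch

def video_embedding(clip_image_encoder, frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) preprocessed frames -> one (D,) video embedding."""
    with torch.no_grad():
        frame_embs = clip_image_encoder(frames)  # assumed to return (T, D)
    frame_embs = frame_embs / frame_embs.norm(dim=-1, keepdim=True)
    return frame_embs.mean(dim=0)  # temporal mean pooling
```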
- Model Extraction Attack against Self-supervised Speech Models [52.81330435990717]
Self-supervised learning (SSL) speech models generate meaningful representations of given clips.
Model extraction attack (MEA) often refers to an adversary stealing the functionality of a victim model with only query access.
We study the MEA problem against SSL speech models with a small number of queries.
arXiv Detail & Related papers (2022-11-29T09:28:05Z)
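A minimal sketch of such a query-limited attack: the adversary queries the victim once per clip, then distills the collected representations into a student; all model, data, and optimizer objects are placeholders.

```python
# Sketch: distill a victim SSL speech model's representations into a student.
import torch
import torch.nn.functional as F

def extract(victim, student, clips, optimizer, epochs: int = 10):
    with torch.no_grad():
        targets = [victim(c) for c in clips]  # one query per clip
    for _ in range(epochs):
        for clip, target in zip(clips, targets):
            loss = F.mse_loss(student(clip), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```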
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
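A rough sketch of the prompt-based pattern, assuming a frozen VL-PTM that continues a text prompt conditioned on the image; the `generate` call and the template are hypothetical, not the paper's scheme.

```python
# Sketch: a frozen VL-PTM turns a prompt plus an image into a pseudo
# caption, supplying supervision without paired ground truth.
PROMPT = "a photo of"  # illustrative template

def pseudo_caption(vl_ptm, image) -> str:
    return vl_ptm.generate(image, prompt=PROMPT)  # hypothetical API
```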
- A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision [32.4697157553247]
In this paper, we focus on teasing out what parts of the language supervision are essential for training zero-shot image classification models.
A simple Bag-of-Words (BoW) caption could be used as a replacement for most of the image captions in the dataset.
Using a BoW pretrained model, we can obtain more training data by generating pseudo-BoW captions on images that do not have a caption.
arXiv Detail & Related papers (2021-12-27T20:02:10Z)
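A minimal sketch of the Bag-of-Words conversion the paper studies: word order and function words are dropped from a natural caption; the stopword list and tokenization are simplified for illustration.

```python
# Sketch: reduce a natural caption to a Bag-of-Words caption.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "on", "in", "and", "with"}

def to_bow_caption(caption: str) -> str:
    """Keep unique content words, discarding order and function words."""
    words = [w for w in caption.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(set(words)))

print(to_bow_caption("A dog is playing with a ball on the grass"))
# -> "ball dog grass playing"
```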
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.