Enhancing Image Caption Generation Using Reinforcement Learning with
Human Feedback
- URL: http://arxiv.org/abs/2403.06735v1
- Date: Mon, 11 Mar 2024 13:57:05 GMT
- Authors: Adarsh N L, Arun P V, Aravindh N L
- Abstract summary: We explore a method for improving a deep neural network model's ability to generate captions that humans prefer.
This was achieved by integrating Supervised Learning and Reinforcement Learning with Human Feedback (RLHF).
We provide a sketch of our approach and results, hoping to contribute to ongoing advances in human-aligned generative AI models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research on generative models to produce human-aligned / human-preferred
outputs has seen significant recent contributions. Between text and
image-generative models, we narrowed our focus to text-based generative models,
particularly to produce captions for images that align with human preferences.
In this research, we explored a potential method to amplify the performance of
the Deep Neural Network Model to generate captions that are preferred by
humans. This was achieved by integrating Supervised Learning and Reinforcement
Learning with Human Feedback (RLHF) using the Flickr8k dataset. Also, a novel
loss function that is capable of optimizing the model based on human feedback
is introduced. In this paper, we provide a concise sketch of our approach and
results, hoping to contribute to the ongoing advances in the field of
human-aligned generative AI models.
Related papers
- Detecting Human Artifacts from Text-to-Image Models [16.261759535724778]
This dataset contains images containing a human body.
It includes images of poorly generated human bodies, with distorted or missing body parts.
arXiv Detail & Related papers (2024-11-21T05:02:13Z)
- Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models [46.09562860220433]
We introduce GazeReward, a novel framework that integrates implicit feedback, specifically eye-tracking (ET) data, into the Reward Model (RM).
Our approach significantly improves the accuracy of the RM on established human preference datasets.
arXiv Detail & Related papers (2024-10-02T13:24:56Z)
- Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback [5.9726297901501475]
We introduce a novel approach tailored specifically for human image generation utilizing Direct Preference Optimization (DPO).
Specifically, we introduce an efficient method for constructing a specialized DPO dataset for training human image generation models without the need for costly human feedback.
Our method demonstrates its versatility and effectiveness in generating human images, including personalized text-to-image generation.
arXiv Detail & Related papers (2024-05-30T16:18:05Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models. The preference prediction accuracy of VP-Score is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
- RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the use of models that process text input and generate high-fidelity images based on text descriptions.
Diffusion models are one prominent type of generative model, generating images through the systematic introduction of noise over repeated steps.
In the era of large models, scaling up model size and the integration with large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for
arXiv Detail & Related papers (2023-05-01T17:36:06Z)
- Aligning Text-to-Image Models using Human Feedback [104.76638092169604]
Current text-to-image models often generate images that are inadequately aligned with text prompts.
We propose a fine-tuning method for aligning such models using human feedback.
Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
arXiv Detail & Related papers (2023-02-23T17:34:53Z)
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.