Putting Humans in the Image Captioning Loop
- URL: http://arxiv.org/abs/2306.03476v1
- Date: Tue, 6 Jun 2023 07:50:46 GMT
- Title: Putting Humans in the Image Captioning Loop
- Authors: Aliki Anagnostopoulou and Mareike Hartmann and Daniel Sonntag
- Abstract summary: We present work-in-progress on adapting an IC system to integrate human feedback.
Our approach builds on a base IC model pre-trained on the MS COCO dataset, which generates captions for unseen images.
We hope that this approach, while leading to improved results, will also result in customizable IC models.
- Score: 8.584932159968002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image Captioning (IC) models can benefit greatly from human feedback during training, especially in cases where data is limited. We present work-in-progress on adapting an IC system to integrate human feedback, with the goal of making it easily adaptable to user-specific data. Our approach builds on a base IC model pre-trained on the MS COCO dataset, which generates captions for unseen images. The user can then offer feedback on the image and the generated/predicted caption; this feedback is augmented to create additional training instances for adapting the model. The additional instances are integrated into the model through step-wise updates, and a sparse memory replay component is used to avoid catastrophic forgetting. We hope that this approach, while leading to improved results, will also yield customizable IC models.
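The adaptation loop described in the abstract can be sketched in a few lines. The following is a minimal illustration only, assuming a PyTorch-style captioning model; `augment_fn`, `loss_fn`, and the model/optimizer objects are hypothetical placeholders, not the authors' code.

```python
import random
import torch

def adapt_with_feedback(model, optimizer, feedback_pairs, memory,
                        augment_fn, loss_fn, replay_k=4):
    """One step-wise update from user-corrected captions, with sparse
    memory replay to mitigate catastrophic forgetting.

    feedback_pairs: list of (image_tensor, corrected_caption_ids) from the user.
    memory:         list of earlier (image_tensor, caption_ids) training instances.
    augment_fn:     produces extra instances from a single feedback pair
                    (e.g. paraphrased captions) -- a placeholder here.
    """
    # Augment the user feedback into additional training instances.
    batch = []
    for image, caption in feedback_pairs:
        batch.append((image, caption))
        batch.extend(augment_fn(image, caption))

    # Sparse replay: mix in a few stored examples from earlier data.
    if memory:
        batch.extend(random.sample(memory, min(replay_k, len(memory))))

    model.train()
    optimizer.zero_grad()
    loss = torch.stack([loss_fn(model, img, cap) for img, cap in batch]).mean()
    loss.backward()
    optimizer.step()

    # Keep the new feedback available for future replay.
    memory.extend(feedback_pairs)
    return loss.item()
```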
Related papers
- Enhancing Image Caption Generation Using Reinforcement Learning with Human Feedback [0.0]
We explore a potential method to improve the performance of a deep neural network model so that it generates captions preferred by humans.
This was achieved by integrating Supervised Learning and Reinforcement Learning with Human Feedback.
We provide a sketch of our approach and results, hoping to contribute to the ongoing advances in the field of human-aligned generative AI models.
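As a rough illustration of combining a learned preference reward with policy-gradient updates, the sketch below shows a toy REINFORCE-style step; the `captioner.sample` interface and `reward_model` are assumed placeholders, not the paper's implementation.

```python
import torch

def reinforce_caption_step(captioner, reward_model, optimizer, images):
    """Toy policy-gradient step: sample captions, score them with a learned
    human-preference reward model, and reinforce high-reward captions."""
    captioner.train()
    optimizer.zero_grad()

    # Hypothetical interface: sample() returns token ids and their log-probs.
    caption_ids, log_probs = captioner.sample(images)       # placeholder API
    with torch.no_grad():
        rewards = reward_model(images, caption_ids)          # scalar per caption
        baseline = rewards.mean()                            # simple variance reduction

    # REINFORCE objective: maximize expected reward.
    loss = -((rewards - baseline) * log_probs.sum(dim=-1)).mean()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```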
arXiv Detail & Related papers (2024-03-11T13:57:05Z)
- Scaling Laws of Synthetic Images for Model Training ... for Now [54.43596959598466]
We study the scaling laws of synthetic images generated by state-of-the-art text-to-image models.
We observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training.
arXiv Detail & Related papers (2023-12-07T18:59:59Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the target image from a database given a reference image and a textual modification.
Training such models typically requires annotated triplets; recent research sidesteps this need by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL).
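A schematic of this training-free recipe: caption the reference image, let a language model fold in the textual modification, and retrieve by text-to-image similarity. The `caption_fn`, `rewrite_fn`, and `text_embed_fn` helpers below are hypothetical stand-ins for off-the-shelf models (e.g. a captioner, an instruction-tuned LLM, and a CLIP-style text encoder), not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

def training_free_cir(ref_image, modifier_text, gallery_embeddings,
                      caption_fn, rewrite_fn, text_embed_fn):
    """Training-free compositional retrieval in the spirit of
    caption -> LLM rewrite -> text-to-image search.

    caption_fn:     image -> caption string            (off-the-shelf captioner)
    rewrite_fn:     (caption, modifier) -> target text (language model)
    text_embed_fn:  string -> embedding tensor         (text encoder)
    gallery_embeddings: (N, D) tensor of pre-computed image embeddings.
    """
    caption = caption_fn(ref_image)                    # describe the reference image
    target_text = rewrite_fn(caption, modifier_text)   # fold in the user's modification
    query = F.normalize(text_embed_fn(target_text), dim=-1)
    gallery = F.normalize(gallery_embeddings, dim=-1)
    scores = gallery @ query                           # cosine similarity to each image
    return torch.argsort(scores, descending=True)      # ranked gallery indices
```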
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
- Evaluating Data Attribution for Text-to-Image Models [62.844382063780365]
We evaluate attribution through "customization" methods, which tune an existing large-scale model toward a given exemplar object or style.
Our key insight is that this allows us to efficiently create synthetic images that are computationally influenced by the exemplar by construction.
By taking into account the inherent uncertainty of the problem, we can assign soft attribution scores over a set of training images.
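One generic way to read "soft attribution scores" is as a temperature-scaled softmax over raw per-training-image influence scores; the snippet below is an illustrative construction under that assumption, not the paper's exact procedure.

```python
import torch

def soft_attribution(influence_scores, temperature=1.0):
    """Turn raw per-training-image influence scores into a probability
    distribution, so uncertainty is spread over candidate images instead of
    forcing a single hard attribution."""
    return torch.softmax(torch.as_tensor(influence_scores) / temperature, dim=0)

# Example: three candidate training images with similar influence get
# comparable soft scores rather than a winner-take-all assignment.
print(soft_attribution([2.0, 1.8, 0.1]))
```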
arXiv Detail & Related papers (2023-06-15T17:59:51Z)
- Towards Adaptable and Interactive Image Captioning with Data Augmentation and Episodic Memory [8.584932159968002]
We present an IML pipeline for image captioning which allows us to incrementally adapt a pre-trained model to a new data distribution based on user input.
We find that, while data augmentation worsens results even when relatively small amounts of data are available, episodic memory is an effective strategy for retaining knowledge from previously seen clusters.
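A minimal episodic-memory buffer along these lines might keep a few examples per previously seen data cluster and sample across clusters during adaptation; the class below is a generic sketch under that assumption, not the authors' implementation.

```python
import random
from collections import defaultdict

class EpisodicMemory:
    """Tiny per-cluster replay buffer: keep a few examples from each
    previously seen data cluster and sample across clusters when adapting
    to a new one, so earlier knowledge is rehearsed."""

    def __init__(self, per_cluster=16):
        self.per_cluster = per_cluster
        self.store = defaultdict(list)

    def add(self, cluster_id, example):
        bucket = self.store[cluster_id]
        if len(bucket) < self.per_cluster:
            bucket.append(example)
        else:
            # Replace a random stored example to keep the bucket bounded.
            bucket[random.randrange(self.per_cluster)] = example

    def sample(self, k):
        pool = [ex for bucket in self.store.values() for ex in bucket]
        return random.sample(pool, min(k, len(pool)))
```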
arXiv Detail & Related papers (2023-06-06T08:38:10Z)
- HIVE: Harnessing Human Feedback for Instructional Visual Editing [127.29436858998064]
We present a novel framework to harness human feedback for instructional visual editing (HIVE).
Specifically, we collect human feedback on the edited images and learn a reward function to capture the underlying user preferences.
We then introduce scalable diffusion model fine-tuning methods that can incorporate human preferences based on the estimated reward.
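One simple way to fold an estimated human-preference reward into fine-tuning is to weight each example's loss by its exponentiated reward; the function below is a generic reward-weighted loss sketch, not HIVE's exact objective.

```python
import torch

def reward_weighted_loss(per_example_loss, rewards, beta=1.0):
    """Weight each training example's loss by an exponentiated reward, so
    edits that the learned reward model prefers contribute more to the update."""
    weights = torch.softmax(beta * rewards, dim=0) * rewards.numel()
    return (weights.detach() * per_example_loss).mean()

# Usage: per_example_loss would come from the editing model (e.g. a per-sample
# denoising loss), rewards from the learned human-preference model.
loss = reward_weighted_loss(torch.rand(8), torch.randn(8))
```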
arXiv Detail & Related papers (2023-03-16T19:47:41Z)
- ClipCrop: Conditioned Cropping Driven by Vision-Language Model [90.95403416150724]
We take advantage of vision-language models as a foundation for creating robust, user-intention-aware cropping algorithms.
We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance.
Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
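A schematic of query-conditioned cropping: score candidate crops against the user's text or image query with a CLIP-style encoder and keep the best one. The `crop_fn` and `image_embed_fn` helpers below are assumed placeholders, not ClipCrop's actual architecture.

```python
import torch
import torch.nn.functional as F

def pick_crop(image, candidate_boxes, query_embedding, crop_fn, image_embed_fn):
    """Choose the candidate crop whose embedding is most similar to the
    user's query (a text or image embedding from a CLIP-style model).

    crop_fn:        (image, box) -> cropped image
    image_embed_fn: cropped image -> 1-D embedding tensor
    """
    query = F.normalize(query_embedding, dim=-1)
    scores = []
    for box in candidate_boxes:
        emb = F.normalize(image_embed_fn(crop_fn(image, box)), dim=-1)
        scores.append(torch.dot(emb, query))          # cosine similarity to the query
    best = int(torch.argmax(torch.stack(scores)))
    return candidate_boxes[best]
```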
arXiv Detail & Related papers (2022-11-21T14:27:07Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)