Towards Adaptable and Interactive Image Captioning with Data
Augmentation and Episodic Memory
- URL: http://arxiv.org/abs/2306.03500v1
- Date: Tue, 6 Jun 2023 08:38:10 GMT
- Title: Towards Adaptable and Interactive Image Captioning with Data
Augmentation and Episodic Memory
- Authors: Aliki Anagnostopoulou and Mareike Hartmann and Daniel Sonntag
- Abstract summary: We present an IML pipeline for image captioning which allows us to incrementally adapt a pre-trained model to a new data distribution based on user input.
We find that, while data augmentation worsens results, even when relatively small amounts of data are available, episodic memory is an effective strategy to retain knowledge from previously seen clusters.
- Score: 8.584932159968002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interactive machine learning (IML) is a beneficial learning paradigm in cases
of limited data availability, as human feedback is incrementally integrated
into the training process. In this paper, we present an IML pipeline for image
captioning which allows us to incrementally adapt a pre-trained image
captioning model to a new data distribution based on user input. In order to
incorporate user input into the model, we explore the use of a combination of
simple data augmentation methods to obtain larger data batches for each newly
annotated data instance and implement continual learning methods to prevent
catastrophic forgetting from repeated updates. For our experiments, we split a
domain-specific image captioning dataset, namely VizWiz, into non-overlapping
parts to simulate an incremental input flow for continually adapting the model
to new data. We find that, while data augmentation worsens results, even when
relatively small amounts of data are available, episodic memory is an effective
strategy to retain knowledge from previously seen clusters.
Related papers
- Data-efficient Event Camera Pre-training via Disentangled Masked
Modeling [20.987277885575963]
We present a new data-supervised voxel-based self-supervised learning method for event cameras.
Our method overcomes the limitations of previous methods, which either sacrifice temporal information or directly employ paired image data.
It exhibits excellent generalization performance and demonstrates significant improvements across various tasks with fewer parameters and lower computational costs.
arXiv Detail & Related papers (2024-03-01T10:02:25Z) - Imitation Learning Inputting Image Feature to Each Layer of Neural
Network [1.6574413179773757]
Imitation learning enables robots to learn and replicate human behavior from training data.
Recent advances in machine learning enable end-to-end learning approaches that directly process high-dimensional observation data, such as images.
This paper presents a useful method to address this challenge, which amplifies the influence of data with a relatively low correlation to the output.
arXiv Detail & Related papers (2024-01-18T02:44:18Z) - Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt
Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to over-fit dominant image features in the training data.
This paper presents a Bayesian-based framework of prompt learning, which could alleviate the overfitting issues on few-shot learning application.
arXiv Detail & Related papers (2024-01-09T10:15:59Z) - Improving Image Recognition by Retrieving from Web-Scale Image-Text Data [68.63453336523318]
We introduce an attention-based memory module, which learns the importance of each retrieved example from the memory.
Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query.
We show that it achieves state-of-the-art accuracies in ImageNet-LT, Places-LT and Webvision datasets.
arXiv Detail & Related papers (2023-04-11T12:12:05Z) - Semi-Supervised Image Captioning by Adversarially Propagating Labeled
Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from a paired data and to progressively associate unpaired data.
Our extensive and comprehensive empirical results both on (1) image-based and (2) dense region-based captioning datasets followed by comprehensive analysis on the scarcely-paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z) - Generative Negative Text Replay for Continual Vision-Language
Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z) - A Memory Transformer Network for Incremental Learning [64.0410375349852]
We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.
Despite the straightforward problem formulation, the naive application of classification models to class-incremental learning results in the "catastrophic forgetting" of previously seen classes.
One of the most successful existing methods has been the use of a memory of exemplars, which overcomes the issue of catastrophic forgetting by saving a subset of past data into a memory bank and utilizing it to prevent forgetting when training future tasks.
arXiv Detail & Related papers (2022-10-10T08:27:28Z) - Data Augmentation for Meta-Learning [58.47185740820304]
meta-learning algorithms sample data, query data, and tasks on each training step.
Data augmentation can be used not only to expand the number of images available per class, but also to generate entirely new classes/tasks.
Our proposed meta-specific data augmentation significantly improves the performance of meta-learners on few-shot classification benchmarks.
arXiv Detail & Related papers (2020-10-14T13:48:22Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.