Teacher-Critical Training Strategies for Image Captioning
- URL: http://arxiv.org/abs/2009.14405v1
- Date: Wed, 30 Sep 2020 03:15:12 GMT
- Title: Teacher-Critical Training Strategies for Image Captioning
- Authors: Yiqing Huang, Jiansheng Chen
- Abstract summary: We introduce a teacher model that serves as a bridge between the ground-truth caption and the caption model.
We propose Teacher-Critical Training Strategies (TCTS) for both XE and RL training to facilitate better learning processes for the caption model.
- Score: 12.245773188050618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing image captioning models are usually trained by cross-entropy (XE)
loss and reinforcement learning (RL), which set ground-truth words as hard
targets and force the captioning model to learn from them. However, the widely
adopted training strategies suffer from misalignment in XE training and
inappropriate reward assignment in RL training. To tackle these problems, we
introduce a teacher model that serves as a bridge between the ground-truth
caption and the caption model by generating some easier-to-learn word proposals
as soft targets. The teacher model is constructed by incorporating the
ground-truth image attributes into the baseline caption model. To effectively
learn from the teacher model, we propose Teacher-Critical Training Strategies
(TCTS) for both XE and RL training to facilitate better learning processes for
the caption model. Experimental evaluations of several widely adopted caption
models on the benchmark MSCOCO dataset show the proposed TCTS comprehensively
enhances most evaluation metrics, especially the Bleu and Rouge-L scores, in
both training stages. TCTS achieves the best published single-model Bleu-4 and
Rouge-L performances to date, 40.2% and 59.4%, on the MSCOCO Karpathy test
split. Our code and pre-trained models will be open-sourced.
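To make the soft-target idea in XE training concrete, the sketch below shows a generic teacher-guided captioning loss in PyTorch: the student minimizes a mixture of hard cross-entropy on the ground-truth caption and a KL term toward the teacher's per-word distribution. The function name, mixing weight, and temperature are illustrative assumptions, not the paper's exact TCTS formulation, which additionally reworks reward assignment in the RL stage.

```python
import torch
import torch.nn.functional as F

def soft_target_xe_loss(student_logits, teacher_logits, gt_tokens,
                        soft_weight=0.5, temperature=1.0, pad_id=0):
    """Generic teacher-guided XE loss (illustrative sketch, not the exact TCTS).

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    gt_tokens: (batch, seq_len) ground-truth caption token ids
    """
    vocab_size = student_logits.size(-1)
    mask = (gt_tokens != pad_id).float()  # ignore padded positions

    # Hard-target term: standard cross-entropy against the ground-truth words.
    hard = F.cross_entropy(student_logits.reshape(-1, vocab_size),
                           gt_tokens.reshape(-1), reduction="none")
    hard = (hard * mask.reshape(-1)).sum() / mask.sum()

    # Soft-target term: KL divergence toward the (frozen) teacher's
    # easier-to-learn word proposals at every time step.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    soft = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)
    soft = (soft * mask).sum() / mask.sum()

    # Blend the two targets; the temperature**2 factor follows the usual
    # knowledge-distillation convention.
    return (1.0 - soft_weight) * hard + soft_weight * (temperature ** 2) * soft
```

In this reading, the teacher model (the baseline captioner augmented with ground-truth image attributes) would supply teacher_logits; how the blend weight is set or scheduled over training is left open here.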
Related papers
- STCL:Curriculum learning Strategies for deep learning image steganography models [8.251354931895667]
This paper proposes a Steganography Curriculum Learning training strategy (STCL) for deep learning image steganography models.
The strategy includes a difficulty evaluation strategy based on the teacher model and a knee point-based training scheduling strategy.
Experimental results on three large public datasets, ALASKA2, VOC2012 and ImageNet, show that the proposed image steganography scheme is able to improve the model performance.
arXiv Detail & Related papers (2025-04-24T14:34:41Z) - A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models [17.144311122664508]
A large-scale vision and language model that has been pretrained on massive data encodes visual and linguistic priors.
We propose a chain-of-thought (CoT) meta-learning scheme as a multi-step image captioning procedure to better imitate how humans describe images.
arXiv Detail & Related papers (2025-02-19T18:35:43Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD matches or exceeds the performance of leading methods across various model architectures and sizes, reducing training time by up to a factor of four.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model [49.587821411012705]
We propose ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model.
It distills the knowledge from a large teacher CLIP model into a smaller student model, ensuring comparable performance with significantly reduced parameters.
EduAttention explores the cross-relationships between text features extracted by the teacher model and image features extracted by the student model.
arXiv Detail & Related papers (2024-08-08T01:12:21Z) - CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning [4.004641316826348]
We introduce a novel language-image Contrastive Learning method with an Efficient large language model and prompt Fine-Tuning (CLEFT).
Our method demonstrates state-of-the-art performance on multiple chest X-ray and mammography datasets.
The proposed parameter efficient framework can reduce the total trainable model size by 39% and reduce the trainable language model to only 4% compared with the current BERT encoder.
arXiv Detail & Related papers (2024-07-30T17:57:32Z) - Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z) - UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot
Vision-Language Tasks [60.46473247205654]
Using large-scale unsupervised unimodal models as pre-training can enhance the zero-shot performance of image-text pair models.
Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models.
arXiv Detail & Related papers (2023-06-07T18:26:22Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first method to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z) - ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised
Image-Text Data [9.3935916515127]
We introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding.
Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them.
arXiv Detail & Related papers (2020-01-22T11:35:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.