CAPEEN: Image Captioning with Early Exits and Knowledge Distillation
- URL: http://arxiv.org/abs/2410.04433v1
- Date: Sun, 6 Oct 2024 10:05:01 GMT
- Title: CAPEEN: Image Captioning with Early Exits and Knowledge Distillation
- Authors: Divya Jyoti Bajpai, Manjesh Kumar Hanawal
- Abstract summary: Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-captioning tasks.
Early Exit (EE) strategies can be used to enhance their efficiency, but adapting them to image captioning is challenging because accurate predictions require varying levels of semantic information.
We introduce CAPEEN to improve the performance of EE strategies using knowledge distillation.
- Score: 5.402030962296633
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-captioning tasks. However, their improved performance comes at the cost of increased computational burden and inference latency. Early Exit (EE) strategies can be used to enhance their efficiency, but their adaptation presents challenges in image captioning, as accurate predictions require varying levels of semantic information. To overcome this, we introduce CAPEEN to improve the performance of EE strategies using knowledge distillation. Inference in CAPEEN is completed at intermediary layers if prediction confidence exceeds a predefined value learned from the training data. To account for real-world deployments, where target distributions may drift from that of the training samples, we introduce a variant, A-CAPEEN, which adapts the thresholds on the fly using a multi-armed bandit framework. Experiments on the MS COCO and Flickr30k datasets show that CAPEEN achieves a speedup of 1.77x while maintaining competitive performance compared to the final layer, and A-CAPEEN additionally offers robustness against distortions. The source code is available at https://github.com/Div290/CapEEN
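The early-exit mechanism described in the abstract can be illustrated with a short PyTorch sketch. This is a minimal, hypothetical example rather than the released implementation: the class name `EarlyExitDecoder`, the per-layer heads, and the threshold `tau` are illustrative assumptions; in CAPEEN the intermediate heads are additionally trained with knowledge distillation from the final layer, and in A-CAPEEN `tau` is adapted online with a multi-armed bandit.

```python
import torch
import torch.nn as nn

class EarlyExitDecoder(nn.Module):
    """Illustrative early-exit caption decoder: every layer has its own
    prediction head, and decoding stops at the first layer whose confidence
    exceeds a threshold tau (learned from training data in CAPEEN)."""

    def __init__(self, d_model=512, vocab_size=10000, num_layers=6, nhead=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_layers)]
        )
        # One exit head per layer; in the paper these are trained so that
        # early heads mimic the final layer (knowledge distillation).
        self.exit_heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(num_layers)]
        )

    @torch.no_grad()
    def next_token_logits(self, tgt, memory, tau=0.9):
        """Return next-token logits from the earliest sufficiently confident layer."""
        hidden = tgt
        for depth, (layer, head) in enumerate(zip(self.layers, self.exit_heads)):
            hidden = layer(hidden, memory)            # (batch, seq, d_model)
            logits = head(hidden[:, -1])              # logits for the next token
            confidence = logits.softmax(-1).max(-1).values
            if bool((confidence >= tau).all()):       # exit early when confident
                return logits, depth
        return logits, depth                          # otherwise use the final layer
```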
Related papers
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to over-fit dominant image features in the training data.
This paper presents a Bayesian framework for prompt learning that can alleviate overfitting in few-shot learning applications.
arXiv Detail & Related papers (2024-01-09T10:15:59Z) - Value Explicit Pretraining for Learning Transferable Representations [11.069853883599102]
We propose a method that learns generalizable representations for transfer reinforcement learning.
We learn new tasks that share objectives with previously learned tasks by training an encoder for objective-conditioned representations.
Experiments using a realistic navigation simulator and Atari benchmark show that the pretrained encoder produced by our method outperforms current SoTA pretraining methods.
arXiv Detail & Related papers (2023-12-19T17:12:35Z) - ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The intrinsic noise and unmatched image-text pairs in web data can degrade representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
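As a rough illustration of the bi-path supervision idea, the sketch below mixes two CLIP-style contrastive losses, one against the raw web text and one against the synthetic caption. The fixed weight `alpha` is a simplifying assumption; ALIP itself weights the two paths adaptively per sample.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def bi_path_loss(image_emb, raw_text_emb, synthetic_caption_emb, alpha=0.5):
    """Fixed-weight mixture of raw-text and synthetic-caption supervision
    (a simplification of ALIP's adaptive, sample-wise weighting)."""
    return alpha * clip_style_loss(image_emb, raw_text_emb) + \
           (1.0 - alpha) * clip_style_loss(image_emb, synthetic_caption_emb)
```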
arXiv Detail & Related papers (2023-08-16T15:19:52Z) - Simple Token-Level Confidence Improves Caption Correctness [117.33497608933169]
Token-Level Confidence, or TLC, is a simple yet surprisingly effective method to assess caption correctness.
We fine-tune a vision-language model on image captioning, input an image and a proposed caption to the model, and aggregate token confidences over words or sequences to estimate image-caption consistency.
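A minimal sketch of that token-confidence aggregation, assuming teacher-forced per-token logits from the fine-tuned captioner; the function name and the mean/min reductions are illustrative stand-ins for the paper's word- and sequence-level variants.

```python
import torch.nn.functional as F

def caption_consistency_score(token_logits, caption_ids, reduce="mean"):
    """Aggregate per-token confidences of a proposed caption.

    token_logits: (seq_len, vocab) logits from the captioner conditioned on the
                  image and the caption prefix (teacher forcing).
    caption_ids:  (seq_len,) token ids of the proposed caption.
    """
    probs = F.softmax(token_logits, dim=-1)
    token_conf = probs.gather(-1, caption_ids.unsqueeze(-1)).squeeze(-1)
    return token_conf.mean() if reduce == "mean" else token_conf.min()
```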
arXiv Detail & Related papers (2023-05-11T17:58:17Z) - Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models [108.13378788663196]
We propose Subspace Prompt Tuning (SubPT) to project the gradients in back-propagation onto the low-rank subspace spanned by the early-stage gradient flow eigenvectors during the entire training process.
We equip CoOp with a Novel Feature Learner (NFL) to enhance the generalization ability of the learned prompts to novel categories beyond the training set.
arXiv Detail & Related papers (2022-11-04T02:06:22Z) - Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline that incurs no additional computation.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
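The re-scaling idea can be sketched as follows; this is a hypothetical rendering of ALR under the assumption that each predicted novel-class weight vector is re-scaled to the average L2 norm of the pretrained base-class weights, not the paper's exact formulation.

```python
import torch

def adaptive_length_rescale(novel_weights, base_weights):
    """Rescale predicted novel-class weights (num_novel, dim) so their norms
    match the mean norm of pretrained base-class weights (num_base, dim),
    keeping novel and base logits on a comparable scale."""
    target_norm = base_weights.norm(dim=1).mean()
    novel_norms = novel_weights.norm(dim=1, keepdim=True)
    return novel_weights * (target_norm / novel_norms.clamp_min(1e-8))
```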
arXiv Detail & Related papers (2022-03-23T06:24:31Z) - Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision-transformer-based image-captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z) - Consensual Collaborative Training And Knowledge Distillation Based Facial Expression Recognition Under Noisy Annotations [2.538209532048867]
This work proposes an effective strategy for training in the presence of noisy labels, called the Consensual Collaborative Training (CCT) framework.
CCT co-trains three networks jointly using a convex combination of supervision loss and consistency loss.
State-of-the-art performance is reported on the benchmark FER datasets RAF-DB (90.84%), FERPlus (89.99%), and AffectNet (66%).
arXiv Detail & Related papers (2021-07-10T03:37:06Z) - Affect Expression Behaviour Analysis in the Wild using Consensual Collaborative Training [2.538209532048867]
This report presents Consensual Collaborative Training (CCT) framework used in our submission to the Affective Behaviour Analysis in-the-wild (ABAW) 2021 competition.
CCT co-trains three networks jointly using a convex combination of supervision loss and consistency loss.
Co-training reduces overall error, and consistency loss prevents overfitting to noisy samples.
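The convex combination of losses can be sketched as below; the mean-prediction consistency term and the fixed weight `lam` are assumptions made for illustration, since the CCT papers define their own consistency loss and weighting schedule.

```python
import torch
import torch.nn.functional as F

def cct_loss(logits_list, labels, lam=0.9):
    """Convex combination of supervision and consistency losses for three
    co-trained networks (illustrative sketch).

    logits_list: list of three (batch, num_classes) logit tensors.
    labels:      (batch,) possibly noisy class labels.
    """
    # Supervision: cross-entropy of each network against the given labels.
    supervision = sum(F.cross_entropy(l, labels) for l in logits_list) / len(logits_list)

    # Consistency: pull each network's posterior toward the mean prediction.
    probs = [F.softmax(l, dim=-1) for l in logits_list]
    mean_prob = torch.stack(probs).mean(dim=0)
    consistency = sum(F.mse_loss(p, mean_prob) for p in probs) / len(probs)

    return lam * supervision + (1.0 - lam) * consistency
```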
arXiv Detail & Related papers (2021-07-08T04:28:21Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep training on the enlarged dataset tractable, we apply a dataset distillation strategy to compress it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.