Efficient Image Captioning for Edge Devices
- URL: http://arxiv.org/abs/2212.08985v1
- Date: Sun, 18 Dec 2022 01:56:33 GMT
- Title: Efficient Image Captioning for Edge Devices
- Authors: Ning Wang, Jiangrong Xie, Hang Luo, Qinglin Cheng, Jihao Wu, Mingbo
Jia, Linlin Li
- Abstract summary: We propose LightCap, a lightweight image captioner for resource-limited devices.
The core design is built on the recent CLIP model for efficient image captioning.
With the carefully designed architecture, our model merely contains 40M parameters, saving the model size by more than 75% and the FLOPs by more than 98%.
- Score: 8.724184244203892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed the rapid progress of image captioning. However,
the demands for large memory storage and heavy computational burden prevent
these captioning models from being deployed on mobile devices. The main
obstacles lie in the heavyweight visual feature extractors (i.e., object
detectors) and complicated cross-modal fusion networks. To this end, we propose
LightCap, a lightweight image captioner for resource-limited devices. The core
design is built on the recent CLIP model for efficient image captioning. To be
specific, on the one hand, we leverage the CLIP model to extract the compact
grid features without relying on the time-consuming object detectors. On the
other hand, we transfer the image-text retrieval design of CLIP to image
captioning scenarios by devising a novel visual concept extractor and a
cross-modal modulator. We further optimize the cross-modal fusion model and
parallel prediction heads via sequential and ensemble distillations. With the
carefully designed architecture, our model merely contains 40M parameters,
saving the model size by more than 75% and the FLOPs by more than 98% in
comparison with the current state-of-the-art methods. In spite of the low
capacity, our model still exhibits state-of-the-art performance on prevalent
datasets, e.g., 136.6 CIDEr on the COCO Karpathy test split. When tested on a
smartphone with only a single CPU, the proposed LightCap achieves a fast
inference speed of 188 ms per image, making it ready for practical applications.
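A minimal sketch of the detector-free pipeline the abstract describes: frozen CLIP grid features feeding a compact cross-modal fusion module and a caption head. The module sizes, vocabulary, and interface below are illustrative assumptions, not the released LightCap implementation; the visual concept extractor, cross-modal modulator, and distillation steps are omitted.

```python
# Sketch: CLIP grid features -> small cross-modal fusion -> caption head.
# All module sizes and the training interface are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=30522, d_model=384, n_layers=4, max_len=30):
        super().__init__()
        # Frozen CLIP image encoder supplies grid (patch) features; no object detector.
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        for p in self.vision.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(self.vision.config.hidden_size, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model, nhead=6, dim_feedforward=4 * d_model, batch_first=True)
        # Compact cross-modal fusion: a few decoder blocks attending to the grid features.
        self.fusion = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, pixel_values, caption_ids):
        # pixel_values: (B, 3, 224, 224); caption_ids: (B, T) caption token ids.
        grid = self.proj(self.vision(pixel_values=pixel_values).last_hidden_state)
        pos = torch.arange(caption_ids.size(1), device=caption_ids.device)
        tgt = self.token_emb(caption_ids) + self.pos_emb(pos)
        T = caption_ids.size(1)
        causal = torch.triu(  # standard causal mask for autoregressive decoding
            torch.full((T, T), float("-inf"), device=caption_ids.device), diagonal=1)
        fused = self.fusion(tgt, grid, tgt_mask=causal)
        return self.lm_head(fused)  # per-position next-token logits
```

Training such a sketch would minimize token-level cross-entropy over the returned logits; generation would decode autoregressively from a start token.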
Related papers
- Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection [13.840950434728533]
State-of-the-art Synthetic Image Detection (SID) research has led to strong evidence of the advantages of feature extraction from foundation models.
We leverage the image representations extracted by intermediate Transformer blocks of CLIP's image-encoder via a lightweight network.
Our method is compared against the state-of-the-art by evaluating it on 20 test datasets and exhibits an average +10.6% absolute performance improvement.
arXiv Detail & Related papers (2024-02-29T12:18:43Z)
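A minimal sketch of the intermediate-block idea from the entry above, assuming Hugging Face's CLIPVisionModel and a plain linear head as a stand-in for the paper's lightweight network:

```python
# Sketch: collect intermediate CLIP image-encoder block outputs and feed a small head.
# The block indices and the single linear head are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class IntermediateBlockHead(nn.Module):
    def __init__(self, blocks=(3, 6, 9), num_classes=2):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        for p in self.vision.parameters():
            p.requires_grad = False                 # foundation model stays frozen
        self.blocks = blocks
        hidden = self.vision.config.hidden_size
        self.head = nn.Linear(hidden * len(blocks), num_classes)  # lightweight classifier

    def forward(self, pixel_values):
        out = self.vision(pixel_values=pixel_values, output_hidden_states=True)
        # hidden_states[0] is the embedding output; Transformer block outputs start at 1.
        feats = [out.hidden_states[b][:, 0] for b in self.blocks]  # CLS token per block
        return self.head(torch.cat(feats, dim=-1))
```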
- MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval [7.233106731197739]
We propose a Multi-teacher Cross-modality Alignment Distillation (MCAD) technique to integrate the advantages of single- and dual-stream models.
We implement a lightweight CLIP model on Snapdragon/Dimensity chips with only ~100M running memory and ~8.0 ms search latency.
arXiv Detail & Related papers (2023-10-30T15:38:43Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving strong performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- ImageSig: A signature transform for ultra-lightweight image recognition [0.0]
ImageSig is based on computing signatures and does not require a convolutional structure or an attention-based encoder.
ImageSig shows unprecedented performance on hardware such as Raspberry Pi and Jetson-nano.
arXiv Detail & Related papers (2022-05-13T23:48:32Z)
- Leaner and Faster: Two-Stage Model Compression for Lightweight Text-Image Retrieval [18.088550230146247]
Current text-image approaches (e.g., CLIP) typically adopt a dual-encoder architecture using pre-trained vision-language representations.
We present an effective two-stage framework to compress large pre-trained dual-encoder for lightweight text-image retrieval.
arXiv Detail & Related papers (2022-04-29T07:29:06Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision-transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
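A rough sketch of the concept-token idea from the ViTCAP entry above; the pooling, concept vocabulary size, and top-k selection are illustrative assumptions rather than the paper's configuration:

```python
# Sketch: multi-label concept classification from ViT grid features; the top-k
# predicted concept embeddings are appended to the visual features that the
# caption decoder attends to. Sizes are assumptions.
import torch
import torch.nn as nn

class ConceptTokenNetwork(nn.Module):
    def __init__(self, d_model=768, num_concepts=1000, top_k=20):
        super().__init__()
        self.classifier = nn.Linear(d_model, num_concepts)   # concept classification head
        self.concept_emb = nn.Embedding(num_concepts, d_model)
        self.top_k = top_k

    def forward(self, grid_features):                 # (B, N_patches, d_model) from a ViT
        pooled = grid_features.mean(dim=1)            # simple pooling over patch tokens
        logits = self.classifier(pooled)              # multi-label concept logits (B, C)
        top_idx = logits.topk(self.top_k, dim=-1).indices
        concept_tokens = self.concept_emb(top_idx)    # (B, top_k, d_model)
        return torch.cat([grid_features, concept_tokens], dim=1), logits
```

The concept logits would be trained with a multi-label objective (e.g., binary cross-entropy) against ground-truth concept labels, while the concatenated features feed the caption decoder.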
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
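A minimal sketch of the modality-transition idea from the entry above, assuming the module is a small MLP and that the modality loss aligns transitioned visual features with sentence-level text embeddings (both assumptions for illustration):

```python
# Sketch: map visual features into a semantic space before the language model;
# the cosine-based "modality loss" below is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTransitionModule(nn.Module):
    def __init__(self, visual_dim=2048, semantic_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim, semantic_dim),
            nn.ReLU(),
            nn.Linear(semantic_dim, semantic_dim),
        )

    def forward(self, visual_feats):           # (B, visual_dim) pooled visual features
        return self.mlp(visual_feats)          # semantic-space representation for the LM

def modality_loss(pred_semantic, text_embedding):
    # Pull the transitioned visual representation toward the caption's text embedding.
    return 1.0 - F.cosine_similarity(pred_semantic, text_embedding, dim=-1).mean()
```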
This list is automatically generated from the titles and abstracts of the papers on this site.