CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal
Pre-trained Knowledge
- URL: http://arxiv.org/abs/2211.09371v3
- Date: Sun, 19 Mar 2023 12:31:20 GMT
- Title: CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal
Pre-trained Knowledge
- Authors: Linli Yao, Weijing Chen, Qin Jin
- Abstract summary: We propose a plug-and-play framework, i.e. CapEnrich, to complement the generic image descriptions with more semantic details.
Our method significantly improves the descriptiveness and diversity of generated sentences for web images.
- Score: 44.31783230767321
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically generating textual descriptions for massive unlabeled images on
the web can greatly benefit realistic web applications, e.g. multimodal
retrieval and recommendation. However, existing models suffer from the problem
of generating "over-generic" descriptions, such as their tendency to generate
repetitive sentences with common concepts for different images. These generic
descriptions fail to provide sufficient textual semantics for ever-changing web
images. Inspired by the recent success of Vision-Language Pre-training (VLP)
models that learn diverse image-text concept alignment during pretraining, we
explore leveraging their cross-modal pre-trained knowledge to automatically
enrich the textual semantics of image descriptions. With no need for additional
human annotations, we propose a plug-and-play framework, i.e. CapEnrich, to
complement the generic image descriptions with more semantic details.
Specifically, we first propose an automatic data-building strategy to get
desired training sentences, based on which we then adopt prompting strategies,
i.e. learnable and template prompts, to incentivize VLP models to generate more
textual details. For learnable templates, we fix the whole VLP model and only
tune the prompt vectors, which leads to two advantages: 1) the pre-training
knowledge of VLP models can be reserved as much as possible to describe diverse
visual concepts; 2) only lightweight trainable parameters are required, so it
is friendly to low data resources. Extensive experiments show that our method
significantly improves the descriptiveness and diversity of generated sentences
for web images. The code is available at https://github.com/yaolinli/CapEnrich.
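For intuition on the learnable-prompt setup described in the abstract, below is a minimal PyTorch-style sketch, not the authors' implementation; the decoder interface and all names are illustrative assumptions. The whole VLP model is frozen and only a small set of prompt vectors, prepended to the generic caption embeddings, is trained.

```python
import torch
import torch.nn as nn

class PromptedCaptionEnricher(nn.Module):
    """Sketch: frozen VLP text decoder + learnable prompt vectors.

    `vlp_decoder` is assumed to be any pre-trained text decoder that accepts
    `inputs_embeds` and cross-attends to image features (hypothetical interface).
    """

    def __init__(self, vlp_decoder, embed_dim=768, num_prompts=8):
        super().__init__()
        self.vlp_decoder = vlp_decoder
        # Freeze the whole VLP model to preserve its pre-trained knowledge.
        for p in self.vlp_decoder.parameters():
            p.requires_grad = False
        # Only these prompt vectors are trained (lightweight, low-resource friendly).
        self.prompt_embeds = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, image_feats, caption_embeds):
        # Prepend the learnable prompts to the generic caption embeddings,
        # then let the frozen decoder continue the sentence with more details.
        batch = caption_embeds.size(0)
        prompts = self.prompt_embeds.unsqueeze(0).expand(batch, -1, -1)
        inputs_embeds = torch.cat([prompts, caption_embeds], dim=1)
        return self.vlp_decoder(inputs_embeds=inputs_embeds,
                                encoder_hidden_states=image_feats)

# Only the prompt vectors would appear in the optimizer, e.g.:
# optimizer = torch.optim.AdamW([model.prompt_embeds], lr=1e-4)
```

Because only `prompt_embeds` receives gradients, the trainable footprint stays tiny, which is consistent with the abstract's claim of being friendly to low data resources.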
Related papers
- FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings.
Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z)
- Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation [70.95783968368124]
We introduce a novel multi-modal autoregressive model, dubbed InstaManip.
We propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages.
Our method surpasses previous few-shot image manipulation models by a notable margin.
arXiv Detail & Related papers (2024-12-02T01:19:21Z)
- Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions [30.08331098481379]
We propose an innovative framework termed Image Textualization (IT).
IT automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models.
We show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquires an improved capability to generate richer image descriptions.
arXiv Detail & Related papers (2024-06-11T17:37:45Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding [58.70423899829642]
We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
arXiv Detail & Related papers (2022-10-07T06:42:06Z)
- ClipCap: CLIP Prefix for Image Captioning [6.69087470775851]
We use the CLIP encoding as a prefix to the caption via a simple mapping network, and then fine-tune a language model to generate the image captions (see the sketch after this list).
We demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
arXiv Detail & Related papers (2021-11-18T14:49:15Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- Understanding Guided Image Captioning Performance across Domains [22.283016988026926]
We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text.
Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets.
arXiv Detail & Related papers (2020-12-04T00:05:02Z)
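As referenced in the ClipCap entry above, here is a rough sketch of the prefix idea it describes, using Hugging Face CLIP and GPT-2; the mapping-network shape and prefix length are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, GPT2LMHeadModel

class ClipPrefixMapper(nn.Module):
    """Maps a CLIP image embedding to a sequence of GPT-2 prefix embeddings."""

    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len, gpt_dim * prefix_len),
        )

    def forward(self, clip_embed):                  # (batch, clip_dim)
        prefix = self.mlp(clip_embed)               # (batch, prefix_len * gpt_dim)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
mapper = ClipPrefixMapper()

def caption_logits(pixel_values, caption_ids):
    # Image embedding from the frozen CLIP encoder.
    with torch.no_grad():
        img_embed = clip.get_image_features(pixel_values=pixel_values)
    prefix_embeds = mapper(img_embed)                    # learned prefix
    token_embeds = gpt2.transformer.wte(caption_ids)     # caption token embeddings
    inputs_embeds = torch.cat([prefix_embeds, token_embeds], dim=1)
    return gpt2(inputs_embeds=inputs_embeds).logits
```

The sketch shows only a training-time forward pass; at inference, caption tokens would be generated autoregressively starting from the prefix embeddings.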