CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal
Pre-trained Knowledge
- URL: http://arxiv.org/abs/2211.09371v3
- Date: Sun, 19 Mar 2023 12:31:20 GMT
- Title: CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal
Pre-trained Knowledge
- Authors: Linli Yao, Weijing Chen, Qin Jin
- Abstract summary: We propose a plug-and-play framework, i.e. CapEnrich, to complement the generic image descriptions with more semantic details.
Our method significantly improves the descriptiveness and diversity of generated sentences for web images.
- Score: 44.31783230767321
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically generating textual descriptions for massive unlabeled images on
the web can greatly benefit realistic web applications, e.g. multimodal
retrieval and recommendation. However, existing models tend to generate
"over-generic" descriptions, such as repetitive sentences with common concepts
for different images. These generic
descriptions fail to provide sufficient textual semantics for ever-changing web
images. Inspired by the recent success of Vision-Language Pre-training (VLP)
models that learn diverse image-text concept alignment during pretraining, we
explore leveraging their cross-modal pre-trained knowledge to automatically
enrich the textual semantics of image descriptions. With no need for additional
human annotations, we propose a plug-and-play framework, i.e. CapEnrich, to
complement the generic image descriptions with more semantic details.
Specifically, we first propose an automatic data-building strategy to get
desired training sentences, based on which we then adopt prompting strategies,
i.e. learnable and template prompts, to incentivize VLP models to generate more
textual details. For the learnable prompts, we fix the whole VLP model and only
tune the prompt vectors, which leads to two advantages: 1) the pre-trained
knowledge of VLP models can be preserved as much as possible to describe
diverse visual concepts; 2) only lightweight trainable parameters are required,
so the approach is friendly to low-resource data settings. Extensive
experiments show that our method
significantly improves the descriptiveness and diversity of generated sentences
for web images. The code is available at https://github.com/yaolinli/CapEnrich.
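To illustrate the frozen-backbone prompt-tuning idea described in the abstract, below is a minimal PyTorch sketch; the decoder stand-in, embedding size, prompt length, and optimizer settings are illustrative assumptions, not the actual CapEnrich implementation (see the repository above for that):

import torch
import torch.nn as nn

class PromptTunedCaptioner(nn.Module):
    """Frozen VLP decoder plus a small set of trainable prompt vectors (a sketch)."""

    def __init__(self, vlp_decoder: nn.Module, embed_dim: int, num_prompts: int = 8):
        super().__init__()
        self.decoder = vlp_decoder
        for p in self.decoder.parameters():       # freeze the whole pre-trained model
            p.requires_grad = False
        # Only these prompt vectors are trained (lightweight, low-data friendly).
        self.prompt = nn.Parameter(0.02 * torch.randn(num_prompts, embed_dim))

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim) embeddings of image + caption tokens
        prompts = self.prompt.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return self.decoder(torch.cat([prompts, token_embeds], dim=1))

# Toy usage with an identity module standing in for the real VLP decoder.
model = PromptTunedCaptioner(nn.Identity(), embed_dim=768)
output = model(torch.randn(2, 5, 768))                    # -> shape (2, 8 + 5, 768)
optimizer = torch.optim.AdamW([model.prompt], lr=1e-3)    # updates the prompts only

Because every pre-trained weight stays frozen, the cross-modal knowledge of the VLP model is kept intact and only the handful of prompt vectors needs to be learned, which is what makes this kind of tuning practical with little training data.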
Related papers
- Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions [30.08331098481379]
We propose an innovative framework termed Image Textualization (IT).
IT automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models.
We show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquires an improved capability to generate richer image descriptions.
arXiv Detail & Related papers (2024-06-11T17:37:45Z)
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings for each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
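As a rough illustration of that three-step pipeline (not the authors' implementation; the random embeddings, class count, and regularization strength are made-up stand-ins for CLIP-style features and LLM-written descriptions), one could build description-similarity features and fit an L1-penalized logistic regression like this:

import numpy as np
from sklearn.linear_model import LogisticRegression

def build_description_features(image_embs: np.ndarray, desc_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity of each image to each class description (rows: images)."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    desc_embs = desc_embs / np.linalg.norm(desc_embs, axis=1, keepdims=True)
    return image_embs @ desc_embs.T

# Toy data standing in for embeddings of 100 images and 40 LLM-written descriptions.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(100, 512))
desc_embs = rng.normal(size=(40, 512))
labels = rng.integers(0, 4, size=100)

features = build_description_features(image_embs, desc_embs)
# L1 penalty = "sparse" logistic regression: most description weights are driven to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(features, labels)
print("selected descriptions per class:", (clf.coef_ != 0).sum(axis=1))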
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding [58.70423899829642]
We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
arXiv Detail & Related papers (2022-10-07T06:42:06Z)
- ClipCap: CLIP Prefix for Image Captioning [6.69087470775851]
We use CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tune a language model to generate the image captions.
We demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
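A minimal sketch of that prefix idea follows, assuming a 512-dimensional CLIP embedding, 768-dimensional language-model token embeddings, and a 10-token prefix (all illustrative choices; this is not the authors' code):

import torch
import torch.nn as nn

class ClipPrefixMapper(nn.Module):
    """Maps one CLIP image embedding to a sequence of language-model prefix embeddings."""

    def __init__(self, clip_dim: int = 512, lm_dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, lm_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len // 2, lm_dim * prefix_len),
        )

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        # clip_emb: (batch, clip_dim) -> prefix: (batch, prefix_len, lm_dim)
        return self.mlp(clip_emb).view(-1, self.prefix_len, self.lm_dim)

prefix = ClipPrefixMapper()(torch.randn(2, 512))
print(prefix.shape)  # (2, 10, 768); concatenate with caption token embeddings before the LM.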
arXiv Detail & Related papers (2021-11-18T14:49:15Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- Understanding Guided Image Captioning Performance across Domains [22.283016988026926]
We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text.
Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets.
arXiv Detail & Related papers (2020-12-04T00:05:02Z)