ControlCap: Controllable Region-level Captioning
- URL: http://arxiv.org/abs/2401.17910v3
- Date: Sat, 9 Mar 2024 10:23:26 GMT
- Title: ControlCap: Controllable Region-level Captioning
- Authors: Yuzhong Zhao, Yue Liu, Zonghao Guo, Weijia Wu, Chen Gong, Fang Wan,
Qixiang Ye
- Abstract summary: Region-level captioning is challenged by the caption degeneration issue.
Pre-trained multimodal models tend to predict the most frequent captions but miss the less frequent ones.
We propose a controllable region-level captioning approach, which introduces control words to a multimodal model.
- Score: 57.57406480228619
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Region-level captioning is challenged by the caption degeneration issue,
which refers to the tendency of pre-trained multimodal models to predict the most
frequent captions while missing the less frequent ones. In this study, we propose a
controllable region-level captioning (ControlCap) approach, which introduces
control words to a multimodal model to address the caption degeneration issue.
Specifically, ControlCap leverages a discriminative module to generate control
words within the caption space that partition it into multiple sub-spaces. The
multimodal model is constrained to generate captions within a few sub-spaces
containing the control words, which increases the chance of hitting less
frequent captions and alleviates the caption degeneration issue. Furthermore,
interactive control words can be given by either a human or an expert model,
which enables captioning beyond the training caption space, enhancing the
model's generalization ability. Extensive experiments on the Visual Genome and
RefCOCOg datasets show that ControlCap improves the CIDEr score by 21.6 and 2.2,
respectively, outperforming state-of-the-art methods by significant margins. Code
is available at https://github.com/callsys/ControlCap.
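To make the control-word mechanism concrete, the sketch below illustrates the idea under assumed interfaces (names such as RegionTagger and build_caption_prompt are hypothetical placeholders, not the released ControlCap code): a discriminative head scores control words for a region, and the selected words are prepended to the captioning prompt so the captioner decodes within the corresponding caption sub-space; the same prompt slot also accepts interactive control words from a human or an expert model.
```python
# Minimal sketch of control-word-guided region captioning; module and prompt
# formats are illustrative assumptions, not the actual ControlCap code.
import torch
import torch.nn as nn


class RegionTagger(nn.Module):
    """Discriminative module: scores candidate control words for a region."""

    def __init__(self, feat_dim: int, vocab: list[str]):
        super().__init__()
        self.vocab = vocab
        self.classifier = nn.Linear(feat_dim, len(vocab))

    def forward(self, region_feat: torch.Tensor, top_k: int = 2) -> list[str]:
        logits = self.classifier(region_feat)           # (len(vocab),)
        top = torch.topk(logits, k=top_k).indices.tolist()
        return [self.vocab[i] for i in top]             # predicted control words


def build_caption_prompt(control_words: list[str]) -> str:
    """Control words prepended to the prompt narrow the caption sub-space
    that the multimodal captioner decodes from."""
    return f"[CTRL] {', '.join(control_words)} [SEP] describe the region:"


if __name__ == "__main__":
    vocab = ["color", "action", "material", "count"]
    tagger = RegionTagger(feat_dim=256, vocab=vocab)
    region_feat = torch.randn(256)                      # stand-in region feature

    # Automatic control words from the discriminative module ...
    auto_words = tagger(region_feat, top_k=2)
    # ... or interactive control words given by a human or an expert model.
    user_words = ["color"]

    print(build_caption_prompt(auto_words))
    print(build_caption_prompt(user_words))
```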
Related papers
- Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation [118.5096631571738]
We present Any2Caption, a novel framework for controllable video generation under any condition.
By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs--text, images, videos, and specialized cues such as region, motion, and camera poses--into dense, structured captions.
Comprehensive evaluations demonstrate significant improvements of our system in controllability and video quality across various aspects of existing video generation models.
arXiv Detail & Related papers (2025-03-31T17:59:01Z)
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models [63.01630478059315]
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance.
However, how synthetic captions interact with the original web-crawled AltTexts during pre-training is still not well understood.
We propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models.
arXiv Detail & Related papers (2024-10-03T17:54:52Z)
- Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights [28.963204452040813]
Contextualized Image Captioning (CIC) evolves traditional image captioning into a more complex domain.
This paper introduces the novel domain of Controllable Contextualized Image Captioning (Ctrl-CIC).
We present two approaches, Prompting-based Controller (P-Ctrl) and Recalibration-based Controller (R-Ctrl) to generate focused captions.
arXiv Detail & Related papers (2024-07-16T07:32:48Z)
- ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec [50.273832905535485]
We present ControlSpeech, a text-to-speech (TTS) system capable of fully mimicking the speaker's voice and enabling arbitrary control and adjustment of speaking style.
Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and adjustment capabilities or were unrelated to speaker-specific voice generation.
arXiv Detail & Related papers (2024-06-03T11:15:16Z)
- SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models [84.71887272654865]
We present SparseCtrl to enable flexible structure control with temporally sparse signals.
It incorporates an additional condition to process these sparse signals while leaving the pre-trained T2V model untouched.
The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images.
arXiv Detail & Related papers (2023-11-28T16:33:08Z)
- Caption Anything: Interactive Image Description with Diverse Multimodal Controls [14.628597750669275]
Controllable image captioning aims to describe the image with natural language following human purpose.
We present Caption AnyThing, a foundation model augmented image captioning framework.
Powered by Segment Anything Model (SAM) and ChatGPT, we unify the visual and language prompts into a modularized framework.
arXiv Detail & Related papers (2023-05-04T09:48:22Z)
- Controllable Image Captioning via Prompting [9.935191668056463]
We show that a unified model can perform well in diverse domains and freely switch among multiple styles.
To be specific, we design a set of prompts to fine-tune the pre-trained image captioner.
In the inference stage, our model is able to generate desired stylized captions by choosing the corresponding prompts.
arXiv Detail & Related papers (2022-12-04T11:59:31Z)
- Controllable Image Captioning [0.0]
We introduce a novel framework for image captioning which can generate diverse descriptions by capturing the co-dependence between Part-Of-Speech tags and semantics.
We propose a method to generate captions through a Transformer network, which predicts words based on the input Part-Of-Speech tag sequences.
arXiv Detail & Related papers (2022-04-28T07:47:49Z)
- Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes a video semantic representation as input and conditionally modulates the gates and cells of a long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
arXiv Detail & Related papers (2021-12-02T09:24:45Z)
- Length-Controllable Image Captioning [67.2079793803317]
We propose to use a simple length level embedding to endow image captioning models with the ability to control caption length.
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We further devise a non-autoregressive image captioning approach that can generate captions with a complexity independent of the caption length.
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
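As an illustration of the length-level idea in the last entry, the sketch below maps a target caption length to a discrete level and adds a level embedding to the decoder's token embeddings; the bucket boundaries, dimensions, and class names are illustrative assumptions, not the paper's implementation.
```python
# Minimal sketch of a length level embedding for a generic captioning decoder;
# bucket boundaries and dimensions are assumed for illustration only.
import torch
import torch.nn as nn

LENGTH_BUCKETS = [(1, 9), (10, 14), (15, 19), (20, 25)]  # assumed level ranges


def length_to_level(target_len: int) -> int:
    """Map a desired caption length to a discrete length level."""
    for level, (lo, hi) in enumerate(LENGTH_BUCKETS):
        if lo <= target_len <= hi:
            return level
    return len(LENGTH_BUCKETS) - 1


class LengthConditionedEmbedding(nn.Module):
    """Adds a length level embedding to every token embedding so the decoder
    can condition caption generation on a requested length."""

    def __init__(self, vocab_size: int, dim: int, num_levels: int = len(LENGTH_BUCKETS)):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.level = nn.Embedding(num_levels, dim)

    def forward(self, token_ids: torch.Tensor, target_len: int) -> torch.Tensor:
        level = torch.tensor(length_to_level(target_len))
        return self.tok(token_ids) + self.level(level)   # broadcast over tokens


if __name__ == "__main__":
    emb = LengthConditionedEmbedding(vocab_size=1000, dim=64)
    tokens = torch.tensor([[2, 45, 7, 99]])               # dummy token ids
    print(emb(tokens, target_len=12).shape)               # torch.Size([1, 4, 64])
```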
This list is automatically generated from the titles and abstracts of the papers on this site.