Caption Anything: Interactive Image Description with Diverse Multimodal
Controls
- URL: http://arxiv.org/abs/2305.02677v3
- Date: Thu, 6 Jul 2023 13:47:21 GMT
- Title: Caption Anything: Interactive Image Description with Diverse Multimodal
Controls
- Authors: Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li,
Mingqi Gao, Shanshan Zhao
- Abstract summary: Controllable image captioning aims to describe the image with natural language following human purpose.
We present Caption AnyThing, a foundation model augmented image captioning framework.
Powered by Segment Anything Model (SAM) and ChatGPT, we unify the visual and language prompts into a modularized framework.
- Score: 14.628597750669275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Controllable image captioning is an emerging multimodal topic that aims to
describe the image with natural language following human purpose,
e.g., looking at the specified regions or telling in a particular
text style. State-of-the-art methods are trained on annotated pairs of input
controls and output captions. However, the scarcity of such well-annotated
multimodal data largely limits their usability and scalability for interactive
AI systems. Leveraging unimodal instruction-following foundation models is a
promising alternative that benefits from broader sources of data. In this
paper, we present Caption AnyThing (CAT), a foundation model augmented image
captioning framework supporting a wide range of multimodal controls: 1) visual
controls, including points, boxes, and trajectories; 2) language controls, such
as sentiment, length, language, and factuality. Powered by Segment Anything
Model (SAM) and ChatGPT, we unify the visual and language prompts into a
modularized framework, enabling flexible combinations of different
controls. Extensive case studies demonstrate the user intention alignment
capabilities of our framework, shedding light on effective user interaction
modeling in vision-language applications. Our code is publicly available at
https://github.com/ttengwang/Caption-Anything.
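The framework described above is modular: a SAM-style segmenter resolves the visual prompt into a region, an off-the-shelf captioner describes that region, and a ChatGPT-style refiner rewrites the raw caption to satisfy the language controls. The sketch below illustrates that control flow only; the dataclasses, function names, and dummy components are illustrative assumptions and do not reflect the API of the released repository.

# Hypothetical sketch of a CAT-style modular captioning pipeline (assumed names,
# not the released repository's API). Each stage is an injected callable so that
# a SAM wrapper, a region captioner, and a ChatGPT wrapper can be swapped freely.
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class VisualPrompt:
    kind: str      # "point", "box", or "trajectory"
    coords: list   # e.g. [(x, y)] for a point, [x0, y0, x1, y1] for a box


@dataclass
class LanguageControls:
    sentiment: str = "neutral"   # e.g. "positive", "negative"
    length: str = "short"        # e.g. "short", "detailed"
    language: str = "English"
    factual: bool = True


def caption_anything(
    image: Any,
    visual_prompt: VisualPrompt,
    controls: LanguageControls,
    segment: Callable[[Any, VisualPrompt], Any],   # e.g. SAM: prompt -> region crop/mask
    caption: Callable[[Any], str],                 # e.g. BLIP-2: region -> raw caption
    refine: Callable[[str, Dict[str, Any]], str],  # e.g. ChatGPT: caption + controls -> styled text
) -> str:
    # 1) Resolve the visual prompt into the user-specified region.
    region = segment(image, visual_prompt)
    # 2) Describe the region with a plain caption.
    raw_caption = caption(region)
    # 3) Rewrite the caption to follow the language controls.
    return refine(raw_caption, vars(controls))


if __name__ == "__main__":
    # Dummy components: they only demonstrate the control flow; no models are loaded.
    result = caption_anything(
        image="photo.jpg",
        visual_prompt=VisualPrompt(kind="point", coords=[(120, 80)]),
        controls=LanguageControls(sentiment="positive", length="detailed"),
        segment=lambda img, p: f"crop of {img} at {p.coords}",
        caption=lambda region: "a dog lying on the grass",
        refine=lambda text, c: f"[{c['sentiment']}/{c['length']}/{c['language']}] {text}",
    )
    print(result)

Because each stage is an injected callable, the segmenter, captioner, or refiner can be replaced independently, which is the flexibility the abstract emphasizes.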
Related papers
- OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction [32.08995899903304]
We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization.
Our approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability.
arXiv Detail & Related papers (2024-10-07T11:26:13Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal tasks and from multimodal tasks to language.
arXiv Detail & Related papers (2023-02-27T18:55:27Z)
- Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- UNIMO-2: End-to-End Unified Vision-Language Grounded Learning [46.914284894632]
We propose an end-to-end unified-modal pre-training framework, namely UNIMO-2.
We build a unified Transformer model to jointly learn visual representations, textual representations and semantic alignment between images and texts.
Our code and models are public at the UNIMO project page.
arXiv Detail & Related papers (2022-03-17T03:53:11Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.