"Let's not Quote out of Context": Unified Vision-Language Pretraining
for Context Assisted Image Captioning
- URL: http://arxiv.org/abs/2306.00931v1
- Date: Thu, 1 Jun 2023 17:34:25 GMT
- Title: "Let's not Quote out of Context": Unified Vision-Language Pretraining
for Context Assisted Image Captioning
- Authors: Abisek Rajakumar Kalarani, Pushpak Bhattacharyya, Niyati Chhaya, Sumit
Shekhar
- Abstract summary: We propose a new unified Vision-Language (VL) model based on the One For All (OFA) model.
Our approach aims to overcome the context-independent (image and text are treated independently) nature of the existing approaches.
Our system achieves state-of-the-art results with an improvement of up to 8.34 CIDEr score on the benchmark news image captioning datasets.
- Score: 40.01197694624958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Well-formed, context-aware image captions and tags in enterprise
content such as marketing material are critical to ensuring brand presence and
content recall. Creating and updating them manually is non-trivial given the
scale and tedium of the task. We propose a new unified Vision-Language (VL)
model based on the One For All (OFA) model, with a focus on context-assisted
image captioning, where the caption is generated from both the image and its
context. Our approach aims to overcome the context-independent nature (image
and text are treated independently) of existing approaches. We exploit context
by pretraining our model on datasets for three tasks: news image captioning,
where the news article is the context; contextual visual entailment; and
keyword extraction from the context. The second pretraining task is a new VL
task, and we construct and release two datasets for it, with 1.1M and 2.2K
data instances. Our system achieves state-of-the-art results with an
improvement of up to 8.34 CIDEr score on benchmark news image captioning
datasets. To the best of our knowledge, ours is the first effort to
incorporate contextual information when pretraining models for VL tasks.
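The unified OFA-style setup means all three pretraining tasks can be cast into the same image-plus-text-to-text format. The sketch below shows one plausible way to serialize such a multi-task mixture; the prompt templates, field names, label set, and truncation budget are illustrative assumptions rather than the authors' actual preprocessing.

```python
# Sketch: serializing the three context-assisted pretraining tasks into a
# single (image, source_text, target_text) format for an OFA-style
# sequence-to-sequence VL model. Templates and limits are assumptions.

from dataclasses import dataclass

MAX_CONTEXT_CHARS = 2000  # assumed truncation budget for the article/context


@dataclass
class VLExample:
    image_path: str
    source_text: str   # task instruction plus context, fed to the encoder
    target_text: str   # caption / entailment label / keyword list to decode


def news_captioning_example(image_path: str, article: str, caption: str) -> VLExample:
    """News image captioning: the news article is the context, the caption is the target."""
    context = article[:MAX_CONTEXT_CHARS]
    return VLExample(image_path,
                     f"caption the image given the context: {context}",
                     caption)


def visual_entailment_example(image_path: str, context: str,
                              hypothesis: str, label: str) -> VLExample:
    """Contextual visual entailment: does the hypothesis follow from the image and its context?"""
    source = (f"does the text follow from the image and context? "
              f"context: {context[:MAX_CONTEXT_CHARS]} text: {hypothesis}")
    return VLExample(image_path, source, label)  # e.g. "yes" / "no" / "maybe" (assumed label set)


def keyword_extraction_example(image_path: str, context: str, keywords: list[str]) -> VLExample:
    """Keyword extraction from the context, conditioned on the accompanying image."""
    return VLExample(image_path,
                     f"extract keywords from the context: {context[:MAX_CONTEXT_CHARS]}",
                     ", ".join(keywords))
```

Mixing such examples across the three tasks exposes the model to paired image-context inputs throughout pretraining, which is exactly the context-awareness the abstract argues existing VL pretraining lacks.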
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis [6.066100464517522]
We introduce the Abstractive News Captions with High-level cOntext Representation dataset, containing 70K+ samples sourced from 5 different news media organizations.
Our proposed method Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights.
It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR.
arXiv Detail & Related papers (2024-04-15T21:19:10Z)
- ANNA: Abstractive Text-to-Image Synthesis with Filtered News Captions [6.066100464517522]
Real-world image-caption pairs present in domains such as news data do not use simple and directly descriptive captions.
We launch ANNA, an Abstractive News captioNs dAtaset extracted from online news articles in a variety of different contexts.
We show that techniques such as transfer learning achieve limited success in understanding abstractive captions but still fail to consistently learn the relationships between content and context features.
arXiv Detail & Related papers (2023-01-05T17:19:01Z)
- Focus! Relevant and Sufficient Context Selection for News Image Captioning [69.36678144800936]
News Image Captioning requires describing an image by leveraging additional context from a news article.
We propose to use the pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article.
Our experiments demonstrate that by simply selecting a better context from the article, we can significantly improve the performance of existing models (a minimal CLIP-scoring sketch of this selection step follows this list).
arXiv Detail & Related papers (2022-12-01T20:00:27Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [86.4572981982407]
We propose BLIP, a new vision-language framework which transfers flexibly to both vision-language understanding and generation tasks.
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones.
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
arXiv Detail & Related papers (2022-01-28T12:49:48Z)
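As a concrete illustration of the context-selection idea in the "Focus!" entry above, the sketch below scores each article sentence against the image with an off-the-shelf CLIP model and keeps the top-scoring sentences as the captioning context. The checkpoint name, the value of k, and the helper function are illustrative choices, not that paper's exact configuration.

```python
# Sketch: CLIP-based selection of visually grounded article sentences to use
# as context for news image captioning. Checkpoint and k are assumptions.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def select_context(image_path: str, sentences: list[str], k: int = 4) -> list[str]:
    """Return the k article sentences most similar to the image under CLIP."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=sentences, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    scores = outputs.logits_per_image[0]          # shape: (num_sentences,)
    top = torch.topk(scores, k=min(k, len(sentences))).indices.tolist()
    return [sentences[i] for i in sorted(top)]    # keep original article order
```

Keeping the selected sentences in their original order preserves local discourse cues for whichever caption generator consumes them; the entry above reports that better context selection alone improves existing models.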