Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense
in Text Generation Models
- URL: http://arxiv.org/abs/2109.03892v1
- Date: Wed, 8 Sep 2021 19:38:11 GMT
- Title: Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense
in Text Generation Models
- Authors: Steven Y. Feng, Kevin Lu, Zhuofu Tao, Malihe Alikhani, Teruko
Mitamura, Eduard Hovy, Varun Gangal
- Abstract summary: We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation.
We call our approach VisCTG: Visually Grounded Concept-to-Text Generation.
- Score: 12.488828126859376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the use of multimodal information contained in images as an
effective method for enhancing the commonsense of Transformer models for text
generation. We perform experiments using BART and T5 on concept-to-text
generation, specifically the task of generative commonsense reasoning, or
CommonGen. We call our approach VisCTG: Visually Grounded Concept-to-Text
Generation. VisCTG involves captioning images representing appropriate everyday
scenarios, and using these captions to enrich and steer the generation process.
Comprehensive evaluation and analysis demonstrate that VisCTG noticeably
improves model performance while successfully addressing several issues of the
baseline generations, including poor commonsense, fluency, and specificity.
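Below is a minimal sketch (not the authors' released code) of how the retrieve-caption-generate idea in the abstract could be wired together, assuming a CommonGen-style concept set as input, an off-the-shelf BLIP captioning model (the paper does not name a specific captioner here), and a BART seq2seq model that would first be fine-tuned on CommonGen; the image-retrieval step and the caption-plus-concepts input format are illustrative placeholders.

```python
# Illustrative sketch of a retrieve-caption-generate pipeline (not the
# official VisCTG implementation). Model choices, the retrieval step, and
# the input format are assumptions made for this example.
from PIL import Image
from transformers import (
    BartForConditionalGeneration,
    BartTokenizer,
    BlipForConditionalGeneration,
    BlipProcessor,
)


def retrieve_image(concepts):
    """Placeholder retrieval: return an image of an everyday scene that
    matches the concepts (e.g. looked up in a pre-built image index)."""
    return Image.open("retrieved_scene.jpg")  # hypothetical local file


def caption_image(image):
    """Caption the retrieved image with a pretrained captioner (BLIP is an
    assumption; the abstract only says the images are captioned)."""
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    captioner = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    )
    inputs = processor(images=image, return_tensors="pt")
    output_ids = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)


def generate_sentence(concepts, caption):
    """Use the caption to enrich and steer concept-to-text generation by
    prepending it to the concept set; the exact format is an assumption."""
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
    # In practice the seq2seq model would be fine-tuned on CommonGen first.
    source = caption + " </s> " + " ".join(concepts)
    inputs = tokenizer(source, return_tensors="pt")
    generated = model.generate(**inputs, num_beams=5, max_new_tokens=40)
    return tokenizer.decode(generated[0], skip_special_tokens=True)


if __name__ == "__main__":
    concepts = ["dog", "frisbee", "catch", "throw"]
    image = retrieve_image(concepts)
    caption = caption_image(image)
    print(generate_sentence(concepts, caption))
```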
Related papers
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing [66.33454784945293]
Generation Chain-of-Thought (GoT) is a novel paradigm that enables generation and editing through an explicit language reasoning process.
GoT transforms conventional text-to-image generation and editing into a reasoning-guided framework.
arXiv Detail & Related papers (2025-03-13T17:59:59Z)
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents [66.42579289213941]
Retrieval-augmented generation (RAG) is an effective technique that enables large language models to utilize external knowledge sources for generation.
In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline.
In this pipeline, instead of first parsing the document to obtain text, the document is embedded directly as an image using a VLM and then retrieved to enhance the generation of a VLM.
arXiv Detail & Related papers (2024-10-14T15:04:18Z)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
- RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the use of models that process text input and generate high-fidelity images from text descriptions.
Diffusion models are one prominent type of generative model that synthesizes images by systematically adding noise and learning to reverse the process over repeated steps.
In the era of large models, scaling up model size and integration with large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
- Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork [38.55086153299993]
We develop an approach for text-to-image generation that incorporates additional retrieved images, driven by a combination of an implicit visual guidance loss and generative objectives.
We propose a novel hypernetwork modulated visual-text encoding scheme to predict the weight update of the encoding layer.
Experimental results show that our model guided with additional retrieval visual data outperforms existing GAN-based models.
arXiv Detail & Related papers (2022-08-17T19:25:00Z)
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.