Image Content Generation with Causal Reasoning
- URL: http://arxiv.org/abs/2312.07132v1
- Date: Tue, 12 Dec 2023 10:07:16 GMT
- Title: Image Content Generation with Causal Reasoning
- Authors: Xiaochuan Li, Baoyu Fan, Runze Zhang, Liang Jin, Di Wang, Zhenhua Guo,
Yaqian Zhao, Rengang Li
- Abstract summary: ChatGPT has once again sparked research in generative artificial intelligence (GAI).
In visual modality, there is currently no equivalent research.
We propose a new image generation task called visual question answering with image (VQAI).
- Score: 17.89980837508069
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The emergence of ChatGPT has once again sparked research in generative
artificial intelligence (GAI). While people have been amazed by the generated
results, they have also noticed the reasoning potential reflected in the
generated textual content. However, this current ability for causal reasoning
is primarily limited to the domain of language generation, such as in models
like GPT-3. In visual modality, there is currently no equivalent research.
Considering causal reasoning in visual content generation is significant. This
is because visual information contains infinite granularity. Particularly,
images can provide more intuitive and specific demonstrations for certain
reasoning tasks, especially when compared to coarse-grained text. Hence, we
propose a new image generation task called visual question answering with image
(VQAI) and establish a dataset of the same name based on the classic
Tom and Jerry animated series. Additionally, we develop a new paradigm
for image generation to tackle the challenges of this task. Finally, we perform
extensive experiments and analyses, including visualizations of the generated
content and discussions on the potentials and limitations. The code and dataset
are publicly available under the CC BY-NC-SA 4.0 license for academic and
non-commercial usage at:
https://github.com/IEIT-AGI/MIX-Shannon/blob/main/projects/VQAI/lgd_vqai.md.
Related papers
- PixelArena: A benchmark for Pixel-Precision Visual Intelligence [2.8513276675793855]
In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision.
We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings.
arXiv Detail & Related papers (2025-12-18T08:41:27Z) - Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation [79.31152006811438]
Thinking-while-Generating (TwiG) is the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process.
To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning, and reinforcement learning.
arXiv Detail & Related papers (2025-11-20T18:59:52Z) - The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights [26.85150689408895]
We show that existing multimodal mathematical models minimally leverage visual information.
We attribute this to the dominance of textual information and answer options that inadvertently guide the model to correct answers.
In testing leading models, their failure to detect subtle visual differences suggests limitations in current visual perception capabilities.
arXiv Detail & Related papers (2025-03-06T07:29:33Z) - Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z) - Evaluating Text-to-Visual Generation with Image-to-Text Generation [113.07368313330994]
VQAScore uses a visual-question-answering (VQA) model to produce an alignment score.
It produces state-of-the-art results across many (8) image-text alignment benchmarks.
We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts.
arXiv Detail & Related papers (2024-04-01T17:58:06Z) - Spatial-Semantic Collaborative Cropping for User Generated Content [32.490403964193014]
A large amount of User Generated Content (UGC) is uploaded to the Internet daily and displayed to people world-wide.
Previous methods merely consider the aesthetics of the cropped images while ignoring the content integrity, which is crucial for cropping.
We propose a Spatial-Semantic Collaborative cropping network (S2CNet) for arbitrary user generated content accompanied by a new cropping benchmark.
arXiv Detail & Related papers (2024-01-16T03:25:12Z) - Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning.
We study generative VLMs that are trained for next-word generation given an image.
We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
arXiv Detail & Related papers (2023-06-02T19:19:43Z) - Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot
Image Captioning [153.98100182439165]
We introduce a Retrieval-augmented Visual Language Model, Re-ViLM, built upon Flamingo.
By storing certain knowledge explicitly in the external database, our approach reduces the number of model parameters.
We demonstrate that Re-ViLM significantly boosts performance for image-to-text generation tasks.
arXiv Detail & Related papers (2023-02-09T18:57:56Z) - Look, Read and Ask: Learning to Ask Questions by Reading Text in Images [3.3972119795940525]
We present a novel problem of text-based visual question generation or TextVQG.
To address TextVQG, we present an OCR consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question.
arXiv Detail & Related papers (2022-11-23T13:52:46Z) - Multi-VQG: Generating Engaging Questions for Multiple Images [9.965853054511165]
We propose generating engaging questions from multiple images.
Results show that building stories behind the image sequence enables models to generate engaging questions.
These results open up an exciting challenge for visual-and-language models to implicitly construct a story behind a series of photos.
arXiv Detail & Related papers (2022-11-14T15:15:00Z) - Visualize Before You Write: Imagination-Guided Open-Ended Text
Generation [68.96699389728964]
We propose iNLG that uses machine-generated images to guide language models in open-ended text generation.
Experiments and analyses demonstrate the effectiveness of iNLG on open-ended text generation tasks.
arXiv Detail & Related papers (2022-10-07T18:01:09Z) - Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - One-shot Scene Graph Generation [130.57405850346836]
We propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task.
Our method outperforms existing state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-02-22T11:32:59Z) - CIGLI: Conditional Image Generation from Language & Image [5.159265382427163]
We propose a new task called CIGLI: Conditional Image Generation from Language and Image.
Instead of generating an image based on text as in text-image generation, this task requires the generation of an image from a textual description and an image prompt.
arXiv Detail & Related papers (2021-08-20T00:58:42Z) - VisualMRC: Machine Reading Comprehension on Document Images [4.057968826847943]
Given a question and a document image, a machine reads and comprehends texts in the image to answer the question in natural language.
VisualMRC focuses more on developing natural language understanding and generation abilities.
It contains 30,000+ pairs of a question and an abstractive answer for 10,000+ document images sourced from multiple domains of webpages.
arXiv Detail & Related papers (2021-01-27T09:03:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.