Related papers: Generated Contents Enrichment

Generated Contents Enrichment

URL: http://arxiv.org/abs/2405.03650v2
Date: Tue, 11 Jun 2024 17:12:26 GMT
Title: Generated Contents Enrichment
Authors: Mahdi Naseri, Jiayan Qiu, Zhou Wang,
Abstract summary: We investigate a novel artificial intelligence generation task, termed as generated contents enrichment (GCE) Our proposed GCE strives to perform content enrichment explicitly on both the visual and textual domain. Our experiments conducted on the Visual Genome dataset exhibit promising and visually plausible results.
Score: 11.196681396888536
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we investigate a novel artificial intelligence generation task, termed as generated contents enrichment (GCE). Different from conventional artificial intelligence contents generation task that enriches the given textual description implicitly with limited semantics for generating visually real content, our proposed GCE strives to perform content enrichment explicitly on both the visual and textual domain, from which the enriched contents are visually real, structurally reasonable, and semantically abundant. Towards to solve GCE, we propose a deep end-to-end method that explicitly explores the semantics and inter-semantic relationships during the enrichment. Specifically, we first model the input description as a semantic graph, wherein each node represents an object and each edge corresponds to the inter-object relationship. We then adopt Graph Convolutional Networks on top of the input scene description to predict the enriching objects and their relationships with the input objects. Finally, the enriched description is fed into an image synthesis model to carry out the visual contents generation. Our experiments conducted on the Visual Genome dataset exhibit promising and visually plausible results.

Related papers

Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation [14.82606425343802]
Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. Existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment. We propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information.
arXiv Detail & Related papers (2024-12-26T02:12:37Z)
VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification. In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion. In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
Structure Your Data: Towards Semantic Graph Counterfactuals [1.8817715864806608]
Counterfactual explanations (CEs) based on concepts are explanations that consider alternative scenarios to understand which high-level semantic features contributed to model predictions. In this work, we propose CEs based on the semantic graphs accompanying input data to achieve more descriptive, accurate, and human-aligned explanations.
arXiv Detail & Related papers (2024-03-11T08:40:37Z)
Let the Chart Spark: Embedding Semantic Context into Chart with Text-to-Image Generative Model [7.587729429265939]
Pictorial visualization seamlessly integrates data and semantic context into visual representation. We propose ChartSpark, a novel system that embeds semantic context into chart based on text-to-image generative model. We develop an interactive visual interface that integrates a text analyzer, editing module, and evaluation module to enable users to generate, modify, and assess pictorial visualizations.
arXiv Detail & Related papers (2023-04-28T05:18:30Z)
Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts. We introduce LO, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level. Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z)
Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning. To cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions. We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
arXiv Detail & Related papers (2021-08-14T04:00:42Z)
MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting, generating a realistic image from the objects and captions. Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections. A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z)
Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework. To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning. Explicitly, we build semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information. Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks [5.840117063192334]
We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information of objects from the text input. We train the extracted objects, attributes, and relations in the scene graph and the corresponding geometric relation information using Graph Convolutional Networks. The text representation is aggregated with word-level and sentence-level embedding to generate both visual contextual word and sentence representation.
arXiv Detail & Related papers (2020-10-07T05:25:30Z)
GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions. The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
Weakly Supervised Visual Semantic Parsing [49.69377653925448]
Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images. Existing SGG methods require millions of manually annotated bounding boxes for training. We propose Visual Semantic Parsing, VSPNet, and graph-based weakly supervised learning framework.
arXiv Detail & Related papers (2020-01-08T03:46:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.