Related papers: Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

URL: http://arxiv.org/abs/2405.05852v1
Date: Thu, 9 May 2024 15:39:54 GMT
Title: Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
Authors: Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, Tim G. J. Rudner
Abstract summary: Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts. We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
Score: 73.6361029556484
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.

Related papers

Seeing to Generalize: How Visual Data Corrects Binding Shortcuts [5.724899979571379]
Vision Language Models can outperform their underlying Large Language Models on purely text-only tasks. We show that visual training changes the model's internal binding strategy. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.
arXiv Detail & Related papers (2026-02-16T20:43:12Z)
Exploring Conditions for Diffusion models in Robotic Control [70.27711404291573]
We explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control. We find that naively applying textual conditions yields minimal or even negative gains in control tasks. We propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details.
arXiv Detail & Related papers (2025-10-17T10:24:14Z)
Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization [75.88719716002014]
Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Recent advances in pre-trained Visual Foundation Models (VFMs) have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. We propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM.
arXiv Detail & Related papers (2025-07-03T03:52:37Z)
Multimodal Prompt Alignment for Facial Expression Recognition [24.470095812039286]
MPA-FER provides fine-grained semantic guidance to the learning process of prompted visual features. Our framework outperforms state-of-the-art methods on three FER benchmark datasets.
arXiv Detail & Related papers (2025-06-26T05:28:57Z)
ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning [76.2503352325492]
ControlThinker is a novel framework that employs a "comprehend-then-generate" paradigm. Latent semantics from control images are mined to enrich text prompts. This enriched semantic understanding then seamlessly aids in image generation without the need for additional complex modifications.
arXiv Detail & Related papers (2025-06-04T05:56:19Z)
EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models [31.31018600797305]
We propose a prompt inversion technique called EDITOR for text-to-image diffusion models. Our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability.
arXiv Detail & Related papers (2025-06-03T16:44:15Z)
LLM-guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps [5.836227628651603]
We propose a pipeline leveraging Large Language Models, open-vocabulary detectors, cross-attention maps and diffusion U-Net for instance-level image manipulation. Our method detects objects mentioned in the prompt and present in the generated image, enabling precise manipulation without extensive training or input masks.
arXiv Detail & Related papers (2025-01-23T19:26:14Z)
Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
Debiasing Vision-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment [23.36770607997754]
We propose CLIPtone, an unsupervised learning-based approach for text-based image tone adjustment. Our approach's efficacy is demonstrated through comprehensive experiments, including a user study.
arXiv Detail & Related papers (2024-04-01T13:57:46Z)
Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study. iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector. We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
arXiv Detail & Related papers (2023-12-03T14:15:52Z)
Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks [64.67735676127208]
Text-to-image diffusion models have shown great potential for benefiting image recognition. Although promising, there has been inadequate exploration dedicated to unsupervised learning on diffusion-generated images. We introduce customized solutions by fully exploiting the aforementioned free attention masks.
arXiv Detail & Related papers (2023-08-13T10:07:46Z)
Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. We show that VPD can be adapted faster to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation. Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
Curious Representation Learning for Embodied Intelligence [81.21764276106924]
Self-supervised representation learning has achieved remarkable success in recent years. Yet to build truly intelligent agents, we must construct representation learning algorithms that can learn from environments. We propose a framework, curious representation learning, which jointly learns a reinforcement learning policy and a visual representation model.
arXiv Detail & Related papers (2021-05-03T17:59:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.