DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
- URL: http://arxiv.org/abs/2404.01342v1
- Date: Sun, 31 Mar 2024 06:28:15 GMT
- Title: DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
- Authors: Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji
- Abstract summary: Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research.
We introduce DiffAgent, an LLM agent designed to select the most suitable T2I model and parameters in seconds via API calls.
Our evaluations reveal that DiffAgent not only excels in identifying the appropriate T2I API but also underscores the effectiveness of the SFTA training framework.
- Score: 90.71963723884944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research. For example, the Civitai community, a platform for T2I innovation, currently hosts an impressive array of 74,492 distinct models. However, this diversity presents a formidable challenge in selecting the most appropriate model and parameters, a process that typically requires numerous trials. Drawing inspiration from the tool usage research of large language models (LLMs), we introduce DiffAgent, an LLM agent designed to screen for the most suitable model and parameters in seconds via API calls. DiffAgent leverages a novel two-stage training framework, SFTA, enabling it to accurately align T2I API responses with user input in accordance with human preferences. To train and evaluate DiffAgent's capabilities, we present DABench, a comprehensive dataset encompassing an extensive range of T2I APIs from the community. Our evaluations reveal that DiffAgent not only excels in identifying the appropriate T2I API but also underscores the effectiveness of the SFTA training framework. Codes are available at https://github.com/OpenGVLab/DiffAgent.
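As a rough sketch of the selection task the abstract describes: given a text prompt and a registry of candidate T2I APIs, pick the best match. The registry, capability tags, and keyword-overlap scorer below are hypothetical stand-ins for DiffAgent's learned, preference-aligned LLM ranking, not its actual implementation.

```python
# Illustrative sketch of text-to-image API selection.
# The candidate registry and the overlap heuristic are invented stand-ins
# for DiffAgent's LLM-based, preference-aligned ranker.

# Hypothetical registry of community T2I APIs with short capability tags.
T2I_APIS = {
    "anime-diffusion-v3": {"anime", "character", "illustration"},
    "photoreal-xl": {"photo", "realistic", "portrait"},
    "pixel-art-gen": {"pixel", "retro", "game"},
}

def select_api(prompt: str) -> str:
    """Return the candidate API whose tags best match the prompt.

    A keyword-overlap score stands in for the ranking that DiffAgent
    learns with its two-stage SFTA framework.
    """
    words = set(prompt.lower().split())
    return max(T2I_APIS, key=lambda name: len(T2I_APIS[name] & words))

best = select_api("a realistic portrait photo of an astronaut")
print(best)  # prints: photoreal-xl
```

In a real system the scorer would be a model call rather than set intersection, but the interface (prompt in, API name out) is the same shape.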
Related papers
- Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
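As a rough illustration of the masked-input idea above, the sketch below hides selected time steps of a 2-D trajectory behind a sentinel value; the mask pattern and sentinel are assumptions for illustration, not UniTraj's actual encoding.

```python
# Illustrative sketch of feeding a trajectory as a masked input, in the
# spirit of "processes arbitrary trajectories as masked inputs" above.
# The sentinel value and visibility set are assumptions for illustration.

MASK = None  # sentinel marking unobserved time steps

def apply_mask(trajectory, visible):
    """Hide every (x, y) point whose time index is not in `visible`."""
    return [pt if i in visible else MASK for i, pt in enumerate(trajectory)]

traj = [(0, 0), (1, 1), (2, 1), (3, 2)]
masked = apply_mask(traj, visible={0, 3})
print(masked)  # prints: [(0, 0), None, None, (3, 2)]
```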
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
- CoSense3D: an Agent-based Efficient Learning Framework for Collective Perception [0.552480439325792]
We propose an agent-based training framework that handles the deep learning modules and agent data separately to have a cleaner data flow structure.
This framework not only provides an API for prototyping the data processing pipeline and defining the gradient calculation for each agent, but also provides the user interface for interactive training, testing and data visualization.
arXiv Detail & Related papers (2024-04-29T11:40:27Z)
- Time Series Representation Learning with Supervised Contrastive Temporal Transformer [8.223940676615857]
We develop a simple yet novel fusion model called Supervised COntrastive Temporal Transformer (SCOTT).
We first investigate suitable augmentation methods for various types of time series data to assist with learning change-invariant representations.
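The augmentation step above can be illustrated with two standard change-invariant time-series augmentations, jittering (additive noise) and scaling (amplitude change); these are common examples of the family of methods investigated, not necessarily the exact recipe SCOTT adopts for each data type.

```python
import random

# Sketch of two common augmentations used to learn change-invariant
# time-series representations in contrastive setups. Which augmentations
# suit which data type is chosen empirically in the paper; jitter and
# scale here are standard examples, not the paper's exact choices.

def jitter(series, sigma=0.03, rng=None):
    """Add small Gaussian noise to every time step."""
    rng = rng or random.Random(0)
    return [x + rng.gauss(0.0, sigma) for x in series]

def scale(series, sigma=0.1, rng=None):
    """Multiply the whole series by one random factor near 1."""
    rng = rng or random.Random(0)
    factor = rng.gauss(1.0, sigma)
    return [x * factor for x in series]

series = [0.0, 0.5, 1.0, 0.5, 0.0]
view_a, view_b = jitter(series), scale(series)  # two positive views
print(len(view_a), len(view_b))
```

The two augmented views of the same series would then form a positive pair for the contrastive objective.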
arXiv Detail & Related papers (2024-03-16T03:37:19Z)
- AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning [100.14685774661959]
AgentOhana aggregates agent trajectories from distinct environments, spanning a wide array of scenarios.
xLAM-v0.1, a large action model tailored for AI agents, demonstrates exceptional performance across various benchmarks.
arXiv Detail & Related papers (2024-02-23T18:56:26Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU Depth V2 and KITTI, and in the semantic segmentation task on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Finding Meaningful Distributions of ML Black-boxes under Forensic Investigation [25.79728190384834]
Given a poorly documented neural network model, we take the perspective of a forensic investigator who wants to find out the model's data domain.
We propose solving this problem by leveraging a comprehensive corpus, such as ImageNet, to select a meaningful distribution.
Our goal is to select a set of samples from the corpus for the given model.
arXiv Detail & Related papers (2023-05-10T03:25:23Z)
- On the Effectiveness of Pretrained Models for API Learning [8.788509467038743]
Developers frequently use APIs to implement certain functionalities, such as parsing Excel files or reading and writing text files line by line.
Developers can greatly benefit from automatic API usage sequence generation based on natural language queries for building applications in a faster and cleaner manner.
Existing approaches utilize information retrieval models to search for matching API sequences given a query or use RNN-based encoder-decoder to generate API sequences.
arXiv Detail & Related papers (2022-04-05T20:33:24Z)
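The retrieval-style approach mentioned above can be sketched minimally as word-overlap matching over stored (description, API sequence) pairs; the corpus entries and scoring below are invented stand-ins for the information-retrieval and pretrained models the paper actually studies.

```python
# Minimal sketch of retrieval-based API sequence search: match a
# natural-language query against stored (description, API sequence)
# pairs by word overlap. The corpus entries are invented examples; real
# systems use learned or TF-IDF similarity over large mined corpora.

API_CORPUS = [
    ("read a text file line by line",
     ["open", "BufferedReader.readLine", "close"]),
    ("parse an excel file",
     ["WorkbookFactory.create", "Sheet.getRow", "Cell.getStringCellValue"]),
]

def retrieve_sequence(query: str) -> list:
    """Return the API sequence whose description best overlaps the query."""
    q = set(query.lower().split())
    _, seq = max(API_CORPUS, key=lambda pair: len(set(pair[0].split()) & q))
    return seq

print(retrieve_sequence("how to read text file line by line"))
# prints: ['open', 'BufferedReader.readLine', 'close']
```

An encoder-decoder approach would instead generate the sequence token by token from the query; the retrieval sketch shows only the search-based baseline.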
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.