CAISE: Conversational Agent for Image Search and Editing
- URL: http://arxiv.org/abs/2202.11847v1
- Date: Thu, 24 Feb 2022 00:55:52 GMT
- Title: CAISE: Conversational Agent for Image Search and Editing
- Authors: Hyounghun Kim, Doo Soon Kim, Seunghyun Yoon, Franck Dernoncourt, Trung
Bui, Mohit Bansal
- Abstract summary: We propose a dataset of an automated Conversational Agent for Image Search and Editing (CAISE).
To our knowledge, this is the first dataset that provides conversational image search and editing annotations.
The functions that the assistant-annotators conduct with the tool are recorded as executable commands.
- Score: 109.57721903485663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Demand for image editing has been increasing along with users' desire for
expression. However, for most users, image editing tools are not easy
to use since the tools require certain expertise in photo effects and have
complex interfaces. Hence, users might need someone to help edit their images,
but having a personal dedicated human assistant for every user is impossible to
scale. For that reason, an automated assistant system for image editing is
desirable. Additionally, users want more image sources for diverse image
editing work, and integrating image search functionality into the editing
tool is a potential remedy for this demand. Thus, we propose a dataset of an
automated Conversational Agent for Image Search and Editing (CAISE). To our
knowledge, this is the first dataset that provides conversational image search
and editing annotations, where the agent holds a grounded conversation with
users and helps them to search and edit images according to their requests. To
build such a system, we first collect image search and editing conversations
between pairs of annotators. The assistant-annotators are equipped with a
customized image search and editing tool to address the requests from the
user-annotators. The functions that the assistant-annotators conduct with the
tool are recorded as executable commands, allowing the trained system to be
useful for real-world application execution. We also introduce a
generator-extractor baseline model for this task, which can adaptively select
the source of the next token (i.e., from the vocabulary or from textual/visual
contexts) for the executable command. This serves as a strong starting point
while still leaving a large human-machine performance gap for useful future
work. Our code and dataset are publicly available at:
https://github.com/hyounghk/CAISE
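The generator-extractor baseline described in the abstract decides, token by token, whether the next piece of the executable command is generated from a fixed vocabulary or extracted (copied) from the textual/visual context. The abstract gives no implementation details, so the following is only a minimal pointer-generator-style sketch of that adaptive source selection; the class name, tensor shapes, and gating scheme are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a generator-extractor decoding step: a learned gate mixes
# a vocabulary ("generator") distribution with a copy ("extractor") distribution
# over context tokens. Names and shapes are illustrative, not the CAISE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneratorExtractorHead(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)   # generator path
        self.copy_attn = nn.Linear(hidden_dim, hidden_dim)    # extractor (copy) path
        self.source_gate = nn.Linear(hidden_dim, 1)           # chooses between the two

    def forward(self, dec_state, ctx_states, ctx_token_ids):
        # dec_state:     (batch, hidden)          decoder state for the next token
        # ctx_states:    (batch, ctx_len, hidden) encoded textual/visual context
        # ctx_token_ids: (batch, ctx_len)         LongTensor of context-token vocab ids
        gen_dist = F.softmax(self.vocab_proj(dec_state), dim=-1)           # (batch, vocab)

        # attention over context tokens gives the copy distribution
        scores = torch.bmm(ctx_states, self.copy_attn(dec_state).unsqueeze(-1)).squeeze(-1)
        copy_weights = F.softmax(scores, dim=-1)                           # (batch, ctx_len)
        copy_dist = torch.zeros_like(gen_dist).scatter_add(1, ctx_token_ids, copy_weights)

        # gate decides, per step, how much to generate vs. extract
        p_gen = torch.sigmoid(self.source_gate(dec_state))                 # (batch, 1)
        return p_gen * gen_dist + (1.0 - p_gen) * copy_dist                # (batch, vocab)

# Example usage with toy shapes (batch=2, ctx_len=5, hidden=16, vocab=100):
head = GeneratorExtractorHead(hidden_dim=16, vocab_size=100)
next_token_dist = head(
    dec_state=torch.randn(2, 16),
    ctx_states=torch.randn(2, 5, 16),
    ctx_token_ids=torch.randint(0, 100, (2, 5)),
)
```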
Related papers
- Enhancing Intent Understanding for Ambiguous Prompts through Human-Machine Co-Adaptation [20.954269395301885]
We propose a human-machine co-adaptation strategy using mutual information between the user's prompts and the pictures under modification.
We find that an improved model can reduce the necessity for multiple rounds of adjustments.
arXiv Detail & Related papers (2025-01-25T10:32:00Z)
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users.
Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations [109.65267337037842]
We introduce the task of Image Editing Recommendation (IER).
IER aims to automatically generate diverse creative editing instructions from an input image and a simple prompt representing the users' under-specified editing purpose.
We introduce Creativity-Vision Language Assistant (Creativity-VLA), a multimodal framework designed specifically for edit-instruction generation.
arXiv Detail & Related papers (2024-05-31T18:22:29Z)
- Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation [72.6168579583414]
CompAgent is a training-free approach for compositional text-to-image generation with a large language model (LLM) agent as its core.
Our approach achieves more than 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation.
arXiv Detail & Related papers (2024-01-28T16:18:39Z)
- The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model [4.531548217880843]
We introduce an innovative user intent expansion framework for image search.
Our framework leverages visual-language models to parse and compose multi-modal user inputs.
The proposed framework significantly improves users' image search experience.
arXiv Detail & Related papers (2023-12-04T06:14:25Z)
- Edit As You Wish: Video Caption Editing with Multi-grained User Control [61.76233268900959]
We propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests.
Inspired by human writing-revision habits, we design the user command as a pivotal triplet (operation, position, attribute) to cover diverse user needs from coarse-grained to fine-grained.
arXiv Detail & Related papers (2023-05-15T07:12:19Z)
- CHATEDIT: Towards Multi-turn Interactive Facial Image Editing via Dialogue [17.503012018823902]
This paper introduces the ChatEdit benchmark dataset for evaluating image editing and conversation abilities.
ChatEdit is constructed from the CelebA-HQ dataset, incorporating annotated multi-turn dialogues corresponding to user edit requests on the images.
We present a novel baseline framework that integrates a dialogue module for both tracking user requests and generating responses.
arXiv Detail & Related papers (2023-03-20T13:45:58Z)
- NICER: Aesthetic Image Enhancement with Humans in the Loop [0.7756211500979312]
This work proposes a neural-network-based approach to no-reference image enhancement in a fully automatic, semi-automatic, or fully manual process.
We show that NICER can improve image aesthetics without user interaction and that allowing user interaction leads to diverse enhancement outcomes.
arXiv Detail & Related papers (2020-12-03T09:14:10Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)