TalkFashion: Intelligent Virtual Try-On Assistant Based on Multimodal Large Language Model
- URL: http://arxiv.org/abs/2507.05790v1
- Date: Tue, 08 Jul 2025 08:51:56 GMT
- Title: TalkFashion: Intelligent Virtual Try-On Assistant Based on Multimodal Large Language Model
- Authors: Yujie Hu, Xuanyu Zhang, Weiqi Li, Jian Zhang
- Abstract summary: This paper addresses how to achieve multifunctional virtual try-on guided solely by text instructions. We propose TalkFashion, an intelligent try-on assistant that leverages the powerful comprehension capabilities of large language models. With the help of multimodal models, this approach achieves fully automated local edits, enhancing the flexibility of editing tasks.
- Score: 19.347698118395673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Virtual try-on has made significant progress in recent years. This paper addresses how to achieve multifunctional virtual try-on guided solely by text instructions, including full outfit changes and local editing. Previous methods relied primarily on end-to-end networks that perform a single try-on task, lacking versatility and flexibility. We propose TalkFashion, an intelligent try-on assistant that leverages the powerful comprehension capabilities of large language models to analyze user instructions and determine which task to execute, activating the corresponding processing pipeline. Additionally, we introduce an instruction-based local repainting model that eliminates the need for users to manually provide masks. With the help of multimodal models, this approach achieves fully automated local edits, enhancing the flexibility of editing tasks. Experimental results demonstrate better semantic consistency and visual quality compared with current methods.
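The paper's code is not shown here; as a rough illustration of the routing idea in the abstract, the sketch below dispatches a text instruction to either a full outfit-change pipeline or a mask-free local repainting pipeline. All function names are hypothetical, and a keyword heuristic stands in for the actual LLM call.

```python
# Hypothetical stand-in for the LLM call: a real system would send the
# instruction (and chat history) to a multimodal LLM and parse its decision.
def classify_instruction(instruction: str) -> dict:
    text = instruction.lower()
    if any(word in text for word in ("sleeve", "collar", "neckline", "logo")):
        return {"task": "local_edit", "region_hint": text}
    return {"task": "full_outfit_change"}

def full_outfit_pipeline(image, instruction):
    print(f"[full try-on] applying: {instruction!r}")

def local_repaint_pipeline(image, instruction, region_hint):
    # Per the abstract, the repainting model predicts the edit mask itself,
    # so no user-drawn mask is required here.
    print(f"[local edit] hint {region_hint!r}: {instruction!r}")

def try_on_assistant(image, instruction: str):
    """Route a free-form instruction to the matching try-on pipeline."""
    decision = classify_instruction(instruction)
    if decision["task"] == "local_edit":
        local_repaint_pipeline(image, instruction, decision["region_hint"])
    else:
        full_outfit_pipeline(image, instruction)

try_on_assistant(None, "Change the outfit to a red evening dress")
try_on_assistant(None, "Make the sleeves shorter")
```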
Related papers
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals. Recent studies either develop tool-using approaches or unify specific visual tasks into an autoregressive framework, often at the expense of overall multimodal performance. We propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks.
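TPO's exact formulation is in the paper; the PyTorch sketch below shows only the generic pattern of learnable, differentiable preferences over several fine-grained task losses (a softmax over logits), with placeholder loss values.

```python
import torch
import torch.nn as nn

class TaskPreferenceWeights(nn.Module):
    """Learnable, differentiable weights over fine-grained task losses."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.logits, dim=0)  # preferences sum to 1
        return (weights * task_losses).sum()

# Placeholder losses, e.g. from segmentation, grounding, and tracking heads.
losses = torch.stack([torch.tensor(0.9), torch.tensor(0.4), torch.tensor(1.2)])
total = TaskPreferenceWeights(num_tasks=3)(losses)
print(total)  # scalar loss; backprop also updates the preference logits
```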
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generating visual prompts that transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
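One common way to realize a model-agnostic visual prompt (not necessarily TVP's exact design) is a learnable perturbation restricted to the image border; a minimal PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

class BorderVisualPrompt(nn.Module):
    """Learnable additive perturbation confined to a border around the image."""
    def __init__(self, image_size: int = 224, pad: int = 16):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(3, image_size, image_size))
        mask = torch.zeros(1, image_size, image_size)
        mask[:, :pad, :] = 1.0   # top border
        mask[:, -pad:, :] = 1.0  # bottom border
        mask[:, :, :pad] = 1.0   # left border
        mask[:, :, -pad:] = 1.0  # right border
        self.register_buffer("mask", mask)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return images + self.delta * self.mask

prompt = BorderVisualPrompt()
batch = torch.rand(2, 3, 224, 224)
print(prompt(batch).shape)  # torch.Size([2, 3, 224, 224])
```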
arXiv Detail & Related papers (2024-04-17T09:39:07Z)
- Helping Language Models Learn More: Multi-dimensional Task Prompt for Few-shot Tuning [36.14688633670085]
We propose MTPrompt, a multi-dimensional task prompt learning method based on task-related object, summary, and task description information.
By automatically building and searching for appropriate prompts, MTPrompt achieves the best results in the few-shot setting across five different datasets.
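A minimal sketch of assembling such a three-dimensional prompt; the template and field names are illustrative, not MTPrompt's actual format:

```python
def build_task_prompt(task_object: str, summary: str, description: str, x: str) -> str:
    """Assemble a prompt from object, summary, and task-description dimensions."""
    return (
        f"Object: {task_object}\n"
        f"Summary: {summary}\n"
        f"Task: {description}\n"
        f"Input: {x}\n"
        f"Answer:"
    )

print(build_task_prompt(
    task_object="movie review",
    summary="binary sentiment classification",
    description="Decide whether the review is positive or negative.",
    x="A gripping, beautifully shot film.",
))
```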
arXiv Detail & Related papers (2023-12-13T10:00:44Z)
- Intelligent Virtual Assistants with LLM-based Process Automation [31.275267197246595]
This paper proposes a novel LLM-based virtual assistant that can automatically perform multi-step operations within mobile apps based on high-level user requests.
The system represents an advance in assistants by providing an end-to-end solution for parsing instructions, reasoning about goals, and executing actions.
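A hypothetical plan-then-act loop capturing this end-to-end pattern; in a real system `llm_plan` would query an LLM and `execute` would drive the device UI:

```python
def llm_plan(request: str) -> list[str]:
    # Placeholder plan; a real implementation would ask an LLM to
    # decompose the request into concrete UI steps.
    return [
        "open app 'Clock'",
        "tap 'Alarm'",
        "create alarm 07:00",
    ]

def execute(step: str) -> None:
    print(f"executing UI action: {step}")  # would drive the mobile UI

def assistant(request: str) -> None:
    """Parse a high-level request into steps and execute them in order."""
    for step in llm_plan(request):
        execute(step)

assistant("Wake me up at 7 tomorrow")
```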
arXiv Detail & Related papers (2023-12-04T07:51:58Z)
- InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation [59.24938416319019]
InstructSeq is an instruction-conditioned multi-modal modeling framework.
It unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data.
arXiv Detail & Related papers (2023-11-30T18:59:51Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
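A toy sketch of the annotation-free data recipe: captions stand in for images, and a text-only prompt asks a language model to write the multi-turn conversation. The template and `<imgN>` placeholders are illustrative, not TextBind's actual format.

```python
def conversation_prompt(captions: list[str]) -> str:
    """Build a prompt asking an LLM to invent a multi-turn multimodal dialogue."""
    lines = [
        "You are shown several images, described by their captions.",
        "Write a natural multi-turn conversation between a user and an",
        "assistant that refers to these images as <img0>, <img1>, ...",
        "",
    ]
    for i, caption in enumerate(captions):
        lines.append(f"<img{i}>: {caption}")
    return "\n".join(lines)

print(conversation_prompt([
    "a corgi on a beach",
    "the same corgi wearing sunglasses",
]))
```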
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT [58.70209492842953]
In this paper, we focus on adapting instruction-tuning-based prompt design to a vision transformer (ViT) model for image classification.
The key idea is to implement multi-modal prompts related to category information to guide the fine-tuning of the model.
Experiments on several image captioning tasks show improved performance and domain adaptability.
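Prompt tokens for a ViT are typically prepended to the patch-embedding sequence; a minimal PyTorch sketch of that mechanism (dimensions and names are illustrative, not Instruction-ViT's exact design):

```python
import torch
import torch.nn as nn

class PromptedViTInput(nn.Module):
    """Prepend learnable prompt tokens to ViT patch embeddings."""
    def __init__(self, dim: int = 768, num_prompts: int = 8):
        super().__init__()
        # In a multimodal variant these could be derived from category text.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.size(0)
        return torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)

tokens = torch.randn(4, 196, 768)   # 14x14 patches from a 224px image
print(PromptedViTInput()(tokens).shape)  # torch.Size([4, 204, 768])
```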
arXiv Detail & Related papers (2023-04-29T08:59:12Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks that is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
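The sketch below illustrates the general weight-sharing idea with a shared Transformer stack run once with full attention (contrastive-style features) and once with a causal mask (generation). It is a generic illustration under those assumptions, not MaMMUT's architecture.

```python
import torch
import torch.nn as nn

# One shared stack, two passes: bidirectional attention yields text features
# for a contrastive objective; causal attention serves a generative objective.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
shared = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randn(4, 16, 256)                      # token embeddings
causal = torch.triu(torch.full((16, 16), float("-inf")), diagonal=1)

contrastive_feats = shared(tokens)                    # full-attention pass
generative_feats = shared(tokens, mask=causal)        # causal pass
print(contrastive_feats.shape, generative_feats.shape)
```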
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- Dynamic Prompting: A Unified Framework for Prompt Tuning [33.175097465669374]
We present a unified dynamic prompt (DP) tuning strategy that dynamically determines different factors of prompts based on specific tasks and instances.
Experimental results underscore the significant performance improvement achieved by dynamic prompt tuning across a wide range of tasks.
We establish the universal applicability of our approach under full-data, few-shot, and multitask scenarios.
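One way to make a prompt instance-dependent (a simplified stand-in for DP's full design, which also varies factors such as prompt length and position) is to generate the soft prompt from a summary of the input:

```python
import torch
import torch.nn as nn

class DynamicPrompt(nn.Module):
    """Generate an instance-conditioned soft prompt (fixed length here)."""
    def __init__(self, dim: int = 512, prompt_len: int = 4):
        super().__init__()
        self.prompt_len = prompt_len
        self.generator = nn.Linear(dim, prompt_len * dim)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        pooled = token_embeds.mean(dim=1)  # per-instance summary vector
        prompt = self.generator(pooled).view(
            -1, self.prompt_len, token_embeds.size(-1)
        )
        return torch.cat([prompt, token_embeds], dim=1)  # prepend per instance

x = torch.randn(2, 10, 512)
print(DynamicPrompt()(x).shape)  # torch.Size([2, 14, 512])
```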
arXiv Detail & Related papers (2023-03-06T06:04:46Z)
- Prompt Tuning with Soft Context Sharing for Vision-Language Models [42.61889428498378]
We propose a novel method to tune pre-trained vision-language models on multiple target few-shot tasks jointly.
We show that SoftCPT significantly outperforms single-task prompt tuning methods.
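A minimal sketch of the soft-context-sharing idea: a single shared meta-network maps each task's embedding to that task's soft prompt, so knowledge transfers across the jointly tuned tasks. Names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class SharedPromptMetaNet(nn.Module):
    """One shared network produces a per-task soft prompt from a task embedding."""
    def __init__(self, task_dim: int = 128, prompt_len: int = 16, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(task_dim, 256), nn.ReLU(),
            nn.Linear(256, prompt_len * dim),
        )
        self.prompt_len, self.dim = prompt_len, dim

    def forward(self, task_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(task_embedding).view(-1, self.prompt_len, self.dim)

task_embeddings = torch.randn(5, 128)   # e.g. encoded names of 5 few-shot tasks
prompts = SharedPromptMetaNet()(task_embeddings)
print(prompts.shape)  # torch.Size([5, 16, 512]); all tasks share the meta-net
```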
arXiv Detail & Related papers (2022-08-29T10:19:10Z)
- Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation [86.26522210882699]
We propose UniVL, a unified multimodal pre-training framework for both vision-language understanding and generation.
The proposed UniVL is capable of handling both understanding tasks and generative tasks.
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model.
arXiv Detail & Related papers (2021-12-10T14:59:06Z)