Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era
- URL: http://arxiv.org/abs/2411.09955v2
- Date: Thu, 21 Nov 2024 05:28:10 GMT
- Title: Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era
- Authors: Thanh Tam Nguyen, Zhao Ren, Trinh Pham, Thanh Trung Huynh, Phi Le Nguyen, Hongzhi Yin, Quoc Viet Hung Nguyen
- Abstract summary: Recent strides in instruction-based editing have enabled intuitive interaction with visual content, using natural language as a bridge between user intent and complex editing operations.
We aim to democratize powerful visual editing across various industries, from entertainment to education.
- Abstract: The rapid advancement of large language models (LLMs) and multimodal learning has transformed digital content creation and manipulation. Traditional visual editing tools require significant expertise, limiting accessibility. Recent strides in instruction-based editing have enabled intuitive interaction with visual content, using natural language as a bridge between user intent and complex editing operations. This survey provides an overview of these techniques, focusing on how LLMs and multimodal models empower users to achieve precise visual modifications without deep technical knowledge. By synthesizing over 100 publications, we explore methods from generative adversarial networks to diffusion models, examining multimodal integration for fine-grained content control. We discuss practical applications across domains such as fashion, 3D scene manipulation, and video synthesis, highlighting increased accessibility and alignment with human intuition. Our survey compares existing literature, emphasizing LLM-empowered editing, and identifies key challenges to stimulate further research. We aim to democratize powerful visual editing across various industries, from entertainment to education. Interested readers are encouraged to access our repository at https://github.com/tamlhp/awesome-instruction-editing.
Related papers
- Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models [22.26930296101678]
Existing knowledge editing works primarily focus on text-oriented, coarse-grained scenarios.
We propose a visual-oriented, fine-grained multimodal knowledge editing task that targets precise editing in images with multiple interacting entities.
arXiv Detail & Related papers (2024-11-19T14:49:36Z)
- Visual Prompting in Multimodal Large Language Models: A Survey [95.75225825537528]
Multimodal large language models (MLLMs) equip pre-trained large-language models (LLMs) with visual capabilities.
Visual prompting has emerged for more fine-grained and free-form visual instructions.
This paper focuses on visual prompting, prompt generation, compositional reasoning, and prompt learning.
arXiv Detail & Related papers (2024-09-05T08:47:34Z)
- LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models [60.67899965748755]
We present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder.
Our research suggests visual text understanding remains an open challenge and an efficient visual text encoder is crucial for future successful multimodal systems.
arXiv Detail & Related papers (2024-07-27T05:53:37Z)
- LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing [23.010237004536485]
Large language models (LLMs) can be integrated into the video editing workflow to reduce barriers to beginners.
LAVE is a novel system that provides LLM-powered agent assistance and language-augmented editing features.
Our user study, which included eight participants ranging from novices to proficient editors, demonstrated LAVE's effectiveness.
arXiv Detail & Related papers (2024-02-15T19:53:11Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
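The matching step described above can be illustrated with a minimal sketch: given embeddings for decomposed instruction fragments and candidate video events, a cost matrix of (negated) cosine similarities is solved with the Hungarian algorithm. Function and variable names here are illustrative assumptions, not taken from the VidCoM paper; only the use of Hungarian matching is from the abstract.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instructions_to_events(instr_embs, event_embs):
    """Pair each instruction fragment with a video event by maximizing
    total cosine similarity via the Hungarian algorithm."""
    a = instr_embs / np.linalg.norm(instr_embs, axis=1, keepdims=True)
    b = event_embs / np.linalg.norm(event_embs, axis=1, keepdims=True)
    sim = a @ b.T  # cosine-similarity matrix: fragments x events
    # linear_sum_assignment minimizes cost, so negate to maximize similarity
    rows, cols = linear_sum_assignment(-sim)
    return list(zip(rows.tolist(), cols.tolist()))

# toy example: 2 instruction fragments, 3 candidate events
instr = np.array([[1.0, 0.0], [0.0, 1.0]])
events = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
print(match_instructions_to_events(instr, events))  # → [(0, 0), (1, 1)]
```

When there are more events than fragments, the assignment leaves surplus events unmatched, which is the natural behavior for localizing only the events an instruction refers to.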
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
- Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts [116.05656635044357]
We propose a generic video editing framework called Make-A-Protagonist.
Specifically, we leverage multiple experts to parse source video, target visual and textual clues, and propose a visual-textual-based video generation model.
Results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
arXiv Detail & Related papers (2023-05-15T17:59:03Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.