Generative Timelines for Instructed Visual Assembly
- URL: http://arxiv.org/abs/2411.12293v1
- Date: Tue, 19 Nov 2024 07:26:30 GMT
- Title: Generative Timelines for Instructed Visual Assembly
- Authors: Alejandro Pardo, Jui-Hsien Wang, Bernard Ghanem, Josef Sivic, Bryan Russell, Fabian Caba Heilbron
- Abstract summary: The objective of this work is to manipulate visual timelines (e.g. a video) through natural language instructions.
We propose the Timeline Assembler, a generative model trained to perform instructed visual assembly tasks.
- Score: 106.80501761556606
- License:
- Abstract: The objective of this work is to manipulate visual timelines (e.g. a video) through natural language instructions, making complex timeline editing tasks accessible to non-expert or potentially even disabled users. We call this task Instructed visual assembly. This task is challenging as it requires (i) identifying relevant visual content in the input timeline as well as retrieving relevant visual content in a given input (video) collection, (ii) understanding the input natural language instruction, and (iii) performing the desired edits of the input visual timeline to produce an output timeline. To address these challenges, we propose the Timeline Assembler, a generative model trained to perform instructed visual assembly tasks. The contributions of this work are three-fold. First, we develop a large multimodal language model, which is designed to process visual content, compactly represent timelines and accurately interpret timeline editing instructions. Second, we introduce a novel method for automatically generating datasets for visual assembly tasks, enabling efficient training of our model without the need for human-labeled data. Third, we validate our approach by creating two novel datasets for image and video assembly, demonstrating that the Timeline Assembler substantially outperforms established baseline models, including the recent GPT-4o, in accurately executing complex assembly instructions across various real-world inspired scenarios.
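The abstract does not specify how timelines, instructions, or clip collections are actually encoded, so the sketch below is a purely hypothetical illustration of what one instructed-assembly example might look like, and of how an "insert" edit could be generated automatically in the spirit of the paper's label-free dataset construction; all names, fields, and the instruction template are assumptions.

```python
# Hypothetical illustration of an instructed visual assembly example.
# The exact timeline/instruction format used by the Timeline Assembler is not
# given in the abstract; everything below is an assumed, simplified stand-in.
import random
from dataclasses import dataclass

@dataclass
class AssemblyExample:
    collection: list    # clip IDs available in the input (video) collection
    timeline: list      # current timeline, as an ordered list of clip IDs
    instruction: str    # natural-language editing instruction
    target: list        # expected output timeline after the edit

def make_insert_example(collection, timeline):
    """Hypothetical generator for one 'insert' edit, mirroring the idea of
    building assembly training data automatically, without human labels."""
    new_clip = random.choice([c for c in collection if c not in timeline])
    position = random.randint(0, len(timeline))
    target = timeline[:position] + [new_clip] + timeline[position:]
    instruction = f"Insert clip {new_clip} at position {position} of the timeline."
    return AssemblyExample(collection, timeline, instruction, target)

if __name__ == "__main__":
    collection = [f"clip_{i:02d}" for i in range(8)]
    example = make_insert_example(collection, ["clip_01", "clip_04", "clip_06"])
    print(example.instruction)
    print("input timeline :", example.timeline)
    print("target timeline:", example.target)
```

Under this assumed representation, a model like the Timeline Assembler would receive the collection, the current timeline, and the instruction as input and be trained to produce the target timeline.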
Related papers
- ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding [67.63933036920012]
Existing methods, including proxy encoding and geometry encoding, incorporate additional syntax to encode the object's location.
This study presents ClawMachine, offering a new methodology that notates an entity directly using the visual tokens.
ClawMachine unifies visual referring and grounding into an auto-regressive format and learns with a decoder-only architecture.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- Generative Visual Instruction Tuning [11.727612242016871]
We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model.
We produce GenLLaVA, a Generative Large Language and Visual Assistant.
Our model demonstrates visual understanding capabilities superior to LLaVA and achieves results competitive with native multimodal models.
arXiv Detail & Related papers (2024-06-17T07:06:58Z)
- Text-Conditioned Resampler For Long Form Video Understanding [94.81955667020867]
We present a text-conditioned video resampler (TCR) module that uses a pre-trained visual encoder and a large language model (LLM).
TCR can process more than 100 frames at a time with plain attention and without optimised implementations.
arXiv Detail & Related papers (2023-12-19T06:42:47Z)
- Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
arXiv Detail & Related papers (2023-12-03T14:15:52Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces DirecT2V, a new framework for zero-shot text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z)
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt a pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features (see the sketch below).
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
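The summary above only states that lightweight Transformers are stacked on frame-wise visual features, so the following is a minimal PyTorch sketch of that general idea, not the paper's implementation; the feature dimension, layer count, and mean pooling are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): a small temporal Transformer
# stacked on top of per-frame visual features from a frozen image encoder.
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    def __init__(self, feat_dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim), e.g. frozen per-frame features
        temporal = self.encoder(frame_feats)  # contextualize frames over time
        return temporal.mean(dim=1)           # pooled video-level representation

# Example: 4 videos, 16 frames each, 512-d frame features
video_feat = TemporalEncoder()(torch.randn(4, 16, 512))
print(video_feat.shape)  # torch.Size([4, 512])
```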
- Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences [25.299599341774204]
This paper proposes an approach for the unsupervised learning of actions in untrimmed video sequences based on a joint visual-temporal embedding space.
We show that the proposed approach is able to provide a meaningful visual and temporal embedding out of the visual cues present in contiguous video frames.
arXiv Detail & Related papers (2020-01-29T22:51:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.