In-Video Instructions: Visual Signals as Generative Control
- URL: http://arxiv.org/abs/2511.19401v1
- Date: Mon, 24 Nov 2025 18:38:45 GMT
- Title: In-Video Instructions: Visual Signals as Generative Control
- Authors: Gongfan Fang, Xinyin Ma, Xinchao Wang
- Abstract summary: We investigate whether these capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions. In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. Experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions.
- Score: 79.44662698914401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
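As a rough illustration of how such an in-video instruction might be prepared, the sketch below overlays an arrow and a short text label onto a conditioning frame before handing it to an image-to-video generator. The drawing routine, coordinates, and the commented-out generator call are illustrative assumptions only; the paper does not prescribe specific annotation tooling.

```python
# Minimal sketch of the In-Video Instruction idea: user guidance is drawn directly onto
# the conditioning frame (an arrow plus a text label), and the annotated frame is then
# passed to any off-the-shelf image-to-video model. The generator call at the bottom is
# a hypothetical placeholder, not a real API.
import math
from PIL import Image, ImageDraw, ImageFont

def add_arrow_instruction(frame: Image.Image, start, end, label: str,
                          color=(255, 0, 0), width=6) -> Image.Image:
    """Overlay an arrow from `start` to `end`, with a text label near the arrow tail."""
    annotated = frame.copy()
    draw = ImageDraw.Draw(annotated)

    # Arrow shaft.
    draw.line([start, end], fill=color, width=width)

    # Arrow head: two short segments angled back from the tip.
    angle = math.atan2(end[1] - start[1], end[0] - start[0])
    head_len = 4 * width
    for offset in (math.radians(150), math.radians(-150)):
        hx = end[0] + head_len * math.cos(angle + offset)
        hy = end[1] + head_len * math.sin(angle + offset)
        draw.line([end, (hx, hy)], fill=color, width=width)

    # Text label next to the arrow tail, so it reads as an instruction for that object.
    font = ImageFont.load_default()
    draw.text((start[0] + 10, start[1] - 20), label, fill=color, font=font)
    return annotated

# Example usage: annotate the first frame, then condition a video model on it.
first_frame = Image.open("scene.png").convert("RGB")          # assumed input image
annotated = add_arrow_instruction(first_frame,
                                  start=(220, 380), end=(480, 240),
                                  label="the dog runs toward the gate")
annotated.save("scene_with_instruction.png")
# video = some_image_to_video_model.generate(image=annotated)  # hypothetical call
```

Because each arrow and label is attached to a specific image region, different objects in a multi-object scene can carry different instructions, which is the spatial, unambiguous correspondence the abstract contrasts with global text prompts.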
Related papers
- A Reason-then-Describe Instruction Interpreter for Controllable Video Generation [88.95178842901095]
We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. We show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs.
arXiv Detail & Related papers (2025-11-25T17:59:07Z) - Show Me: Unifying Instructional Image and Video Generation with Diffusion Models [16.324312147741495]
We propose a unified framework that enables image manipulation and video prediction. We introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation.
arXiv Detail & Related papers (2025-11-21T23:24:28Z) - Knowledge-Guided Textual Reasoning for Explainable Video Anomaly Detection via LLMs [0.0]
We introduce Text-based Explainable Video Anomaly Detection (TbVAD), a language-driven framework for weakly supervised video anomaly detection. TbVAD represents video semantics through language, enabling interpretable and knowledge-grounded reasoning. We evaluate TbVAD on two public benchmarks, UCF-Crime and XD-Violence, demonstrating that textual knowledge reasoning provides interpretable and reliable anomaly detection.
arXiv Detail & Related papers (2025-10-30T01:18:55Z) - FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation [55.01077993490845]
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling. We introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework.
arXiv Detail & Related papers (2025-06-20T07:46:40Z) - BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects. We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. We show that our framework is model-agnostic and build BlobGEN-Vid on both U-Net and DiT-based video diffusion models.
arXiv Detail & Related papers (2025-01-13T19:17:06Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events (a generic sketch of this matching step appears after this list).
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation [47.39455910191075]
Video amodal segmentation is a challenging task in computer vision.
Recent studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting.
This paper presents a rethinking of previous works, in particular leveraging supervised signals with an object-centric representation.
arXiv Detail & Related papers (2023-09-23T04:12:02Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - Make It Move: Controllable Image-to-Video Generation with Text Descriptions [69.52360725356601]
The TI2V task aims to generate videos from a static image and a text description.
To address the challenges of this task, we propose a Motion Anchor-based video GEnerator (MAGE) with an innovative motion anchor structure.
Experiments conducted on datasets verify the effectiveness of MAGE and show the appealing potential of the TI2V task.
arXiv Detail & Related papers (2021-12-06T07:00:36Z)
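The Hungarian-matching step mentioned in the VidCoM entry above can be illustrated with a short, generic sketch. The embeddings below are random stand-ins, and the code is not the InsOVER algorithm itself; it only shows how decomposed sub-instructions could be assigned one-to-one to candidate video events via scipy's linear_sum_assignment.

```python
# Generic bipartite matching between sub-instructions and video events: build a pairwise
# cost matrix from (stand-in) embeddings and let the Hungarian algorithm pick the
# lowest-cost one-to-one assignment. Not VidCoM's actual implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
instr_emb = rng.normal(size=(3, 128))   # 3 decomposed sub-instructions (stand-in embeddings)
event_emb = rng.normal(size=(5, 128))   # 5 candidate video events (stand-in embeddings)

# Cosine-distance cost matrix: lower cost means a better instruction/event match.
instr_norm = instr_emb / np.linalg.norm(instr_emb, axis=1, keepdims=True)
event_norm = event_emb / np.linalg.norm(event_emb, axis=1, keepdims=True)
cost = 1.0 - instr_norm @ event_norm.T

row_ind, col_ind = linear_sum_assignment(cost)  # Hungarian matching
for i, j in zip(row_ind, col_ind):
    print(f"sub-instruction {i} -> video event {j} (cost {cost[i, j]:.3f})")
```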