TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
- URL: http://arxiv.org/abs/2510.07940v1
- Date: Thu, 09 Oct 2025 08:37:00 GMT
- Title: TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
- Authors: Leigang Qu, Ziyang Wang, Na Zheng, Wenjie Wang, Liqiang Nie, Tat-Seng Chua
- Abstract summary: Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios. We introduce Test-Time Optimization and Memorization (TTOM) to align VFM outputs with spatiotemporal layouts during inference for better text-image alignment. We find that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization.
- Score: 102.55214293086863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than directly intervening on latents or attention per sample, as in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and VBench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.
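The abstract names three mechanisms: new parameters injected and optimized at inference time, a layout-attention objective that pulls an entity's attention mass into its layout region, and a parametric memory that stores optimization contexts across a streaming generation with insert, read, update, and delete operations. The paper's implementation details are not given here, so the PyTorch sketch below is purely illustrative: `ParametricMemory`, `layout_attention_loss`, `test_time_optimize`, and the toy attention map are hypothetical stand-ins, not the authors' code.

```python
import torch


class ParametricMemory:
    """Minimal key-value store for per-prompt optimization contexts.

    Keys are layout descriptors (e.g., a prompt/layout identifier) and values
    are the parameter tensors optimized in earlier streaming steps.
    """

    def __init__(self):
        self._store = {}

    def insert(self, key, params):
        self._store[key] = [p.detach().clone() for p in params]

    def read(self, key):
        return self._store.get(key)

    def update(self, key, params):
        if key in self._store:
            self._store[key] = [p.detach().clone() for p in params]

    def delete(self, key):
        self._store.pop(key, None)


def layout_attention_loss(attn, mask):
    """Encourage an entity's attention mass to fall inside its layout box.

    attn: (frames, height, width) attention map for one entity token.
    mask: (frames, height, width) binary mask of the target layout region.
    """
    inside = (attn * mask).sum()
    total = attn.sum() + 1e-6
    return 1.0 - inside / total


def test_time_optimize(attn_fn, extra_params, mask, steps=20, lr=1e-2):
    """Optimize only the injected parameters against the layout objective."""
    opt = torch.optim.Adam(extra_params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = layout_attention_loss(attn_fn(extra_params), mask)
        loss.backward()
        opt.step()
    return extra_params


if __name__ == "__main__":
    # Toy stand-in for a VFM cross-attention map: a base map plus a learned offset.
    base = torch.rand(4, 16, 16)                         # 4 frames, 16x16 spatial grid
    mask = torch.zeros(4, 16, 16)
    mask[:, 4:12, 4:12] = 1.0                            # target layout box

    delta = torch.zeros(4, 16, 16, requires_grad=True)   # injected parameters
    attn_fn = lambda params: torch.softmax(
        (base + params[0]).flatten(1), dim=-1
    ).view_as(base)

    memory = ParametricMemory()
    params = test_time_optimize(attn_fn, [delta], mask)
    memory.insert("prompt#0/layout#0", params)           # reusable in later streaming steps
    print(layout_attention_loss(attn_fn(params), mask).item())
```

Under this reading, only the injected `delta` receives gradients, leaving the frozen VFM weights untouched, and memorized parameters can seed later prompts that share a layout, which is one plausible way the reported transferability could be exercised.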
Related papers
- Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information. Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z) - FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract...
arXiv Detail & Related papers (2025-09-28T17:59:43Z) - DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models [103.18486625853099]
DEFT, Decompositional Efficient Fine-Tuning, adapts a pre-trained weight matrix by decomposing its update into two components. We conduct experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework.
arXiv Detail & Related papers (2025-09-26T18:01:15Z) - STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing [35.50656689789427]
STR-Match is a training-free video editing system that produces visually appealing and coherent videos. STR-Match consistently outperforms existing methods in both visual quality and temporal consistency.
arXiv Detail & Related papers (2025-06-28T12:36:19Z) - BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation [47.21414443162965]
We propose an autoregressive structure and texture propagation module (STPM) for customized text-to-video (CT2V) generation. STPM extracts key structural and texture features from the reference subject and injects them autoregressively into each video frame to enhance consistency. We also introduce a test-time reward optimization (TTRO) method to further refine fine-grained details.
arXiv Detail & Related papers (2025-05-11T14:11:12Z) - VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models [48.00262713744499]
VideoComp is a benchmark and learning framework for advancing video-text compositionality understanding. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences.
arXiv Detail & Related papers (2025-04-04T22:24:30Z) - MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation [19.340437669928814]
MagicComp is a training-free method that enhances T2V generation through dual-phase refinement. MagicComp is a model-agnostic and versatile approach, which can be seamlessly integrated into existing T2V architectures.
arXiv Detail & Related papers (2025-03-18T17:02:14Z) - TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation [97.96178992465511]
We argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses.
To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics.
arXiv Detail & Related papers (2024-06-12T21:41:32Z) - STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)