Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos
- URL: http://arxiv.org/abs/2508.07330v2
- Date: Sat, 16 Aug 2025 06:55:14 GMT
- Title: Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos
- Authors: Tuyen Tran, Thao Minh Le, Quang-Hung Le, Truyen Tran
- Abstract summary: Planner-Refiner is a framework to bridge semantic gaps between language and vision. A Planner module schedules language guidance by decomposing complex linguistic prompts. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens' self-attention across space then time.
- Score: 13.618454017248801
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements' space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens' self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner's effectiveness on two video-language alignment tasks: Referring Video Object Segmentation and Temporal Grounding with varying language complexity. We further introduce a new MeViS-X benchmark to assess models' capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach's potential, especially for complex prompts.
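The abstract outlines the core loop: a Planner decomposes the prompt into an ordered chain of short noun-phrase/verb-phrase sentences, and a Refiner applies one language-conditioned refinement step per sentence, attending over space and then time, with the refined tokens carried forward recurrently. The following is a minimal sketch of that scheme, not the authors' implementation: the class and function names (`SpaceTimeRefiner`, `planner_refiner`), the additive phrase conditioning, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SpaceTimeRefiner(nn.Module):
    """One refinement step: language-conditioned spatial attention, then temporal attention (sketch)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_proj = nn.Linear(dim, dim)

    def forward(self, vis: torch.Tensor, phrase: torch.Tensor) -> torch.Tensor:
        # vis: (B, T, N, D) visual tokens; phrase: (B, D) pooled short-sentence embedding (assumed interface)
        B, T, N, D = vis.shape
        cond = self.lang_proj(phrase)[:, None, None, :]   # broadcast language guidance over frames and tokens
        x = vis + cond                                     # inject the phrase into every visual token

        # Spatial self-attention: tokens attend within each frame
        xs = x.reshape(B * T, N, D)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(B, T, N, D)                     # residual update

        # Temporal self-attention: each spatial location attends across frames
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x


def planner_refiner(vis_tokens, phrase_chain, refiner):
    """Recurrently apply one refinement step per short sentence scheduled by the Planner."""
    for phrase in phrase_chain:            # ordered noun-phrase/verb-phrase pair embeddings
        vis_tokens = refiner(vis_tokens, phrase)
    return vis_tokens                      # final representation fed into a task-specific head


if __name__ == "__main__":
    B, T, N, D = 2, 8, 16, 256             # batch, frames, tokens per frame, feature dim (illustrative)
    refiner = SpaceTimeRefiner(D)
    vis = torch.randn(B, T, N, D)
    chain = [torch.randn(B, D) for _ in range(3)]   # e.g. three short sentences from the Planner
    out = planner_refiner(vis, chain, refiner)
    print(out.shape)                        # torch.Size([2, 8, 16, 256])
```

The recurrent loop is the point of the sketch: each short sentence refines the token representations produced by the previous step, so language guidance is applied in sequence rather than all at once.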
Related papers
- Plan-X: Instruct Video Generation via Semantic Planning [36.020841550221824]
Plan-X is a framework that explicitly enforces high-level semantic planning to instruct the video generation process. The framework substantially reduces visual hallucinations and enables fine-grained, instruction-aligned video generation consistent with multimodal context.
arXiv Detail & Related papers (2025-11-22T08:59:09Z) - Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding [30.223279362023337]
Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. Existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. We propose DualGround, a dual-branch architecture that explicitly separates global and local semantics.
arXiv Detail & Related papers (2025-10-23T05:53:01Z) - Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer [50.69959748410398]
We introduce MingTok, a new family of visual tokenizers with a continuous latent space for unified autoregressive generation and understanding. MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations and unifies diverse vision-language tasks under a single autoregressive prediction paradigm.
arXiv Detail & Related papers (2025-10-08T02:50:14Z) - Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation [17.238084264485988]
Referring Video Object Segmentation (RVOS) aims to segment an object of interest throughout a video based on a language description. PARSE-VOS is a training-free framework powered by Large Language Models (LLMs). PARSE-VOS achieved state-of-the-art performance on three major benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS.
arXiv Detail & Related papers (2025-09-06T15:46:23Z) - Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation [56.001484215308075]
We present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.
arXiv Detail & Related papers (2024-11-28T19:00:03Z) - One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos [41.34787907803329]
VideoLISA is a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos.
VideoLISA generates temporally consistent segmentation masks in videos based on language instructions.
arXiv Detail & Related papers (2024-09-29T07:47:15Z) - ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension [71.03445074045092]
We propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives, i.e., groups of visual tokens. Our method unifies the prompt and answer of visual referential tasks without using additional syntax. ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
arXiv Detail & Related papers (2024-06-17T08:39:16Z) - Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement [19.494104738436892]
We show that our framework can execute compositional instructions zero-shot in simulation and in the real world.
It outperforms language-to-action reactive policies and Large Language Models by a large margin, especially for long instructions that involve compositions of multiple concepts.
arXiv Detail & Related papers (2023-04-27T17:55:13Z) - Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework extends easily to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z) - Language Guided Networks for Cross-modal Moment Retrieval [66.49445903955777]
Cross-modal moment retrieval aims to localize a temporal segment from an untrimmed video described by a natural language query.
Existing methods independently extract the features of videos and sentences.
We present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.
arXiv Detail & Related papers (2020-06-18T12:08:40Z)