TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
- URL: http://arxiv.org/abs/2511.18359v1
- Date: Sun, 23 Nov 2025 09:12:48 GMT
- Title: TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
- Authors: Alexandros Stergiou
- Abstract summary: This paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos. TRANSPORTER learns an optimal transport coupling to VLMs' high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation.
- Score: 56.749972238005604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advances in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high visual fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLMs' high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context. Quantitative and qualitative evaluations across VLMs demonstrate that L2V can provide a fidelity-rich, novel direction for model interpretability that has not been previously explored.
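As a concrete illustration of the two mechanisms the abstract names, the sketch below pairs a Sinkhorn-style entropic optimal-transport coupling between two embedding sets with a logit-weighted embedding direction used as a conditioning signal. All function names, shapes, and hyperparameters are assumptions made for illustration; this is a minimal sketch of the general technique under those assumptions, not the authors' implementation.

import torch

def sinkhorn_coupling(x, y, eps=0.05, n_iters=200):
    # Entropic OT coupling between embedding sets x (n, d) and y (m, d),
    # computed with plain Sinkhorn iterations over a Gibbs kernel.
    n, m = x.size(0), y.size(0)
    cost = torch.cdist(x, y) ** 2              # pairwise squared L2 costs
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    a = torch.full((n,), 1.0 / n)              # uniform source marginal
    b = torch.full((m,), 1.0 / m)              # uniform target marginal
    u = torch.ones(n)
    for _ in range(n_iters):                   # alternate marginal scalings
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]         # coupling P with marginals a, b

def logit_direction(class_embeddings, logits):
    # Hypothetical reading of "logit scores define embedding directions":
    # a confidence-weighted centroid of class embeddings, normalized to a
    # unit vector that can steer conditional generation.
    w = torch.softmax(logits, dim=-1)
    d = (w[:, None] * class_embeddings).sum(dim=0)
    return d / d.norm()

# Toy usage with random stand-ins for T2V and VLM embeddings.
t2v_feats = torch.randn(8, 512)                # hypothetical T2V text features
vlm_feats = torch.randn(16, 512)               # hypothetical VLM embeddings
P = sinkhorn_coupling(t2v_feats, vlm_feats)
mapped = t2v_feats.size(0) * (P @ vlm_feats)   # barycentric projection into VLM space
direction = logit_direction(vlm_feats, torch.randn(16))

In a full pipeline one would replace the random tensors with real T2V text-encoder and VLM embedding matrices, and apply the resulting direction to the conditioning embedding before video decoding.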
Related papers
- Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO [20.96275248557104]
Video-Next-Event Prediction (VNEP) requires predicting the next event with a dynamic video response rather than text. We introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO, which orchestrates the VLM and VDM to function as a unit. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization.
arXiv Detail & Related papers (2025-11-20T18:59:44Z)
- RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation [19.127189099122244]
We introduce RISE-T2V, which uniquely integrates the processes of prompt rephrasing and semantic feature extraction into a single step. We propose an innovative module called the Rephrasing Adapter, enabling diffusion models to utilize text hidden states.
arXiv Detail & Related papers (2025-11-06T12:42:03Z)
- VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents [105.43882565434444]
We propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs.
arXiv Detail & Related papers (2025-07-07T00:51:57Z)
- 4th PVUW MeViS 3rd Place Report: Sa2VA [105.88675577642204]
We show that a simple modification to the test-time inference method on stronger MLLMs leads to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos.
arXiv Detail & Related papers (2025-04-01T07:06:47Z)
- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos [126.02606196101259]
Sa2VA is a comprehensive, unified model for dense grounded understanding of both images and videos. It supports a wide range of image and video tasks, including referring segmentation and conversation. Sa2VA can be easily extended to various VLMs, including Qwen-VL and Intern-VL.
arXiv Detail & Related papers (2025-01-07T18:58:54Z)
- TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning [0.0]
We present TrafficVLM, a novel multi-modal dense video captioning model for the vehicle ego-camera view.
Our solution achieved outstanding results in Track 2 of the AI City Challenge 2024, ranking us third in the challenge standings.
arXiv Detail & Related papers (2024-04-14T14:51:44Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties [13.938281516499119]
We implement Emergent In-context Learning on Videos (EILEV), a novel training paradigm that induces in-context learning over video and text. Our results, analysis, and EILEV-trained models yield numerous insights about the emergence of in-context learning over video and text.
arXiv Detail & Related papers (2023-11-28T18:53:06Z)
- DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, for zero-shot text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z)
- Make It Move: Controllable Image-to-Video Generation with Text Descriptions [69.52360725356601]
The text-and-image-to-video (TI2V) task aims to generate videos from a static image and a text description.
To address these challenges, we propose a Motion Anchor-based video GEnerator (MAGE) with an innovative motion anchor structure.
Experiments verify the effectiveness of MAGE and show the appealing potential of the TI2V task.
arXiv Detail & Related papers (2021-12-06T07:00:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.