UniVid: The Open-Source Unified Video Model
- URL: http://arxiv.org/abs/2509.24200v2
- Date: Tue, 30 Sep 2025 09:46:27 GMT
- Title: UniVid: The Open-Source Unified Video Model
- Authors: Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, Hao Tang
- Abstract summary: We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter. Experiments on standard benchmarks demonstrate state-of-the-art performance.
- Score: 41.15980565061684
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: (1) maintaining semantic faithfulness during flow-based generation, which suffers from text-visual token imbalance and from uniform cross-modal attention across the flow trajectory; and (2) efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance: a 2.2% improvement on the VBench-Long total score over EasyAnimateV5.1, and accuracy gains of 1.0% on MSVD-QA and 3.3% on ActivityNet-QA over the best prior 7B baselines. Code: https://github.com/AIGeeksGroup/UniVid. Website: https://aigeeksgroup.github.io/UniVid.
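The coupling described in the abstract can be pictured as a small trainable bridge between the MLLM's hidden states and the diffusion decoder's conditioning input. The sketch below is a minimal, hypothetical PyTorch rendering of such an adapter; the module names, dimensions, and the temperature-scaled re-weighting step are illustrative assumptions, not UniVid's actual Temperature Modality Alignment.

```python
import torch
import torch.nn as nn

class LightweightAdapter(nn.Module):
    """Hypothetical bridge: MLLM hidden states -> diffusion conditioning.

    Dimensions and the temperature re-weighting are assumptions for
    illustration; the paper's exact design is not reproduced here.
    """
    def __init__(self, mllm_dim=4096, cond_dim=1024, temperature=0.7):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(mllm_dim),
            nn.Linear(mllm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )
        self.temperature = temperature

    def forward(self, mllm_hidden):                 # (B, T, mllm_dim)
        cond = self.proj(mllm_hidden)               # (B, T, cond_dim)
        # Temperature-scaled token weighting: a sharper or softer emphasis
        # across text and visual tokens (an assumed stand-in for modality
        # alignment, aimed at the token-imbalance problem above).
        weights = torch.softmax(cond.norm(dim=-1) / self.temperature, dim=-1)
        return cond * weights.unsqueeze(-1)         # re-weighted conditioning
```

A diffusion decoder would then consume the re-weighted sequence as cross-attention context.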
Related papers
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation [14.141157176094737]
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions. Existing I2V pipelines often suffer from appearance drift and geometric distortion. We propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views.
arXiv Detail & Related papers (2026-02-10T18:59:51Z)
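One way to read ConsID-Gen's "augmenting the first frame with unposed auxiliary views" is cross-attention from first-frame tokens to auxiliary-view tokens. The sketch below shows that hypothetical reading; the module names and fusion scheme are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class ViewAssistedConditioner(nn.Module):
    """Hypothetical conditioning: first-frame tokens query unposed aux views."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.view_proj = nn.Linear(dim, dim)     # embed auxiliary-view tokens
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, first_frame: torch.Tensor, aux_views: torch.Tensor) -> torch.Tensor:
        # first_frame: (B, N, dim) tokens; aux_views: (B, V*N, dim) view tokens.
        views = self.view_proj(aux_views)
        # First-frame tokens query the views for identity and geometry cues.
        fused, _ = self.fuse(first_frame, views, views)
        return first_frame + fused               # residual view-augmented condition
```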
- Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks [4.888851550406879]
This paper proposes a weakly supervised framework to tackle the automatic recognition of "concealed emotions" in videos. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts performance from under 0.6 in prior work to over 0.69.
arXiv Detail & Related papers (2026-02-08T17:02:55Z)
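As a rough illustration of the VLM pseudo-labeling idea, the sketch below shows a generic confidence-filtered pseudo-label loop. The `vlm_predict` and `train_student` callables, the threshold, and the data format are hypothetical placeholders, not the paper's pipeline.

```python
from typing import Callable, List, Tuple

def pseudo_label_loop(
    unlabeled_clips: List[str],
    vlm_predict: Callable[[str], Tuple[str, float]],   # clip -> (label, confidence)
    train_student: Callable[[List[Tuple[str, str]]], None],
    confidence_threshold: float = 0.8,
) -> None:
    """Weak-to-strong training: keep only confident VLM labels."""
    pseudo_labeled = []
    for clip in unlabeled_clips:
        label, confidence = vlm_predict(clip)
        if confidence >= confidence_threshold:   # filter noisy weak labels
            pseudo_labeled.append((clip, label))
    train_student(pseudo_labeled)                # supervise the stronger model
```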
- FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging [27.981298261747288]
FlashVID is a training-free acceleration framework for Video Large Language Models (VLLMs). It first selects the most representative tokens as a basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) to remove fine-grained spatiotemporal redundancy. FlashVID serves as a training-free, plug-and-play module for long video inputs, enabling a 10x increase in video frame input to Qwen2.5-VL.
arXiv Detail & Related papers (2026-02-08T15:56:46Z)
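Similarity-based token merging of this kind is often implemented by folding redundant tokens into their nearest representatives. The snippet below is a simplified single-step merge by cosine similarity over plain tensors; it illustrates the general technique, not FlashVID's tree-based algorithm.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """One greedy merge step: fold each dropped token into its nearest kept token.

    tokens: (num_tokens, dim). A simplified stand-in for tree-based merging.
    """
    n = tokens.shape[0]
    n_keep = max(1, int(n * keep_ratio))
    # Rank tokens by feature norm as a crude "representativeness" proxy.
    order = tokens.norm(dim=-1).argsort(descending=True)
    kept, dropped = tokens[order[:n_keep]], tokens[order[n_keep:]]
    if dropped.numel() == 0:
        return kept
    # Assign each dropped token to its most similar kept token and average.
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)                        # (n_drop,)
    merged = kept.clone()
    counts = torch.ones(n_keep, device=tokens.device)
    for i, j in enumerate(assign.tolist()):
        merged[j] += dropped[i]
        counts[j] += 1
    return merged / counts.unsqueeze(-1)               # mean of each merge group
```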
- LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs [52.24096832965001]
We present LLaVA-UHD v3, an MLLM centered on our proposed Progressive Visual Compression (PVC) method. PVC can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient native-resolution encoding. Building upon ViT-UHD, LLaVA-UHD v3 achieves performance competitive with Qwen2-VL while further reducing time-to-first-token (TTFT) by 1.9x.
arXiv Detail & Related papers (2025-11-26T08:11:10Z)
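Progressive compression of patch tokens is commonly realized by pooling tokens between transformer stages. The sketch below shows that generic pattern, stage-wise 2x2 average pooling of a square patch grid; the stage layout and dimensions are assumptions, not the PVC design.

```python
import torch
import torch.nn as nn

class StagewiseTokenPooling(nn.Module):
    """Generic progressive compression: halve the patch grid between stages.

    A hedged illustration of the idea, not the paper's PVC module.
    """
    def __init__(self, dim: int = 768, depth_per_stage: int = 4, num_stages: int = 3):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.stages = nn.ModuleList(
            nn.TransformerEncoder(layer(), depth_per_stage) for _ in range(num_stages)
        )
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x: torch.Tensor, grid: int) -> torch.Tensor:
        # x: (batch, grid*grid, dim) patch tokens from a native-resolution image.
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i < len(self.stages) - 1:             # compress between stages
                b, n, d = x.shape
                x = x.transpose(1, 2).reshape(b, d, grid, grid)
                x = self.pool(x)                     # 2x2 average pooling
                grid //= 2
                x = x.reshape(b, d, grid * grid).transpose(1, 2)
        return x                                     # token count drops 4x per pooled stage
```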
- Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification [1.4270165633706586]
Video-based person re-identification (ReID) in cross-view domains remains an open problem. We propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. We show that MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark.
arXiv Detail & Related papers (2025-11-04T13:37:59Z)
- EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization [17.622013322533423]
We introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks. EVA02-AT efficiently transfers an image-based CLIP model into a unified video encoder via single-stage pretraining. We further introduce the Symmetric Multi-Similarity (SMS) loss and a novel training framework that advances all soft labels for both positive and negative pairs.
arXiv Detail & Related papers (2025-06-17T09:51:51Z)
- Seedance 1.0: Exploring the Boundaries of Video Generation Models [71.26796999246068]
Seedance 1.0 is a high-performance and inference-efficient video foundation generation model. It integrates multi-source curated data augmented with precise and meaningful video captioning. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (on an NVIDIA L20).
arXiv Detail & Related papers (2025-06-10T17:56:11Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
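Using downstream answer accuracy as a reward lends itself to a simple policy-gradient loop. Below is a minimal REINFORCE-style sketch under that assumption; `frame_logits` is assumed to be the output of the selector network being optimized, and `answer_is_correct` is a hypothetical stand-in for the downstream QA model.

```python
import torch

def reinforce_frame_selector(
    frame_logits: torch.Tensor,   # (num_frames,) selector output for one video
    k: int,
    answer_is_correct,            # callable: selected frame indices -> bool
    optimizer: torch.optim.Optimizer,
) -> None:
    """One trial-and-error update: reward = downstream answer correctness."""
    probs = torch.softmax(frame_logits, dim=-1)
    # Sample k frames without replacement from the selector's distribution.
    idx = torch.multinomial(probs, k, replacement=False)
    reward = 1.0 if answer_is_correct(idx.tolist()) else 0.0
    # REINFORCE: push up log-probability of sampled frames, scaled by reward.
    log_prob = torch.log(probs[idx]).sum()
    loss = -(reward - 0.5) * log_prob     # fixed 0.5 baseline reduces variance
    optimizer.zero_grad()
    loss.backward()                       # gradients flow into the selector
    optimizer.step()
```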
- DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z)
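Merging and later "unmerging" requires remembering which source token went to which merged slot. The sketch below shows that bookkeeping in its simplest form, a threshold merge plus index-based expansion; it is a toy rendering of the merge/unmerge idea, not DToMe/VTU.

```python
import torch
import torch.nn.functional as F

def merge_with_map(tokens: torch.Tensor, threshold: float = 0.9):
    """Greedy merge: each token joins the first earlier group it resembles.

    Returns merged tokens and an index map so the sequence can be re-expanded.
    tokens: (n, dim). A toy version of merge-then-virtual-unmerge bookkeeping.
    """
    sim = F.normalize(tokens, dim=-1) @ F.normalize(tokens, dim=-1).T
    groups, index_map = [], []
    for i in range(tokens.shape[0]):
        for g, members in enumerate(groups):
            if sim[i, members[0]] > threshold:   # similar to the group's anchor
                members.append(i)
                index_map.append(g)
                break
        else:
            groups.append([i])
            index_map.append(len(groups) - 1)
    merged = torch.stack([tokens[m].mean(dim=0) for m in groups])
    return merged, torch.tensor(index_map, device=tokens.device)

def unmerge(merged: torch.Tensor, index_map: torch.Tensor) -> torch.Tensor:
    """Virtually restore the original sequence length by indexed gathering."""
    return merged[index_map]
```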
- Saliency-Motion Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation [8.912201177914858]
We propose the Saliency-Motion guided Trunk-Collateral Network (SMTC-Net), which introduces a novel trunk-collateral structure for motion-appearance unsupervised video object segmentation (UVOS). SMTC-Net achieves state-of-the-art performance on three UVOS datasets.
arXiv Detail & Related papers (2025-04-08T11:02:14Z)
- ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement [68.05833403672274]
Existing unified models have struggled to handle three fundamental capabilities within one model: understanding, generation, and editing. ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves fine-grained textures and text-aligned semantics. We also employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution.
arXiv Detail & Related papers (2025-04-02T17:45:00Z)
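A dual tokenizer of this sort can be thought of as two parallel encoders whose token streams are combined per patch. The following is a schematic, hypothetical sketch (a pixel-texture branch plus a semantic branch, concatenated), not the actual DualViTok architecture.

```python
import torch
import torch.nn as nn

class DualVisualTokenizer(nn.Module):
    """Schematic dual tokenizer: texture branch + semantic branch.

    Branch designs and dimensions are illustrative assumptions only.
    """
    def __init__(self, dim: int = 512, patch: int = 16):
        super().__init__()
        # Pixel branch: shallow conv patchifier preserving fine-grained texture.
        self.pixel_branch = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Semantic branch: deeper stack standing in for a pretrained,
        # text-aligned vision encoder (e.g. a CLIP-style backbone).
        self.semantic_branch = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=patch, stride=patch),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W) -> two token streams, concatenated per patch.
        tex = self.pixel_branch(image).flatten(2).transpose(1, 2)    # (B, N, dim)
        sem = self.semantic_branch(image).flatten(2).transpose(1, 2)
        return torch.cat([tex, sem], dim=-1)                         # (B, N, 2*dim)
```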
- The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Existing methods for video reasoning segmentation rely heavily on a single special token to represent the object in the video. We propose VRS-HQ, an end-to-end video reasoning segmentation approach. Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z)
- VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval [8.908777234657046]
Large language and vision-language models (LLMs/LVLMs) have gained prominence across various domains. We propose VideoLights, a novel framework for joint video highlight detection and moment retrieval (HD/MR) built around convolutional projection and feature refinement modules. Comprehensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2024-12-02T14:45:53Z)
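A convolutional projection with feature refinement is typically a 1D convolution over the frame axis followed by a gating step. The snippet below is one plausible, hypothetical reading of such a module; it is not VideoLights' published design.

```python
import torch
import torch.nn as nn

class ConvProjectionRefine(nn.Module):
    """Hypothetical conv projection + gated feature refinement over frames."""
    def __init__(self, in_dim: int = 768, out_dim: int = 256):
        super().__init__()
        # Temporal conv projection: mixes each frame with its neighbors.
        self.proj = nn.Conv1d(in_dim, out_dim, kernel_size=3, padding=1)
        # Refinement gate: suppresses uninformative frame features.
        self.gate = nn.Sequential(nn.Linear(out_dim, out_dim), nn.Sigmoid())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, in_dim)
        x = self.proj(frames.transpose(1, 2)).transpose(1, 2)   # (B, T, out_dim)
        return x * self.gate(x)                                 # gated refinement
```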
- TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations [23.188508465235717]
We propose two strategies to enhance the model's capability in video understanding tasks.
The first approach enhances Rotary Position Embedding (RoPE) with a Temporal-Aware Dual RoPE.
The second approach enhances the attention mask with a Frame-wise Block Causal Attention Mask (see the sketch below).
arXiv Detail & Related papers (2024-09-05T02:54:17Z)
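A frame-wise block causal mask lets tokens attend bidirectionally within their own frame while remaining causal across frames. Here is a small sketch of that masking pattern under those assumptions; TC-LLaVA's exact formulation may differ.

```python
import torch

def frame_block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed), shape (N, N) with N = F * T.

    Tokens attend bidirectionally inside their own frame and causally to all
    tokens of earlier frames: one reading of a frame-wise block causal mask.
    """
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame          # frame index per token
    # Allowed iff query's frame >= key's frame (same frame or any earlier one).
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

# Example: 3 frames x 2 tokens -> 6x6 mask with 2x2 bidirectional blocks on
# the diagonal and full attention to earlier frames.
mask = frame_block_causal_mask(3, 2)
```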
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-Guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
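The "even distribution across clusters" is the classic balanced-assignment constraint that Sinkhorn-Knopp iterations enforce. Below is a minimal Sinkhorn normalization over a feature-to-cluster similarity matrix, a generic sketch of the guiding step rather than SIGMA's training code.

```python
import torch

@torch.no_grad()  # assignments are typically computed without gradients
def sinkhorn_assign(scores: torch.Tensor, n_iters: int = 3, eps: float = 0.05) -> torch.Tensor:
    """Balanced soft assignment of features to clusters via Sinkhorn-Knopp.

    scores: (num_features, num_clusters) similarity logits, e.g. from
    space-time tube features against learnable cluster prototypes.
    """
    q = torch.exp(scores / eps)
    q /= q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=0, keepdim=True)   # normalize columns: even cluster usage
        q /= k
        q /= q.sum(dim=1, keepdim=True)   # normalize rows: one unit per feature
        q /= n
    return q * n                          # each row is a distribution over clusters
```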