StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval
- URL: http://arxiv.org/abs/2601.20597v1
- Date: Wed, 28 Jan 2026 13:34:44 GMT
- Title: StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval
- Authors: Shaokun Wang, Weili Guan, Jizhou Han, Jianlong Wu, Yupeng Hu, Liqiang Nie,
- Abstract summary: A key challenge in Continual Text-to-Video Retrieval is feature drift.<n>We propose StructAlign, a structured cross-modal alignment method for CTVR.<n>Our method consistently outperforms state-of-the-art continual retrieval approaches.
- Score: 75.28673512571449
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text-video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method consistently outperforms state-of-the-art continual retrieval approaches.
Related papers
- AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation [45.753757870577196]
We introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning.<n>We show that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses.
arXiv Detail & Related papers (2026-02-04T15:42:58Z) - TRACER: Texture-Robust Affordance Chain-of-Thought for Deformable-Object Refinement [37.77903164878976]
TRACER is a Texture-Robust Affordance Chain-of-thought with dEformable-object Refinement framework.<n>It provides a cross-hierarchical mapping from semantic reasoning to appearance-robust and physically consistent functional region refinement.<n>Experiments conducted on the Fine-AGDDO15 dataset and a real-world robotic platform demonstrate that TRACER significantly improves affordance grounding precision.
arXiv Detail & Related papers (2026-01-28T03:12:18Z) - STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution [60.06664986365803]
We present STCDiT, a video super-resolution framework built upon a pre-trained video diffusion model.<n>It aims to restore structurally faithful and temporally stable videos from degraded inputs, even under complex camera motions.
arXiv Detail & Related papers (2025-11-24T05:37:23Z) - Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models [31.138079872368532]
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges.<n>Recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models.<n>We introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames.
arXiv Detail & Related papers (2025-06-10T20:34:47Z) - Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition [52.89441679581216]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise.<n>We present an innovative video decomposition strategy that incorporates view-independent and view-dependent components.<n>Our framework consistently outperforms existing methods, establishing a new SOTA performance.
arXiv Detail & Related papers (2024-05-24T15:56:40Z) - Motion-Aware Video Frame Interpolation [49.49668436390514]
We introduce a Motion-Aware Video Frame Interpolation (MA-VFI) network, which directly estimates intermediate optical flow from consecutive frames.
It not only extracts global semantic relationships and spatial details from input frames with different receptive fields, but also effectively reduces the required computational cost and complexity.
arXiv Detail & Related papers (2024-02-05T11:00:14Z) - RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named textbfRecurrent vtextbfIdeo textbfGAN textbfInversion and etextbfDiting (RIGID)
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z) - Understanding and Constructing Latent Modality Structures in Multi-modal
Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z) - Temporal Consistency Learning of inter-frames for Video Super-Resolution [38.26035126565062]
Video super-resolution (VSR) is a task that aims to reconstruct high-resolution (HR) frames from the low-resolution (LR) reference frame and multiple neighboring frames.
Existing methods generally explore information propagation and frame alignment to improve the performance of VSR.
We propose a Temporal Consistency learning Network (TCNet) for VSR in an end-to-end manner, to enhance the consistency of the reconstructed videos.
arXiv Detail & Related papers (2022-11-03T08:23:57Z) - Exploring Motion Ambiguity and Alignment for High-Quality Video Frame
Interpolation [46.02120172459727]
We propose to relax the requirement of reconstructing an intermediate frame as close to the ground-truth (GT) as possible.
We develop a texture consistency loss (TCL) upon the assumption that the interpolated content should maintain similar structures with their counterparts in the given frames.
arXiv Detail & Related papers (2022-03-19T10:37:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.