Related papers: X-Aligner: Composed Visual Retrieval without the Bells and Whistles

X-Aligner: Composed Visual Retrieval without the Bells and Whistles

URL: http://arxiv.org/abs/2601.16582v1
Date: Fri, 23 Jan 2026 09:33:38 GMT
Title: X-Aligner: Composed Visual Retrieval without the Bells and Whistles
Authors: Yuqian Zheng, Mariana-Iuliana Georgescu,
Abstract summary: We propose a novel Composed Video Retrieval (CoVR) framework that leverages the representational power of Vision Language Models (VLMs)<n>Our framework incorporates a novel cross-attention module X-Aligner, composed of cross-attention layers that progressively fuse visual and textual inputs.<n>Our framework achieves state-of-the-art performance on CoVR obtaining a Recall@1 of 63.93% on Webvid-CoVR-Test, and demonstrates strong zero-shot generalization on CIR tasks.
Score: 5.3880484326593745
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Composed Video Retrieval (CoVR) facilitates video retrieval by combining visual and textual queries. However, existing CoVR frameworks typically fuse multimodal inputs in a single stage, achieving only marginal gains over initial baseline. To address this, we propose a novel CoVR framework that leverages the representational power of Vision Language Models (VLMs). Our framework incorporates a novel cross-attention module X-Aligner, composed of cross-attention layers that progressively fuse visual and textual inputs and align their multimodal representation with that of the target video. To further enhance the representation of the multimodal query, we incorporate the caption of the visual query as an additional input. The framework is trained in two stages to preserve the pretrained VLM representation. In the first stage, only the newly introduced module is trained, while in the second stage, the textual query encoder is also fine-tuned. We implement our framework on top of BLIP-family architecture, namely BLIP and BLIP-2, and train it on the Webvid-CoVR data set. In addition to in-domain evaluation on Webvid-CoVR-Test, we perform zero-shot evaluations on the Composed Image Retrieval (CIR) data sets CIRCO and Fashion-IQ. Our framework achieves state-of-the-art performance on CoVR obtaining a Recall@1 of 63.93% on Webvid-CoVR-Test, and demonstrates strong zero-shot generalization on CIR tasks.

Related papers

EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition [54.55914886780534]
Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion.<n>We introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR.<n>EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios.
arXiv Detail & Related papers (2026-02-13T13:25:05Z)
PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval [9.493866391853723]
Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text.<n>Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs)<n>We introduce PREGEN, an efficient and powerful CoVR framework that overcomes these limitations.
arXiv Detail & Related papers (2026-01-20T09:57:04Z)
ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline.<n>We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z)
A Simple Baseline with Single-encoder for Referring Image Segmentation [14.461024566536478]
We present a novel RIS method with a single-encoder, i.e., BEiT-3, maximizing the potential of shared self-attention.<n>Our simple baseline with a single encoder achieves outstanding performances on the RIS benchmark datasets.
arXiv Detail & Related papers (2024-08-28T04:14:01Z)
Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information. Our approach achieves gains as high as around 7% in terms of recall@K=1 score.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
CoVR-2: Automatic Data Construction for Composed Video Retrieval [59.854331104466254]
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together. We propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs. We also expand the scope of the task to include composed video retrieval (CoVR)
arXiv Detail & Related papers (2023-08-28T17:55:33Z)
VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels. We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z)
VLG: General Video Recognition with Web Textual Knowledge [47.3660792813967]
We focus on the general video recognition (GVR) problem of solving different recognition tasks within a unified framework. By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we present a unified visual-linguistic framework (VLG) Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then devises a flexible bi-modal attention head to collaborate high-level semantic concepts under different settings.
arXiv Detail & Related papers (2022-12-03T15:46:49Z)
MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation. Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
Fine-Grained Instance-Level Sketch-Based Video Retrieval [159.12935292432743]
We propose a novel cross-modal retrieval problem of fine-grained instance-level sketch-based video retrieval (FG-SBVR) Compared with sketch-based still image retrieval, and coarse-grained category-level video retrieval, this is more challenging as both visual appearance and motion need to be simultaneously matched at a fine-grained level. We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.
arXiv Detail & Related papers (2020-02-21T18:28:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.