X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2509.21559v1
- Date: Thu, 25 Sep 2025 20:39:45 GMT
- Title: X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning
- Authors: Prasanna Reddy Pulakurthi, Jiamian Wang, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Zhiqiang Tao
- Abstract summary: This work proposes X-CoT, an explainable retrieval framework built upon LLM CoT reasoning. We first expand the existing benchmarks with additional video annotations to support semantic understanding. X-CoT empirically improves the retrieval performance and produces detailed rationales.
- Score: 23.9465771255843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prevalent text-to-video retrieval systems mainly adopt embedding models for feature extraction and compute cosine similarities for ranking. However, this design presents two limitations. Low-quality text-video data pairs could compromise the retrieval, yet are hard to identify and examine. Cosine similarity alone provides no explanation for the ranking results, limiting the interpretability. We ask: can we interpret the ranking results, so as to assess the retrieval models and examine the text-video data? This work proposes X-CoT, an explainable retrieval framework built upon LLM CoT reasoning in place of embedding model-based similarity ranking. We first expand the existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. We also devise a retrieval CoT consisting of pairwise comparison steps, yielding detailed reasoning and a complete ranking. X-CoT empirically improves the retrieval performance and produces detailed rationales. It also facilitates model behavior and data quality analysis. Code and data are available at: https://github.com/PrasannaPulakurthi/X-CoT.
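To make the contrast concrete, below is a minimal sketch of the two ranking paradigms the abstract compares: embedding-based cosine ranking versus LLM-driven pairwise comparison. The `llm_prefers` helper and the insertion-based ordering are illustrative assumptions, not the authors' X-CoT implementation.

```python
import numpy as np

def cosine_rank(query_emb: np.ndarray, video_embs: np.ndarray) -> list[int]:
    """Baseline: rank videos by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    return list(np.argsort(-(v @ q)))  # one scalar per video, no rationale

def llm_prefers(query: str, cand_a: str, cand_b: str) -> bool:
    """Hypothetical pairwise CoT step: ask an LLM which candidate description
    better matches the query and to explain why (placeholder, no real call)."""
    raise NotImplementedError("plug in an LLM call here")

def pairwise_cot_rank(query: str, candidates: list[str]) -> list[str]:
    """Rank candidates by repeated pairwise comparisons (insertion ordering),
    so every ordering decision comes with an LLM-generated rationale."""
    ranked: list[str] = []
    for cand in candidates:
        pos = 0
        # walk past every already-ranked candidate the LLM still prefers
        while pos < len(ranked) and llm_prefers(query, ranked[pos], cand):
            pos += 1
        ranked.insert(pos, cand)
    return ranked
```

In practice one would only rerank a small shortlist from a first-stage retriever, since full pairwise comparison grows quadratically with the number of candidates.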
Related papers
- RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval [99.33724613432922]
We introduce RANKVIDEO, a reasoning-based reranker for video retrieval. RANKVIDEO explicitly reasons over query-video pairs using video content to assess relevance. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework.
arXiv Detail & Related papers (2026-02-02T18:40:37Z)
- TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval [1.8434042562191815]
We propose a Text-Conditioned Multi-Grained Contrast framework, dubbed TC-MGC. Our model employs a language-video attention block to generate aggregated frame and video representations conditioned on word- and text-level attention weights over frames. Empirically, TC-MGC achieves competitive results on multiple text-video retrieval benchmarks.
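A rough sketch of what text-conditioned frame aggregation can look like, in the spirit of the summary above; the function name and shapes are assumptions for illustration, not TC-MGC's actual attention block.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def text_conditioned_video_embedding(text_emb: np.ndarray,
                                     frame_embs: np.ndarray) -> np.ndarray:
    """Pool per-frame embeddings into a single video embedding, weighting each
    frame by its attention score against the text embedding."""
    attn = softmax(frame_embs @ text_emb)  # one attention weight per frame
    return attn @ frame_embs               # weighted sum over frames
```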
arXiv Detail & Related papers (2025-04-07T03:33:14Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- Unified Coarse-to-Fine Alignment for Video-Text Retrieval [71.85966033484597]
We propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA.
Our model captures the cross-modal similarity information at different granularity levels.
We apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them.
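The Sinkhorn-Knopp step mentioned above can be pictured as alternating row and column normalization of a query-video similarity matrix. A generic sketch, assuming a NumPy similarity matrix; this is not the UCoFiA code.

```python
import numpy as np

def sinkhorn_normalize(sim: np.ndarray, n_iters: int = 20,
                       eps: float = 1e-8) -> np.ndarray:
    """Alternately normalize rows and columns of a positive similarity matrix
    so that no single query or video dominates the aggregated score."""
    m = np.exp(sim)  # make all entries strictly positive
    for _ in range(n_iters):
        m = m / (m.sum(axis=1, keepdims=True) + eps)  # row normalization
        m = m / (m.sum(axis=0, keepdims=True) + eps)  # column normalization
    return m
```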
arXiv Detail & Related papers (2023-09-18T19:04:37Z)
- Description-Based Text Similarity [59.552704474862004]
We identify the need to search for texts based on abstract descriptions of their content.
We propose an alternative model that significantly improves when used in standard nearest neighbor search.
arXiv Detail & Related papers (2023-05-21T17:14:31Z)
- What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary [68.77983831618685]
We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space.
We show that the resulting projections contain rich semantic information, and draw connections between them and sparse retrieval.
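One way to picture the vocabulary-space projection described above: score a dense vector against every token embedding and keep the top entries. This is a generic sketch with assumed inputs, not the paper's exact procedure.

```python
import numpy as np

def project_to_vocab(dense_vec: np.ndarray,
                     token_embeddings: np.ndarray,
                     id_to_token: dict[int, str],
                     top_k: int = 10) -> list[tuple[str, float]]:
    """Read a dense query/passage vector as a distribution over the vocabulary
    by scoring it against every token embedding."""
    logits = token_embeddings @ dense_vec        # shape: [vocab_size]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(-probs)[:top_k]
    return [(id_to_token[int(i)], float(probs[i])) for i in top]
```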
arXiv Detail & Related papers (2022-12-20T16:03:25Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List [19.212507277554415]
COIL is a contextualized exact match retrieval architecture that brings semantic lexical matching.
COIL outperforms classical lexical retrievers and state-of-the-art deep LM retrievers with similar or smaller latency.
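A toy version of contextualized exact lexical match in the spirit of COIL: only tokens shared between query and document contribute, and they are scored with contextual token vectors rather than term frequencies. Illustrative only, not the COIL implementation.

```python
import numpy as np

def coil_style_score(query_toks: list[tuple[str, np.ndarray]],
                     doc_toks: list[tuple[str, np.ndarray]]) -> float:
    """For each query token, take the max dot product over document occurrences
    of the same surface token, then sum the per-token scores."""
    doc_index: dict[str, list[np.ndarray]] = {}
    for tok, vec in doc_toks:
        doc_index.setdefault(tok, []).append(vec)

    score = 0.0
    for tok, q_vec in query_toks:
        if tok in doc_index:  # exact lexical match is required to contribute
            score += max(float(q_vec @ d_vec) for d_vec in doc_index[tok])
    return score
```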
arXiv Detail & Related papers (2021-04-15T00:53:54Z)
- On Semantic Similarity in Video Retrieval [31.61611168620582]
We propose a move to semantic similarity video retrieval, where multiple videos/captions can be deemed equally relevant.
Our analysis is performed on three commonly used video retrieval datasets (MSR-VTT, YouCook2, and EPIC-KITCHENS).
arXiv Detail & Related papers (2021-03-18T09:12:40Z)