Queries Are Not Alone: Clustering Text Embeddings for Video Search
- URL: http://arxiv.org/abs/2510.07720v1
- Date: Thu, 09 Oct 2025 02:56:18 GMT
- Title: Queries Are Not Alone: Clustering Text Embeddings for Video Search
- Authors: Peyang Liu, Xi Wang, Ziqiang Cui, Wei Ye
- Abstract summary: This paper introduces a novel framework, the Video-Text Cluster (VTC), which enhances video retrieval by clustering text queries to capture a broader semantic scope. We propose a unique clustering mechanism that groups related queries, enabling our system to consider multiple interpretations and nuances of each query. We also introduce the Video-Text Cluster-Attention (VTC-Att), which adjusts the clusters based on the video content, ensuring that the retrieval process emphasizes the most relevant textual features.
- Score: 10.695503567368732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid proliferation of video content across various platforms has highlighted the urgent need for advanced video retrieval systems. Traditional methods, which primarily depend on directly matching textual queries with video metadata, often fail to bridge the semantic gap between text descriptions and the multifaceted nature of video content. This paper introduces a novel framework, the Video-Text Cluster (VTC), which enhances video retrieval by clustering text queries to capture a broader semantic scope. We propose a unique clustering mechanism that groups related queries, enabling our system to consider multiple interpretations and nuances of each query. This clustering is further refined by our innovative Sweeper module, which identifies and mitigates noise within these clusters. Additionally, we introduce the Video-Text Cluster-Attention (VTC-Att) mechanism, which dynamically adjusts focus within the clusters based on the video content, ensuring that the retrieval process emphasizes the most relevant textual features. Further experiments have demonstrated that our proposed model surpasses existing state-of-the-art models on five public datasets.
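The abstract outlines the core mechanics: embed text queries, cluster related ones, then weight cluster members by their relevance to a candidate video. Below is a minimal, hypothetical sketch of that pipeline in NumPy/scikit-learn; the function names, the k-means clustering choice, and the dot-product attention are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of query clustering with video-conditioned attention.
# All names, dimensions, and design choices are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cluster_queries(query_embs: np.ndarray, n_clusters: int = 8):
    """Group related query embeddings so a query is scored with its neighbors."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(query_embs)
    return km.labels_

def video_conditioned_cluster_score(video_emb, query_embs, labels, query_idx):
    """Score a video against the cluster containing `query_idx`.
    Attention weights are derived from the video, so the most video-relevant
    queries in the cluster dominate the cluster representation."""
    members = query_embs[labels == labels[query_idx]]  # (m, d)
    attn = softmax(members @ video_emb)                # video-conditioned weights
    cluster_rep = attn @ members                       # (d,)
    return float(cluster_rep @ video_emb)

# toy usage with random embeddings
rng = np.random.default_rng(0)
queries = rng.normal(size=(100, 64)).astype(np.float32)
video = rng.normal(size=64).astype(np.float32)
labels = cluster_queries(queries, n_clusters=8)
print(video_conditioned_cluster_score(video, queries, labels, query_idx=3))
```

In a real system the embeddings would come from trained text and video encoders, and the Sweeper-style noise filtering described in the abstract would prune outlier queries from each cluster before scoring.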
Related papers
- Beyond Simple Edits: Composed Video Retrieval with Dense Modifications [96.46069692338645]
We introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments. Dense-WebVid-CoVR consists of 1.6 million samples with dense modification text, around seven times more than its existing counterpart. We develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion (a generic fusion sketch follows this entry).
arXiv Detail & Related papers (2025-08-19T17:59:39Z)
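Dense-WebVid-CoVR attributes part of its gains to Cross-Attention (CA) fusion of visual and textual information. The snippet below sketches generic single-head cross-attention in which text tokens attend to video frame features; the shapes and projection matrices are assumptions for illustration, not the paper's architecture.

```python
# Minimal single-head cross-attention in NumPy: text tokens attend to video
# frame features. A generic illustration of CA fusion, not the paper's model.
import numpy as np

def cross_attention(text, video, Wq, Wk, Wv):
    """text: (Lt, d), video: (Lv, d) -> video-informed text features (Lt, d)."""
    q, k, v = text @ Wq, video @ Wk, video @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (Lt, Lv) text-to-frame affinities
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                  # row-wise softmax
    return w @ v                                   # each token becomes a mix of frames

rng = np.random.default_rng(0)
d = 32
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
fused = cross_attention(rng.normal(size=(7, d)), rng.normal(size=(20, d)), Wq, Wk, Wv)
print(fused.shape)  # (7, 32)
```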
- VideoRAG: Retrieval-Augmented Generation over Video Corpus [57.68536380621672]
VideoRAG is a framework that dynamically retrieves videos based on their relevance to queries (a minimal retrieval-step sketch follows this entry). VideoRAG is powered by recent Large Video Language Models (LVLMs). We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.
arXiv Detail & Related papers (2025-01-10T11:17:15Z)
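VideoRAG's first stage retrieves videos by relevance to the query before an LVLM consumes them. A minimal version of that retrieval step, assuming precomputed embeddings and cosine similarity (both assumptions), looks like this:

```python
# Sketch of the retrieval step in a retrieval-augmented pipeline: pick the
# top-k videos by cosine similarity to the query embedding, then hand them
# to a video-language model. Embeddings are placeholders (assumptions).
import numpy as np

def top_k_videos(query_emb: np.ndarray, video_embs: np.ndarray, k: int = 3):
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ q                      # cosine similarity per video
    return np.argsort(-sims)[:k]      # indices of the k most relevant videos

rng = np.random.default_rng(1)
print(top_k_videos(rng.normal(size=128), rng.normal(size=(1000, 128)), k=3))
```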
- Towards Open-Vocabulary Video Semantic Segmentation [40.58291642595943]
We introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context.
arXiv Detail & Related papers (2024-12-12T14:53:16Z)
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
- Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach [56.610806615527885]
A key challenge in text-video retrieval (TVR) is the information asymmetry between video and text. This paper introduces a data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. We propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy (a relevance-diversity selection sketch follows this entry).
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
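This paper selects queries that are both relevant and diverse. Maximal marginal relevance (MMR) is a standard way to trade off those two objectives; whether the paper uses MMR specifically is not stated, so treat the sketch below purely as an illustration of the idea.

```python
# Relevance-plus-diversity query selection via maximal marginal relevance
# (MMR). Using MMR here is an assumption, not the paper's stated mechanism.
import numpy as np

def mmr_select(video_emb, query_embs, k=5, lam=0.7):
    """Greedily pick k queries, balancing relevance to the video (weight lam)
    against redundancy with already-selected queries (weight 1 - lam)."""
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    v, q = norm(video_emb), norm(query_embs)
    relevance = q @ v
    selected, remaining = [], list(range(len(q)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((q[i] @ q[j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(2)
print(mmr_select(rng.normal(size=64), rng.normal(size=(50, 64)), k=5))
```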
- Generative Video Diffusion for Unseen Novel Semantic Video Moment Retrieval [54.22321767540878]
Video moment retrieval (VMR) aims to locate the most likely video moment corresponding to a text query in untrimmed videos. Training of existing methods is limited by the lack of diverse and generalisable VMR datasets. We propose a Fine-grained Video Editing framework, termed FVE, that explores generative video diffusion.
arXiv Detail & Related papers (2024-01-24T09:45:40Z)
- Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
The main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit.
We formulate the chunk-level matching as n-ary correlation modeling between the words of the query and the frames of the video (a pooled word-frame scoring sketch follows this entry).
arXiv Detail & Related papers (2024-01-06T09:38:55Z)
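One common reading of "n-ary correlations between query words and video frames" is late interaction: score every word against every frame, then pool. The max-over-frames / mean-over-words pooling below is an assumed choice, not necessarily the variational hypergraph formulation the paper actually proposes.

```python
# Late-interaction style word-frame matching: compute a (words x frames)
# similarity matrix, take the best frame per word, average over words.
# The pooling scheme is an assumption for illustration.
import numpy as np

def chunk_video_score(word_embs: np.ndarray, frame_embs: np.ndarray) -> float:
    """word_embs: (W, d), frame_embs: (F, d) -> scalar match score."""
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sim = w @ f.T                          # (W, F) word-frame similarities
    return float(sim.max(axis=1).mean())   # best frame per word, averaged

rng = np.random.default_rng(3)
print(chunk_video_score(rng.normal(size=(6, 48)), rng.normal(size=(30, 48))))
```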
- Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations (a minimal version of this scoring is sketched after this entry).
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
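The similarity-based zero-shot classification that Videoprompter builds on can be stated in a few lines: embed the class labels as text, embed the video, and pick the closest label. The encoders below are toy stubs (assumptions); a real system would use a pretrained VLM such as CLIP.

```python
# CLIP-style zero-shot classification as described above. The text encoder
# is a toy stub (assumption); a real system would call a pretrained VLM.
import numpy as np

def encode_text(label: str, d: int = 64) -> np.ndarray:
    # placeholder text encoder: a deterministic toy seed per label
    r = np.random.default_rng(sum(map(ord, label)))
    e = r.normal(size=d)
    return e / np.linalg.norm(e)

def classify(video_emb: np.ndarray, labels: list[str]) -> str:
    video_emb = video_emb / np.linalg.norm(video_emb)
    sims = {lab: float(encode_text(lab) @ video_emb) for lab in labels}
    return max(sims, key=sims.get)   # label with the highest similarity

rng = np.random.default_rng(4)
print(classify(rng.normal(size=64), ["cooking", "surfing", "lecture"]))
```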
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level (a two-level scoring sketch follows this entry).
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
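HAMMER encodes video at two granularities. A rough two-level reading, with mean-pooled clips and an even 50/50 mix of frame- and clip-level scores (both assumptions), is sketched below.

```python
# Two-level reading of hierarchical encoding: frame features are mean-pooled
# into clip features, and a query is scored against both granularities.
# The pooling and the 50/50 score mix are assumptions for illustration.
import numpy as np

def hierarchical_score(query_emb, frame_embs, clip_size=8):
    F, d = frame_embs.shape
    n_clips = F // clip_size
    clips = frame_embs[: n_clips * clip_size].reshape(n_clips, clip_size, d).mean(1)
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    q = norm(query_emb)
    frame_score = (norm(frame_embs) @ q).max()   # fine-grained evidence
    clip_score = (norm(clips) @ q).max()         # coarse-grained evidence
    return 0.5 * frame_score + 0.5 * clip_score

rng = np.random.default_rng(5)
print(hierarchical_score(rng.normal(size=64), rng.normal(size=(64, 64))))
```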
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.