Related papers: Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

URL: http://arxiv.org/abs/2007.02503v1
Date: Mon, 6 Jul 2020 02:50:27 GMT
Title: Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval
Authors: Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, Tat-Seng Chua
Abstract summary: The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly favor the concept-based paradigm on retrieval with simple queries. We propose a Tree-augmented Cross-modal. method by jointly learning the linguistic structure of queries and the temporal representation of videos.
Score: 98.62404433761432
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly favor the concept-based paradigm on retrieval with simple queries, which are usually ineffective for complex queries that carry far more complex semantics. Recently, embedding-based paradigm has emerged as a popular approach. It aims to map the queries and videos into a shared embedding space where semantically-similar texts and videos are much closer to each other. Despite its simplicity, it forgoes the exploitation of the syntactic structure of text queries, making it suboptimal to model the complex queries. To facilitate video retrieval with complex queries, we propose a Tree-augmented Cross-modal Encoding method by jointly learning the linguistic structure of queries and the temporal representation of videos. Specifically, given a complex user query, we first recursively compose a latent semantic tree to structurally describe the text query. We then design a tree-augmented query encoder to derive structure-aware query representation and a temporal attentive video encoder to model the temporal characteristics of videos. Finally, both the query and videos are mapped into a joint embedding space for matching and ranking. In this approach, we have a better understanding and modeling of the complex queries, thereby achieving a better video retrieval performance. Extensive experiments on large scale video retrieval benchmark datasets demonstrate the effectiveness of our approach.

Related papers

AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier.<n>Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs.<n> Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
arXiv Detail & Related papers (2025-06-16T15:18:15Z)
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding [23.022070084937603]
We introduce a semantics-driven search framework that reformulates selection under the paradigm of Visual Semantic-Logical Search. Our method establishes new SOTA performance on the manually annotated benchmark in key-frame selection metrics.
arXiv Detail & Related papers (2025-03-17T13:07:34Z)
Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval [49.42043077545341]
We propose a knowledge-aware query expansion framework, augmenting LLMs with structured document relations from knowledge graph (KG) We leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR)
arXiv Detail & Related papers (2024-10-17T17:03:23Z)
GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video. By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions. GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STARK, a large-scale Semi-structure retrieval benchmark on Textual and Knowledge Bases. Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z)
Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
Main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content. We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit. We formulate the chunk-level matching as n-ary correlations modeling between words of the query and frames of the video.
arXiv Detail & Related papers (2024-01-06T09:38:55Z)
Decomposing Complex Queries for Tip-of-the-tongue Retrieval [72.07449449115167]
Complex queries describe content elements (e.g., book characters or events), information beyond the document text. This retrieval setting, called tip of the tongue (TOT), is especially challenging for models reliant on lexical and semantic overlap between query and document text. We introduce a simple yet effective framework for handling such complex queries by decomposing the query into individual clues, routing those as sub-queries to specialized retrievers, and ensembling the results.
arXiv Detail & Related papers (2023-05-24T11:43:40Z)
Dense but Efficient VideoQA for Intricate Compositional Reasoning [9.514382838449928]
We suggest a new VideoQA method based on transformer with a deformable attention mechanism to address the complex tasks. The dependency structure within the complex question sentences is also combined with the language embeddings to readily understand the semantic relations among question words.
arXiv Detail & Related papers (2022-10-19T05:01:20Z)
Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented in frame sequence, the visual elements are not sequential but rather hierarchical in semantic space. We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval [11.217452391653762]
VISIONE allows users to search for videos using textual keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial, relationships and image similarity. The peculiarity of our approach is that we encode all the information extracted from the videos using a convenient textual encoding in a single text retrieval engine. This offers great flexibility when results corresponding to various parts of the query (visual, text and locations) have to be merged.
arXiv Detail & Related papers (2020-08-06T16:32:17Z)
Message Passing Query Embedding [4.035753155957698]
We propose a graph neural network to encode a graph representation of a query. We show that the model learns entity embeddings that capture the notion of entity type without explicit supervision.
arXiv Detail & Related papers (2020-02-06T17:40:01Z)
Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes user's query and a long video as inputs. We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module. In the encoding network, we employ a convolutional network with local self-attention mechanism and query-aware global attention mechanism to learns visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.