Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval
- URL: http://arxiv.org/abs/2007.02503v1
- Date: Mon, 6 Jul 2020 02:50:27 GMT
- Title: Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval
- Authors: Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, Tat-Seng Chua
- Abstract summary: The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems.
Traditional methods mainly favor the concept-based paradigm for retrieval with simple queries.
We propose a Tree-augmented Cross-modal Encoding method by jointly learning the linguistic structure of queries and the temporal representation of videos.
- Score: 98.62404433761432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of user-generated videos on the Internet has intensified the
need for text-based video retrieval systems. Traditional methods mainly favor
the concept-based paradigm for retrieval with simple queries, which is usually
ineffective for complex queries that carry far more complex semantics.
Recently, the embedding-based paradigm has emerged as a popular approach. It aims
to map queries and videos into a shared embedding space where
semantically similar texts and videos are closer to each other. Despite
its simplicity, it forgoes exploiting the syntactic structure of text
queries, making it suboptimal for modeling complex queries.
To facilitate video retrieval with complex queries, we propose a
Tree-augmented Cross-modal Encoding method by jointly learning the linguistic
structure of queries and the temporal representation of videos. Specifically,
given a complex user query, we first recursively compose a latent semantic tree
to structurally describe the text query. We then design a tree-augmented query
encoder to derive structure-aware query representation and a temporal attentive
video encoder to model the temporal characteristics of videos. Finally, both
the query and videos are mapped into a joint embedding space for matching and
ranking. This approach yields a better understanding and modeling of
complex queries, and thereby better video retrieval performance.
Extensive experiments on large-scale video retrieval benchmark datasets
demonstrate the effectiveness of our approach.
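For readers who want a concrete picture of the pipeline sketched in the abstract, here is a minimal PyTorch sketch; it is not the authors' implementation. The greedy pairwise composition only approximates the latent semantic tree, and all module names, dimensions, and the cosine-based ranking score are illustrative assumptions.

```python
# Illustrative sketch only; assumes precomputed word embeddings for the query
# and frame-level CNN features for the video.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TreeQueryEncoder(nn.Module):
    """Greedily composes adjacent word nodes into a latent semantic tree and
    returns the root embedding as the structure-aware query representation."""
    def __init__(self, dim):
        super().__init__()
        self.compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())
        self.score = nn.Linear(dim, 1)

    def forward(self, words):            # words: (num_words, dim)
        nodes = list(words)
        while len(nodes) > 1:
            # Score every adjacent pair and merge the highest-scoring one.
            pairs = [torch.cat([a, b]) for a, b in zip(nodes[:-1], nodes[1:])]
            merged = self.compose(torch.stack(pairs))          # (n - 1, dim)
            best = self.score(merged).squeeze(-1).argmax().item()
            nodes = nodes[:best] + [merged[best]] + nodes[best + 2:]
        return nodes[0]                   # root node = query embedding

class TemporalAttentiveVideoEncoder(nn.Module):
    """Pools frame features with learned temporal attention weights."""
    def __init__(self, frame_dim, dim):
        super().__init__()
        self.proj = nn.Linear(frame_dim, dim)
        self.attn = nn.Linear(dim, 1)

    def forward(self, frames):            # frames: (num_frames, frame_dim)
        h = torch.tanh(self.proj(frames))
        w = torch.softmax(self.attn(h).squeeze(-1), dim=0)
        return (w.unsqueeze(-1) * h).sum(dim=0)

def retrieval_score(query_vec, video_vec):
    """Cosine similarity in the joint embedding space, used for ranking."""
    return F.cosine_similarity(query_vec, video_vec, dim=0)

# Toy usage with random features (300-d word vectors, 2048-d frame features).
q_enc, v_enc = TreeQueryEncoder(300), TemporalAttentiveVideoEncoder(2048, 300)
q = q_enc(torch.randn(7, 300))            # 7-word complex query
v = v_enc(torch.randn(30, 2048))          # 30 sampled frames
print(retrieval_score(q, v).item())
```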
Related papers
- Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval [49.42043077545341]
We propose a knowledge-aware query expansion framework, augmenting LLMs with structured document relations from a knowledge graph (KG).
We leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR).
arXiv Detail & Related papers (2024-10-17T17:03:23Z)
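A small, self-contained Python sketch of the knowledge-aware expansion idea summarized in the entry above; the toy knowledge graph, relation names, and scoring are invented for illustration and are not the KAR authors' API.

```python
# Illustrative sketch; the toy KG, relation names, and scoring are assumptions.
kg_edges = {
    "BERT": [("cited_by", "RoBERTa"), ("applied_to", "passage ranking")],
    "RoBERTa": [("cites", "BERT")],
}
node_text = {
    "BERT": "BERT is a bidirectional transformer pretrained with masked language modeling.",
    "RoBERTa": "RoBERTa robustly optimizes BERT pretraining with more data and longer training.",
    "passage ranking": "Passage ranking orders text passages by relevance to a query.",
}

def expand_query(query, seed_entities, allowed_relations):
    """Append the document texts of KG neighbors reached through allowed
    relations, mimicking relation-filtered, document-grounded expansion."""
    expansions = []
    for entity in seed_entities:
        for relation, neighbor in kg_edges.get(entity, []):
            if relation in allowed_relations:    # document-based relation filtering
                expansions.append(node_text[neighbor])
    return query + " " + " ".join(expansions)

print(expand_query("how does BERT rank passages?",
                   seed_entities=["BERT"], allowed_relations={"applied_to"}))
```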
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
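A hedged sketch of the data-centric expansion step described in the entry above; uniform clip segmentation and a stub captioner stand in for GQE's adaptive segmentation and zero-shot captioning model, and all function names are assumptions.

```python
# Illustrative sketch; uniform segmentation and a stub captioner are placeholders.
def segment_into_clips(frames, clip_len=16):
    """Split a frame sequence into short clips (uniform length for simplicity)."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]

def stub_captioner(clip):
    """Placeholder for a zero-shot captioning model."""
    return f"a scene spanning {len(clip)} frames"

def expand_training_text(video_frames, original_caption, captioner=stub_captioner):
    """Enrich a video's text side with per-clip scene descriptions."""
    clip_captions = [captioner(clip) for clip in segment_into_clips(video_frames)]
    return [original_caption] + clip_captions

# Toy usage: 40 dummy frames and a single human caption.
print(expand_training_text(list(range(40)), "a man repairs a bicycle"))
```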
- Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
The main obstacle in text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit.
We formulate chunk-level matching as modeling n-ary correlations between the words of the query and the frames of the video.
arXiv Detail & Related papers (2024-01-06T09:38:55Z)
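A minimal sketch of chunk-level text-video matching as summarized in the entry above; chunk and frame embeddings are assumed to be precomputed, and the n-ary correlation is reduced here to a chunk-frame similarity matrix rather than the paper's variational hypergraph model.

```python
# Illustrative sketch; chunk extraction and hypergraph modeling are simplified away.
import torch
import torch.nn.functional as F

def chunk_frame_score(chunk_embs, frame_embs):
    """chunk_embs: (num_chunks, d); frame_embs: (num_frames, d).
    Each query chunk is matched to its best-aligned frame; scores are averaged."""
    sims = F.normalize(chunk_embs, dim=-1) @ F.normalize(frame_embs, dim=-1).T
    per_chunk = sims.max(dim=1).values     # best frame for each retrieval unit
    return per_chunk.mean()

# Toy usage with random 256-d embeddings for 3 query chunks and 20 frames.
print(chunk_frame_score(torch.randn(3, 256), torch.randn(20, 256)).item())
```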
- Decomposing Complex Queries for Tip-of-the-tongue Retrieval [72.07449449115167]
Complex queries describe content elements (e.g., book characters or events) that go beyond the document text.
This retrieval setting, called tip of the tongue (TOT), is especially challenging for models reliant on lexical and semantic overlap between query and document text.
We introduce a simple yet effective framework for handling such complex queries by decomposing the query into individual clues, routing those as sub-queries to specialized retrievers, and ensembling the results.
arXiv Detail & Related papers (2023-05-24T11:43:40Z)
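A toy Python sketch of the decompose-route-ensemble recipe described in the entry above; the clue extraction heuristics, retriever registry, and additive score fusion are illustrative stand-ins for the paper's specialized components.

```python
# Illustrative sketch; clue typing and retrievers are toy placeholders.
from collections import defaultdict

def decompose(query):
    """Split a tip-of-the-tongue query into typed clues (stubbed with keywords)."""
    clues = []
    if "character" in query: clues.append(("character", query))
    if "published" in query: clues.append(("date", query))
    clues.append(("text", query))          # always keep a lexical clue
    return clues

def ensemble_retrieve(query, retrievers, top_k=5):
    """Route each clue to its specialized retriever and sum document scores."""
    scores = defaultdict(float)
    for clue_type, sub_query in decompose(query):
        for doc, score in retrievers[clue_type](sub_query):
            scores[doc] += score
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

# Toy retrievers returning (doc_id, score) pairs.
retrievers = {
    "character": lambda q: [("book_42", 0.9)],
    "date":      lambda q: [("book_42", 0.4), ("book_7", 0.6)],
    "text":      lambda q: [("book_7", 0.5)],
}
print(ensemble_retrieve("a book with a robot character published in 1950", retrievers))
```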
- Dense but Efficient VideoQA for Intricate Compositional Reasoning [9.514382838449928]
We suggest a new VideoQA method based on a transformer with a deformable attention mechanism to address such complex tasks.
The dependency structure within the complex question sentences is also combined with the language embeddings to readily understand the semantic relations among question words.
arXiv Detail & Related papers (2022-10-19T05:01:20Z)
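A small sketch of the second idea in the entry above, combining dependency structure with language embeddings; the dependency edges are assumed to come from an external parser, and the structure-masked self-attention below is a simplification, not the paper's deformable-attention architecture.

```python
# Illustrative sketch; dependency edges are assumed inputs from a parser.
import torch

def structure_aware_question_encoding(token_embs, dep_edges):
    """token_embs: (n, d); dep_edges: list of (head, dependent) index pairs.
    Each word attends only to itself and its dependency neighbors."""
    n, d = token_embs.shape
    mask = torch.eye(n, dtype=torch.bool)
    for h, t in dep_edges:
        mask[h, t] = mask[t, h] = True
    scores = token_embs @ token_embs.T / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ token_embs  # structure-infused embeddings

# Toy usage: 5 question words, 64-d embeddings, a small dependency tree.
print(structure_aware_question_encoding(torch.randn(5, 64),
                                        [(1, 0), (1, 2), (3, 1), (3, 4)]).shape)
```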
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
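A minimal sketch of question-conditioned, level-wise aggregation in the spirit of the entry above; the two granularity levels (objects, then frames) and the dot-product attention are simplifying assumptions, not the paper's full hierarchy.

```python
# Illustrative sketch; levels and attention form are simplified assumptions.
import torch

def conditioned_pool(node_feats, question_vec):
    """Aggregate one granularity level with attention weights conditioned on
    the question. node_feats: (n, d); question_vec: (d,)."""
    weights = torch.softmax(node_feats @ question_vec, dim=0)    # (n,)
    return (weights.unsqueeze(-1) * node_feats).sum(dim=0)       # (d,)

def video_as_graph_hierarchy(object_feats_per_frame, question_vec):
    """Objects -> frame nodes -> a single video node, each step conditioned
    on the question (a two-level stand-in for the paper's hierarchy)."""
    frame_nodes = torch.stack([conditioned_pool(objs, question_vec)
                               for objs in object_feats_per_frame])
    return conditioned_pool(frame_nodes, question_vec)

# Toy usage: 4 frames with 6 detected objects each, 128-d features.
objects = [torch.randn(6, 128) for _ in range(4)]
print(video_as_graph_hierarchy(objects, torch.randn(128)).shape)
```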
- The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval [11.217452391653762]
VISIONE allows users to search for videos using textual keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial relationships, and image similarity.
The peculiarity of our approach is that we encode all the information extracted from the videos using a convenient textual encoding in a single text retrieval engine.
This offers great flexibility when results corresponding to various parts of the query (visual, text and locations) have to be merged.
arXiv Detail & Related papers (2020-08-06T16:32:17Z)
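A toy sketch of VISIONE's surrogate-text idea as summarized in the entry above; a tiny in-memory inverted index stands in for the off-the-shelf text search engine, and the quantization scheme is an invented placeholder.

```python
# Illustrative sketch; the quantization and scoring are toy placeholders.
from collections import defaultdict

def to_visual_words(feature, num_bins=4):
    """Quantize a feature vector (values in [0, 1]) into pseudo-terms that a
    plain text engine can index alongside ordinary keywords."""
    return [f"dim{i}_bin{min(int(v * num_bins), num_bins - 1)}"
            for i, v in enumerate(feature)]

index = defaultdict(set)                   # term -> set of video ids

def index_video(video_id, feature, keywords):
    """Visual features and textual keywords share one textual index."""
    for term in to_visual_words(feature) + keywords:
        index[term].add(video_id)

def search(query_terms):
    """Rank videos by the number of matching query terms (toy scoring)."""
    hits = defaultdict(int)
    for term in query_terms:
        for vid in index.get(term, ()):
            hits[vid] += 1
    return sorted(hits.items(), key=lambda kv: -kv[1])

index_video("v1", [0.1, 0.8, 0.3], ["dog", "beach"])
index_video("v2", [0.9, 0.2, 0.3], ["dog", "park"])
print(search(["dog", "beach", "dim2_bin1"]))
```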
- Message Passing Query Embedding [4.035753155957698]
We propose a graph neural network to encode a graph representation of a query.
We show that the model learns entity embeddings that capture the notion of entity type without explicit supervision.
arXiv Detail & Related papers (2020-02-06T17:40:01Z)
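A minimal PyTorch sketch of message passing over a query graph as described in the entry above; the single shared linear update ignores relation types and is an illustrative simplification of the paper's model.

```python
# Illustrative sketch; relation-specific messages are collapsed into one update.
import torch
import torch.nn as nn

class QueryGraphEncoder(nn.Module):
    def __init__(self, dim, steps=2):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)
        self.steps = steps

    def forward(self, node_embs, edges, target):
        """node_embs: (n, d); edges: list of (src, dst) pairs; target: index of
        the node whose final state serves as the query embedding."""
        h = node_embs
        for _ in range(self.steps):
            msgs = torch.zeros_like(h)
            for src, dst in edges:             # sum messages from neighbors
                msgs[dst] = msgs[dst] + h[src]
            h = torch.tanh(self.update(torch.cat([h, msgs], dim=-1)))
        return h[target]

# Toy query graph: 3 nodes, edges pointing toward the target variable node 2.
enc = QueryGraphEncoder(dim=64)
print(enc(torch.randn(3, 64), edges=[(0, 2), (1, 2)], target=2).shape)
```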
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: a feature encoding network and a query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
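A compact sketch in the spirit of the entry above; a 1-D convolution plus query-aware attention pooling stands in for CHAN's full local self-attention and global attention modules, and all names and dimensions are assumptions.

```python
# Illustrative sketch; local self-attention is omitted for brevity.
import torch
import torch.nn as nn

class QueryAwareShotEncoder(nn.Module):
    def __init__(self, shot_dim, query_dim, hidden=256):
        super().__init__()
        self.conv = nn.Conv1d(shot_dim, hidden, kernel_size=3, padding=1)
        self.query_proj = nn.Linear(query_dim, hidden)

    def forward(self, shots, query):
        """shots: (num_shots, shot_dim); query: (query_dim,).
        Returns per-shot query-relevance weights for summarization."""
        h = torch.relu(self.conv(shots.T.unsqueeze(0))).squeeze(0).T  # (num_shots, hidden)
        q = self.query_proj(query)                                    # (hidden,)
        return torch.softmax(h @ q, dim=0)   # higher weight = more query-relevant shot

# Toy usage: 12 shots with 512-d features and a 300-d query embedding.
enc = QueryAwareShotEncoder(512, 300)
print(enc(torch.randn(12, 512), torch.randn(300)))
```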