Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
- URL: http://arxiv.org/abs/2003.00392v1
- Date: Sun, 1 Mar 2020 03:44:19 GMT
- Title: Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
- Authors: Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu
- Abstract summary: Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
- Score: 72.52804406378023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal retrieval between videos and texts has attracted growing
attention due to the rapid emergence of videos on the web. The current
dominant approach for this problem is to learn a joint embedding space to
measure cross-modal similarities. However, simple joint embeddings are
insufficient to represent complicated visual and textual details, such as
scenes, objects, actions and their compositions. To improve fine-grained
video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model,
which decomposes video-text matching into global-to-local levels. To be
specific, the model disentangles text into a hierarchical semantic graph
with three levels (events, actions, and entities) and the relationships
across levels. Attention-based graph reasoning is utilized to generate hierarchical
textual embeddings, which can guide the learning of diverse and hierarchical
video representations. The HGR model aggregates matchings from different
video-text levels to capture both global and local details. Experimental
results on three video-text datasets demonstrate the advantages of our model.
Such hierarchical decomposition also enables better generalization across
datasets and improves the ability to distinguish fine-grained semantic
differences.
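As a toy illustration of the multi-level matching idea described above (a minimal sketch, not the authors' implementation), the snippet below scores a video-text pair by combining cosine similarities at the event, action, and entity levels. The per-level encoders and the attention-based graph reasoning that would produce these embeddings are assumed to exist upstream.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def hierarchical_score(video_emb, text_emb, weights=(1.0, 1.0, 1.0)):
    """Aggregate video-text similarity over three semantic levels.

    video_emb, text_emb: dicts mapping level name -> embedding vector.
    The level names and uniform weights are illustrative assumptions;
    the paper's model learns how to combine the level-wise matchings.
    """
    levels = ("event", "action", "entity")
    sims = [w * cosine(video_emb[l], text_emb[l])
            for l, w in zip(levels, weights)]
    return sum(sims) / sum(weights)

# Toy usage with random 512-d embeddings for each level.
rng = np.random.default_rng(0)
v = {l: rng.standard_normal(512) for l in ("event", "action", "entity")}
t = {l: rng.standard_normal(512) for l in ("event", "action", "entity")}
print(hierarchical_score(v, t))
```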
Related papers
- Bridging Local Details and Global Context in Text-Attributed Graphs [62.522550655068336]
GraphBridge is a framework that bridges local and global perspectives by leveraging contextual textual information.
Our method achieves state-of-the-art performance, while our graph-aware token reduction module significantly enhances efficiency and addresses scalability issues.
arXiv Detail & Related papers (2024-06-18T13:35:25Z) - Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
The main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit.
We formulate chunk-level matching as n-ary correlation modeling between the words of the query and the frames of the video.
arXiv Detail & Related papers (2024-01-06T09:38:55Z) - GL-RG: Global-Local Representation Granularity for Video Captioning [52.56883051799501]
- GL-RG: Global-Local Representation Granularity for Video Captioning [52.56883051799501]
We propose a GL-RG framework for video captioning, namely Global-Local Representation Granularity.
Our GL-RG demonstrates three advantages over prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder to produce a rich semantic vocabulary for a descriptive granularity of video contents across frames; and 3) we develop an incremental training strategy which organizes model learning in an incremental fashion to incur an optimal captioning behavior.
arXiv Detail & Related papers (2022-05-22T02:00:09Z) - Video as Conditional Graph Hierarchy for Multi-Granular Question
Answering [80.94367625007352]
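A rough sketch of drawing visual representations from different video ranges (advantage 1 above): mean-pool frame features over short and longer windows plus the whole video, then concatenate. The window sizes and pooling scheme here are illustrative assumptions, not GL-RG's actual encoder.

```python
import numpy as np

def multi_range_features(frame_feats, ranges=(4, 16)):
    """Aggregate frame features over multiple temporal ranges.

    frame_feats: (num_frames, d) array of per-frame features.
    For each window size, mean-pool non-overlapping windows and then
    average; finally append a whole-video (global) average.
    """
    outs = []
    for win in ranges:
        n = (len(frame_feats) // win) * win  # trim to a multiple of win
        pooled = frame_feats[:n].reshape(-1, win, frame_feats.shape[1]).mean(axis=1)
        outs.append(pooled.mean(axis=0))
    outs.append(frame_feats.mean(axis=0))  # global range
    return np.concatenate(outs)

rng = np.random.default_rng(0)
feats = rng.standard_normal((64, 128))    # 64 frames, 128-d features
print(multi_range_features(feats).shape)  # (384,)
```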
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z) - Adaptive Hierarchical Graph Reasoning with Semantic Coherence for
Video-and-Language Inference [81.50675020698662]
Video-and-Language Inference is a recently proposed task for joint video-and-language understanding.
We propose an adaptive hierarchical graph network that achieves in-depth understanding of the video over complex interactions.
We introduce semantic coherence learning to explicitly encourage the semantic coherence of the adaptive hierarchical graph network across its three hierarchies.
arXiv Detail & Related papers (2021-07-26T15:23:19Z) - HANet: Hierarchical Alignment Networks for Video-Text Retrieval [15.91922397215452]
Video-text retrieval is an important yet challenging task in vision-language understanding.
Most current works simply measure the video-text similarity based on video-level and text-level embeddings.
We propose a Hierarchical Alignment Network (HANet) to align different level representations for video-text matching.
arXiv Detail & Related papers (2021-07-26T09:28:50Z) - Structure-Augmented Text Representation Learning for Efficient Knowledge
Graph Completion [53.31911669146451]
Human-curated knowledge graphs provide critical supportive information to various natural language processing tasks.
These graphs are usually incomplete, which motivates their automatic completion.
Graph embedding approaches, e.g., TransE, learn structured knowledge by representing graph elements as dense embeddings.
Textual encoding approaches, e.g., KG-BERT, resort to the text of graph triples and triple-level contextualized representations.
arXiv Detail & Related papers (2020-04-30T13:50:34Z)