Modeling Semantic Composition with Syntactic Hypergraph for Video
Question Answering
- URL: http://arxiv.org/abs/2205.06530v1
- Date: Fri, 13 May 2022 09:28:13 GMT
- Title: Modeling Semantic Composition with Syntactic Hypergraph for Video
Question Answering
- Authors: Zenan Xu, Wanjun Zhong, Qinliang Su, Zijing Ou and Fuwei Zhang
- Abstract summary: Key challenge in video question answering is how to realize the cross-modal semantic alignment between textual concepts and corresponding visual objects.
We propose to first build a syntactic dependency tree for each question with an off-the-shelf tool.
Based on the extracted compositions, a hypergraph is further built by viewing the words as nodes and the compositions as hyperedges.
- Score: 14.033438649614219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key challenge in video question answering is how to realize the cross-modal
semantic alignment between textual concepts and corresponding visual objects.
Existing methods mostly seek to align the word representations with the video
regions. However, word representations are often not able to convey a complete
description of textual concepts, which are in general described by the
compositions of certain words. To address this issue, we propose to first build
a syntactic dependency tree for each question with an off-the-shelf tool and
use it to guide the extraction of meaningful word compositions. Based on the
extracted compositions, a hypergraph is further built by viewing the words as
nodes and the compositions as hyperedges. Hypergraph convolutional networks
(HCN) are then employed to learn the initial representations of word
compositions. Afterwards, an optimal transport based method is proposed to
perform cross-modal semantic alignment for the textual and visual semantic
space. To reflect the cross-modal influences, the cross-modal information is
incorporated into the initial representations, leading to a model named
cross-modality-aware syntactic HCN. Experimental results on three benchmarks
show that our method outperforms all strong baselines. Further analyses
demonstrate the effectiveness of each component, and show that our model is
good at modeling different levels of semantic compositions and filtering out
irrelevant information.
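
To make the construction concrete, the sketch below builds a question hypergraph from a dependency parse, using spaCy as an example off-the-shelf parser. The composition rule used here (each head word grouped with its direct dependents) is an illustrative assumption; the paper only states that the dependency tree guides the extraction of word compositions.

```python
# Minimal sketch (not the authors' code): words become nodes and
# dependency-guided word compositions become hyperedges of a hypergraph,
# represented by its |V| x |E| incidence matrix.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")  # off-the-shelf dependency parser

def build_question_hypergraph(question):
    """Return the |V| x |E| incidence matrix of the question hypergraph."""
    doc = nlp(question)
    # Assumed composition rule: a head word together with its direct
    # dependents forms one composition (one hyperedge).
    compositions = [[tok.i] + [c.i for c in tok.children]
                    for tok in doc if list(tok.children)]
    H = np.zeros((len(doc), len(compositions)), dtype=np.float32)
    for e, word_ids in enumerate(compositions):
        H[word_ids, e] = 1.0  # word i participates in composition e
    return H

H = build_question_hypergraph("What is the man holding in his left hand?")
print(H.shape)  # (number of words, number of compositions)
```

A standard hypergraph convolution layer can then aggregate word features into composition representations over this incidence matrix. The sketch below follows the generic HGNN formulation (Feng et al., 2019) with unit hyperedge weights as a stand-in; it is not the paper's cross-modality-aware syntactic HCN, which additionally injects visual information into the initial representations.

```python
# Generic hypergraph convolution sketch: propagate word features to
# hyperedges (compositions) and back, then apply a learned transform.
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """One hypergraph convolution layer in the style of HGNN (Feng et al., 2019)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, X, H):
        # X: (|V|, in_dim) word features; H: (|V|, |E|) incidence matrix
        Dv = H.sum(dim=1).clamp(min=1.0)       # node (word) degrees
        De = H.sum(dim=0).clamp(min=1.0)       # hyperedge (composition) degrees
        Xn = X / Dv.sqrt().unsqueeze(1)
        E = (H.t() @ Xn) / De.unsqueeze(1)     # words -> composition features
        Xn = (H @ E) / Dv.sqrt().unsqueeze(1)  # compositions -> words
        return torch.relu(self.theta(Xn)), E   # updated word reps, composition reps

# Toy usage: 4 words, 2 compositions, 300-d word embeddings.
H_toy = torch.tensor([[1., 0.], [1., 1.], [0., 1.], [1., 0.]])
word_reps, comp_reps = HypergraphConv(300, 128)(torch.randn(4, 300), H_toy)
```

For the alignment step, the paper proposes an optimal transport based method between the textual and visual semantic spaces. The hedged sketch below computes an entropic transport plan between composition vectors and video-region vectors with plain Sinkhorn iterations; the cosine-based cost, uniform marginals, and regularisation value are assumptions, not the paper's reported settings.

```python
# Entropic optimal transport via Sinkhorn iterations (illustrative only).
import torch
import torch.nn.functional as F

def sinkhorn_alignment(T, V, eps=0.1, n_iters=50):
    """OT plan between composition vectors T (m, d) and region vectors V (n, d)."""
    cost = 1.0 - F.normalize(T, dim=1) @ F.normalize(V, dim=1).t()  # assumed cosine cost
    K = torch.exp(-cost / eps)                                      # Gibbs kernel
    a = torch.full((T.size(0),), 1.0 / T.size(0), device=T.device)  # uniform masses
    b = torch.full((V.size(0),), 1.0 / V.size(0), device=V.device)
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iters):                                        # Sinkhorn updates
        u = a / (K @ v)
        v = b / (K.t() @ u)
    P = u.unsqueeze(1) * K * v.unsqueeze(0)                         # transport plan (m, n)
    return P, (P * cost).sum()                                      # plan and alignment cost

# Example: align 5 composition vectors with 20 video-region vectors.
P, align_cost = sinkhorn_alignment(torch.randn(5, 256), torch.randn(20, 256))
```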
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding [143.5927158318524]
Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We argue that the structured semantics inherent in videos and language is the crucial factor in achieving compositional generalization.
arXiv Detail & Related papers (2023-01-22T08:02:23Z)
- Boosting Video-Text Retrieval with Explicit High-Level Semantics [115.66219386097295]
We propose a novel visual-linguistic aligning model named HiSE for VTR.
It improves the cross-modal representation by incorporating explicit high-level semantics.
Our method achieves superior performance over state-of-the-art methods on three benchmark datasets.
arXiv Detail & Related papers (2022-08-08T15:39:54Z)
- An Empirical Study on Leveraging Position Embeddings for Target-oriented Opinion Words Extraction [13.765146062545048]
Target-oriented opinion words extraction (TOWE) is a new subtask of target-oriented sentiment analysis.
We show that BiLSTM-based models can effectively encode position information into word representations.
We also adapt a graph convolutional network (GCN) to enhance word representations by incorporating syntactic information.
arXiv Detail & Related papers (2021-09-02T22:49:45Z)
- Multiplex Graph Neural Network for Extractive Text Summarization [34.185093491514394]
Extractive text summarization aims at extracting the most representative sentences from a given document as its summary.
We propose a novel Multiplex Graph Convolutional Network (Multi-GCN) to jointly model different types of relationships among sentences and words.
Based on Multi-GCN, we propose a Multiplex Graph Summarization (Multi-GraS) model for extractive text summarization.
arXiv Detail & Related papers (2021-08-29T16:11:01Z)
- Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
arXiv Detail & Related papers (2021-08-14T04:00:42Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embeddings with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.