TG-VQA: Ternary Game of Video Question Answering
- URL: http://arxiv.org/abs/2305.10049v2
- Date: Thu, 18 May 2023 06:27:06 GMT
- Title: TG-VQA: Ternary Game of Video Question Answering
- Authors: Hao Li, Peng Jin, Zesen Cheng, Songyang Zhang, Kai Chen, Zhennan Wang,
Chang Liu, Jie Chen
- Abstract summary: Video question answering aims to answer a question about video content by reasoning about the alignment semantics between the two.
In this work, we innovatively resort to game theory, which can simulate complicated relationships among multiple players with specific interaction strategies.
Specifically, we carefully design an interaction strategy tailored to the characteristics of VideoQA, which can mathematically generate fine-grained visual-linguistic alignment labels without label-intensive effort.
- Score: 33.180788803602084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video question answering aims to answer a question about video content
by reasoning about the alignment semantics between the two. However, because they
rely heavily on human instructions, i.e., annotations or priors, current contrastive
learning-based VideoQA methods still struggle to perform fine-grained
visual-linguistic alignment. In this work, we innovatively resort to game
theory, which can simulate complicated relationships among multiple players
with specific interaction strategies, e.g., video, question, and answer as
ternary players, to achieve fine-grained alignment for the VideoQA task.
Specifically, we carefully design an interaction strategy tailored to the
characteristics of VideoQA, which can mathematically generate fine-grained
visual-linguistic alignment labels without label-intensive effort.
Our TG-VQA outperforms the existing state of the art by a large margin (more than
5%) on long-term and short-term VideoQA datasets, verifying its effectiveness
and generalization ability. Thanks to the guidance of game-theoretic
interaction, our model converges well on limited data ($10^4$ videos),
surpassing most models pre-trained on large-scale data ($10^7$ videos).
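The abstract does not spell out the interaction strategy itself, but the core idea, treating video clips, question tokens, and the answer as players whose game-theoretic interaction yields fine-grained alignment labels, can be sketched as below. This is a minimal, hypothetical illustration rather than the authors' formulation: the cosine-similarity payoff, the Shapley-style interaction term, and the softmax normalization are all placeholder assumptions.

```python
# Hypothetical sketch, not the authors' exact formulation: treat a video clip,
# a question token, and the answer as three "players", score how much a
# (clip, token) pair gains by cooperating with the answer, and normalize the
# scores into soft fine-grained alignment labels for a contrastive objective.
import torch
import torch.nn.functional as F


def payoff(members):
    """Coalition payoff: norm of the mean of the members' unit features.
    A placeholder stand-in for a learned payoff function."""
    return torch.stack(members).mean(dim=0).norm()


def ternary_interaction(clip, token, answer):
    """Shapley-style interaction of the (clip, token) pair given the answer:
    v({c,t,a}) - v({c,a}) - v({t,a}) + v({a})."""
    return (payoff([clip, token, answer])
            - payoff([clip, answer])
            - payoff([token, answer])
            + payoff([answer]))


def alignment_labels(clips, tokens, answer):
    """Soft alignment label for every (clip, token) pair; labels sum to 1."""
    clips = F.normalize(clips, dim=-1)
    tokens = F.normalize(tokens, dim=-1)
    answer = F.normalize(answer, dim=-1)
    rows = [torch.stack([ternary_interaction(c, t, answer) for t in tokens])
            for c in clips]
    scores = torch.stack(rows)                      # (num_clips, num_tokens)
    return torch.softmax(scores.flatten(), dim=0).reshape(scores.shape)


# Toy usage with random features: 8 clips, 12 question tokens, 256-d embeddings.
clips, tokens, ans = torch.randn(8, 256), torch.randn(12, 256), torch.randn(256)
labels = alignment_labels(clips, tokens, ans)       # shape (8, 12)
print(labels.shape, labels.sum())                   # sum is ~1.0
```

Such labels could then weight a standard contrastive loss over clip-token pairs, which is one way the "alignment label without label-intensive effort" claim can be read; the paper's actual objective may differ.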
Related papers
- Multi-object event graph representation learning for Video Question Answering [4.236280446793381]
We propose a contrastive language event graph representation learning method called CLanG to address this limitation.
Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA benchmarks, NExT-QA and TGIF-QA-R.
arXiv Detail & Related papers (2024-09-12T04:42:51Z)
- Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports [90.79212954022218]
We introduce the first dataset, named Sports-QA, specifically designed for the sports VideoQA task.
Sports-QA dataset includes various types of questions, such as descriptions, chronologies, causalities, and counterfactual conditions.
We propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering.
arXiv Detail & Related papers (2024-01-03T02:22:34Z)
- ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos [120.80589215132322]
We present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over challenging untrimmed videos from ActivityNet.
ANetQA comprises 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos.
The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.
arXiv Detail & Related papers (2023-05-04T03:04:59Z)
- Contrastive Video Question Answering via Video Graph Transformer [184.3679515511028]
We propose a Video Graph Transformer model (CoVGT) to perform video question answering (VideoQA) in a contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT achieves much better performance than previous methods on video reasoning tasks.
arXiv Detail & Related papers (2023-02-27T11:09:13Z)
- Invariant Grounding for Video Question Answering [72.87173324555846]
Video Question Answering (VideoQA) is the task of answering questions about a video.
In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers.
We propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene.
arXiv Detail & Related papers (2022-06-06T04:37:52Z)
- Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analyses of VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)