Large Language Models are Temporal and Causal Reasoners for Video
Question Answering
- URL: http://arxiv.org/abs/2310.15747v2
- Date: Mon, 6 Nov 2023 12:20:33 GMT
- Title: Large Language Models are Temporal and Causal Reasoners for Video
Question Answering
- Authors: Dohwan Ko, Ji Soo Lee, Wooyoung Kang, Byungseok Roh, Hyunwoo J. Kim
- Abstract summary: Large Language Models (LLMs) have shown remarkable performances on a wide range of natural language understanding and generation tasks.
We propose a novel framework, Flipped-VQA, encouraging the model to predict all the combinations of the $\langle$V, Q, A$\rangle$ triplet.
Flipped-VQA not only enhances the exploitation of linguistic shortcuts but also mitigates the linguistic bias, which causes incorrect answers that over-rely on the question.
- Score: 16.722148605611146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown remarkable performances on a wide
range of natural language understanding and generation tasks. We observe that
the LLMs provide effective priors in exploiting $\textit{linguistic shortcuts}$
for temporal and causal reasoning in Video Question Answering (VideoQA).
However, such priors often cause suboptimal results on VideoQA by leading the
model to over-rely on questions, $\textit{i.e.}$, $\textit{linguistic bias}$,
while ignoring visual content. This is also known as `ungrounded guesses' or
`hallucinations'. To address this problem while leveraging LLMs' prior on
VideoQA, we propose a novel framework, Flipped-VQA, encouraging the model to
predict all the combinations of $\langle$V, Q, A$\rangle$ triplet by flipping
the source pair and the target label to understand their complex relationships,
$\textit{i.e.}$, predict A, Q, and V given VQ, VA, and QA pairs,
respectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to
LLaMA, and it outperforms both LLM-based and non-LLM-based models on five
challenging VideoQA benchmarks. Furthermore, our Flipped-VQA is a general
framework that is applicable to various LLMs (OPT and GPT-J) and consistently
improves their performances. We empirically demonstrate that Flipped-VQA not
only enhances the exploitation of linguistic shortcuts but also mitigates the
linguistic bias, which causes incorrect answers that over-rely on the question.
Code is available at https://github.com/mlvlab/Flipped-VQA.
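To make the flipped objective above concrete, here is a minimal PyTorch-style sketch, assuming a causal LM wrapper whose call returns an object with a `.loss` field and token tensors for V, Q, and A; it illustrates the three prediction directions and is not the released implementation.

```python
import torch

def flipped_vqa_loss(model, video_tokens, question_tokens, answer_tokens):
    """Sketch of the Flipped-VQA objective: predict each element of the
    <V, Q, A> triplet from the other two. `model(inputs, labels=...)` is an
    assumed interface returning an autoregressive LM loss."""
    # VQ -> A: the standard VideoQA target.
    loss_a = model(torch.cat([video_tokens, question_tokens], dim=1),
                   labels=answer_tokens).loss
    # VA -> Q: predict the question from the video and the answer.
    loss_q = model(torch.cat([video_tokens, answer_tokens], dim=1),
                   labels=question_tokens).loss
    # QA -> V: predict the video (treated here as discrete tokens for
    # simplicity; visual targets may instead be handled at the feature level).
    loss_v = model(torch.cat([question_tokens, answer_tokens], dim=1),
                   labels=video_tokens).loss
    return loss_a + loss_q + loss_v  # joint objective over all three directions
```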
Related papers
- A Simple LLM Framework for Long-Range Video Question-Answering [63.50439701867275]
We present LLoVi, a language-based framework for long-range video question-answering (LVQA).
Our approach uses a frame/clip-level visual captioner coupled with a Large Language Model (GPT-3.5, GPT-4).
Our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain).
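A rough caption-then-reason sketch of this recipe, with assumed `caption_frame` and `ask_llm` callables (not the LLoVi code itself):

```python
def answer_long_video_question(frames, question, caption_frame, ask_llm):
    """Caption-then-reason sketch: caption each sampled frame/clip, then let an
    LLM (e.g. GPT-3.5/GPT-4) answer over the concatenated captions."""
    captions = [f"[clip {i}] {caption_frame(frame)}" for i, frame in enumerate(frames)]
    prompt = ("Clip-by-clip captions of a long video:\n"
              + "\n".join(captions)
              + f"\n\nQuestion: {question}\nAnswer concisely:")
    return ask_llm(prompt)
```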
arXiv Detail & Related papers (2023-12-28T18:58:01Z)
- Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts [22.669502403623166]
We present Reasoning Question Prompts for VQA tasks, which can further activate the potential of Large Language Models.
We generate self-contained questions as reasoning question prompts via an unsupervised question edition module.
Each reasoning question prompt clearly indicates the intent of the original question.
Then, the candidate answers, together with their confidence scores, are fed into the LLMs.
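One plausible way to assemble such a prompt from the reasoning questions and the scored candidate answers is sketched below; the template and helper names are hypothetical, not the paper's exact format.

```python
def build_reasoning_prompt(original_question, reasoning_questions, candidates):
    """`candidates` is a list of (answer, confidence) pairs from a VQA model."""
    lines = [f"Question: {original_question}",
             "Self-contained reasoning questions that clarify its intent:"]
    lines += [f"- {q}" for q in reasoning_questions]
    lines.append("Candidate answers with confidence scores:")
    lines += [f"- {ans} ({conf:.2f})" for ans, conf in candidates]
    lines.append("Given the above, the most likely answer is:")
    return "\n".join(lines)  # fed to the LLM as a single text prompt
```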
arXiv Detail & Related papers (2023-11-15T15:40:46Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and increases of 6.41% and 7.94% points on A-OKVQA and VizWiz, respectively.
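A gradient-free rephrase-and-answer loop in the spirit of RepARe might look like the sketch below; `vlm_describe`, `llm_rephrase`, and `vlm_answer` are assumed helpers, and the candidate-selection rule is a simplification.

```python
def rephrase_augment_answer(image, question, vlm_describe, llm_rephrase, vlm_answer,
                            num_candidates=3):
    """Gradient-free sketch: extract salient image details, rephrase the question
    with them, and keep the highest-scoring answer among the candidates."""
    details = vlm_describe(image, question)                 # salient details from the VLM
    rephrased = [llm_rephrase(question, details) for _ in range(num_candidates)]
    scored = [vlm_answer(image, q) for q in [question] + rephrased]  # (answer, score) pairs
    return max(scored, key=lambda pair: pair[1])[0]
```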
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- Can I Trust Your Answer? Visually Grounded Video Question Answering [88.11169242115416]
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding.
We construct NExT-GQA, an extension of NExT-QA with 10.5K temporal grounding labels tied to the original QA pairs.
arXiv Detail & Related papers (2023-09-04T03:06:04Z)
- From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models [111.42052290293965]
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks.
End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive.
We propose Img2Prompt, a plug-and-play module that provides prompts to bridge the aforementioned modality and task disconnections.
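A minimal sketch of such a caption-to-prompt module, assuming generic `captioner` and `qa_generator` helpers (not the released Img2Prompt code):

```python
def image_to_prompt(image, question, captioner, qa_generator):
    """Turn an image into a pure-text prompt that a frozen LLM can consume."""
    caption = captioner(image)                # describe the image in text
    exemplars = qa_generator(caption)         # synthetic (question, answer) pairs about the caption
    demo = "\n".join(f"Question: {q} Answer: {a}" for q, a in exemplars)
    return f"Context: {caption}\n{demo}\nQuestion: {question} Answer:"
```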
arXiv Detail & Related papers (2022-12-21T08:39:36Z)
- Invariant Grounding for Video Question Answering [72.87173324555846]
Video Question Answering (VideoQA) is the task of answering questions about a video.
In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers.
We propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene.
arXiv Detail & Related papers (2022-06-06T04:37:52Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
- Video Question Answering with Phrases via Semantic Roles [40.72894813542082]
Video Question Answering (VidQA) evaluation metrics have been limited to single-word answers or to selecting a phrase from a fixed set of phrases.
We introduce VidQAP, which poses VidQA as a fill-in-the-phrase task by leveraging semantic roles derived from video descriptions to mask out certain phrases.
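A toy illustration of turning a described video into a fill-in-the-phrase query by masking one semantic-role span; the role labels and masking format are assumptions, not the VidQAP pipeline.

```python
def make_fill_in_the_phrase(description, role_spans, role_to_mask):
    """`role_spans` maps semantic roles (e.g. 'ARG0', 'V', 'ARG1') to phrases in `description`."""
    target = role_spans[role_to_mask]
    query = description.replace(target, f"<Q-{role_to_mask}>", 1)
    return query, target  # the model must fill the masked phrase given the video

# Example: masking ARG1 in "a man throws a red ball to a dog" yields
# ("a man throws <Q-ARG1> to a dog", "a red ball").
```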
arXiv Detail & Related papers (2021-04-08T13:27:43Z)