Language Models are Causal Knowledge Extractors for Zero-shot Video
Question Answering
- URL: http://arxiv.org/abs/2304.03754v1
- Date: Fri, 7 Apr 2023 17:45:49 GMT
- Title: Language Models are Causal Knowledge Extractors for Zero-shot Video
Question Answering
- Authors: Hung-Ting Su, Yulei Niu, Xudong Lin, Winston H. Hsu, Shih-Fu Chang
- Abstract summary: Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video.
We propose a novel framework, Causal Knowledge Extraction from Language Models (CaKE-LM), leveraging causal commonsense knowledge from language models to tackle CVidQA.
CaKE-LM significantly outperforms conventional methods by 4% to 6% in zero-shot CVidQA accuracy on the NExT-QA and Causal-VidQA datasets.
- Score: 60.93164850492871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Causal Video Question Answering (CVidQA) queries not only association or
temporal relations but also causal relations in a video. Existing question
synthesis methods pre-trained question generation (QG) systems on reading
comprehension datasets with text descriptions as inputs. However, QG models
only learn to ask association questions (e.g., "what is someone doing...")
and result in inferior performance due to the poor transfer of association
knowledge to CVidQA, which focuses on causal questions like "why is someone
doing ...". Observing this, we propose to exploit causal knowledge to
generate question-answer pairs and introduce a novel framework, Causal
Knowledge Extraction from Language Models (CaKE-LM), leveraging causal
commonsense knowledge from language models to tackle CVidQA. To extract
knowledge from LMs, CaKE-LM generates causal questions containing two events
with one triggering another (e.g., "score a goal" triggers "soccer player
kicking ball") by prompting the LM with the action (soccer player kicking ball) to
retrieve the intention (to score a goal). CaKE-LM significantly outperforms
conventional methods by 4% to 6% in zero-shot CVidQA accuracy on the NExT-QA and
Causal-VidQA datasets. We also conduct comprehensive analyses and provide key
findings for future research.
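To make the extraction step concrete, the sketch below prompts a causal LM with an observed action to elicit an intention and then assembles a causal question-answer pair. The choice of GPT-2, the prompt suffix, and the question template are illustrative assumptions, not the paper's released configuration.

```python
# Minimal sketch of CaKE-LM-style causal knowledge extraction. GPT-2 and
# the exact prompt/template wording are assumptions for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def extract_causal_qa(action: str) -> dict:
    # Prompt the LM with an observed action to retrieve a plausible
    # intention, e.g. "the soccer player kicking the ball" -> "score a goal".
    prompt = f"{action} in order to"
    out = generator(prompt, max_new_tokens=8, do_sample=False)
    intention = out[0]["generated_text"][len(prompt):].strip()
    # Turn the (action, intention) pair into a causal question-answer pair
    # usable for zero-shot CVidQA.
    return {"question": f"Why is {action}?", "answer": f"to {intention}"}

print(extract_causal_qa("the soccer player kicking the ball"))
```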
Related papers
- Improving Zero-shot Visual Question Answering via Large Language Models
with Reasoning Question Prompts [22.669502403623166]
We present Reasoning Question Prompts for VQA tasks, which can further activate the potential of Large Language Models.
We generate self-contained questions as reasoning question prompts via an unsupervised question editing module.
Each reasoning question prompt clearly indicates the intent of the original question.
Then, the candidate answers, together with their confidence scores acting as answer heuristics, are fed into LLMs.
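A rough sketch of how such an input might be assembled; the layout and field names below are assumptions, not the paper's exact prompt format.

```python
# Hypothetical assembly of a reasoning-question prompt for an LLM.
def build_rqp_prompt(original_q, reasoning_qs, candidates):
    """candidates: list of (answer, confidence) pairs from a base VQA model."""
    lines = [f"Original question: {original_q}"]
    # Self-contained reasoning questions spell out the original intent.
    for i, rq in enumerate(reasoning_qs, 1):
        lines.append(f"Reasoning question {i}: {rq}")
    # Candidate answers plus confidences act as answer heuristics.
    lines.append("Candidate answers (confidence):")
    lines += [f"- {a} ({c:.2f})" for a, c in candidates]
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_rqp_prompt(
    "What is it?",
    ["What is the object that the man is holding over his head?"],
    [("umbrella", 0.62), ("kite", 0.21)],
)
```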
arXiv Detail & Related papers (2023-11-15T15:40:46Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can yield absolute increases in zero-shot accuracy of 3.85% on VQAv2, 6.41% on A-OKVQA, and 7.94% on VizWiz.
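One plausible shape of the loop, with hypothetical stand-in callables for the model components (the paper's actual selection criterion and prompts may differ):

```python
# Hedged sketch of a RepARe-style, gradient-free question rewrite.
# `vlm_caption`, `llm_rewrite`, and `vlm_confidence` are hypothetical
# stand-ins for the underlying vision-language model and a rewriter.
def repare(image, question, vlm_caption, llm_rewrite, vlm_confidence, n=3):
    # Extract salient details about the image from the underlying VLM.
    details = vlm_caption(image)
    # Propose several rewrites that fold those details into the question.
    candidates = [llm_rewrite(question, details) for _ in range(n)]
    # Gradient-free selection: keep the question the VLM is most confident
    # about (the original question stays in the pool as a fallback).
    return max(candidates + [question], key=lambda q: vlm_confidence(image, q))
```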
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - Event Extraction as Question Generation and Answering [72.04433206754489]
Recent work on Event Extraction has reframed the task as Question Answering (QA).
We propose QGA-EE, which enables a Question Generation (QG) model to generate questions that incorporate rich contextual information instead of using fixed templates.
Experiments show that QGA-EE outperforms all prior single-task-based models on the ACE05 English dataset.
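A minimal sketch of the idea, assuming hypothetical `qg_model` and `qa_model` stand-ins rather than the paper's released models:

```python
# Sketch of QGA-EE-style event argument extraction under stated assumptions.
def extract_argument(sentence, trigger, role, qg_model, qa_model):
    # Dynamic QG: the generated question mentions the actual trigger and
    # context rather than coming from a fixed template, e.g.
    # "Who was attacked in the incident described by 'bombed'?"
    question = qg_model(f"role: {role} | trigger: {trigger} | context: {sentence}")
    # A QA model then reads the sentence and extracts the argument span.
    return qa_model(question=question, context=sentence)
```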
arXiv Detail & Related papers (2023-07-10T01:46:15Z) - Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge
Graph Question Answering [7.888547093390469]
Large Language Models (LLMs) are capable of performing zero-shot closed-book question answering tasks.
We propose to augment the knowledge directly in the input of LLMs.
Our framework, Knowledge-Augmented language model PromptING (KAPING), requires no model training and is thus completely zero-shot.
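A minimal sketch of this style of input augmentation, with assumed prompt wording and pre-retrieved triples standing in for the retriever:

```python
# Sketch of KAPING-style prompt augmentation: facts retrieved from a
# knowledge graph are verbalized and prepended to the question, so the
# LLM needs no training. Wording and retrieval are assumptions here.
def kaping_prompt(question, triples):
    """triples: (subject, relation, object) facts retrieved for the question."""
    facts = "\n".join(f"({s}, {r}, {o})" for s, r, o in triples)
    return (f"Facts relevant to the question:\n{facts}\n"
            f"Question: {question}\nAnswer:")

print(kaping_prompt(
    "Which country is the Eiffel Tower located in?",
    [("Eiffel Tower", "located in", "Paris"),
     ("Paris", "country", "France")],
))
```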
arXiv Detail & Related papers (2023-06-07T04:15:21Z) - Search-in-the-Chain: Interactively Enhancing Large Language Models with
Search for Knowledge-intensive Tasks [121.74957524305283]
This paper proposes a novel framework named Search-in-the-Chain (SearChain) for the interaction between Information Retrieval (IR) and Large Language Models (LLMs).
Experiments show that SearChain outperforms state-of-the-art baselines on complex knowledge-intensive tasks.
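One plausible shape of such an interaction loop, heavily simplified; `llm` and `search` are hypothetical stand-ins, and the paper's chain-of-query verification is richer than this:

```python
# Hedged sketch of an IR-LLM interaction loop in the spirit of SearChain.
def searchain(question, llm, search, max_steps=5):
    solved = []
    for _ in range(max_steps):
        # The LLM proposes the next sub-query in its reasoning chain,
        # or declares a final answer.
        step = llm(f"Question: {question}\nSolved so far: {solved}\n"
                   "Reply with the next sub-query, or 'FINAL: <answer>'.")
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        # IR fetches evidence so the sub-answer can be verified or
        # corrected before it enters the chain.
        evidence = search(step)
        solved.append((step, llm(f"Evidence: {evidence}\nAnswer: {step}")))
    return None
```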
arXiv Detail & Related papers (2023-04-28T10:15:25Z) - Prophet: Prompting Large Language Models with Complementary Answer
Heuristics for Knowledge-based Visual Question Answering [30.858737348472626]
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question.
Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering.
We present a conceptually simple, flexible, and general framework designed to prompt an LLM with answer heuristics for knowledge-based VQA.
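An illustrative sketch of an answer-heuristics prompt in this spirit; the layout, the use of captions as context, and the exemplar format are assumptions rather than the paper's exact design:

```python
# Sketch of a Prophet-style prompt: a trained VQA model first produces
# answer candidates, which are handed to the LLM as heuristics.
def prophet_prompt(caption, question, candidates, exemplars=()):
    """candidates: [(answer, score), ...] from a vanilla VQA model;
    exemplars: solved (caption, question, candidates, answer) tuples whose
    candidates resemble the test case (answer-aware example selection)."""
    def block(cap, q, cands, ans):
        c = ", ".join(f"{a} ({s:.2f})" for a, s in cands)
        return f"Context: {cap}\nQuestion: {q}\nCandidates: {c}\nAnswer: {ans}"
    shots = [block(*ex) for ex in exemplars]
    # The test case ends with an empty answer slot for the LLM to fill.
    shots.append(block(caption, question, candidates, "").rstrip())
    return "\n\n".join(shots)
```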
arXiv Detail & Related papers (2023-03-03T13:05:15Z) - Invariant Grounding for Video Question Answering [72.87173324555846]
Video Question Answering (VideoQA) is the task of answering questions about a video.
In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers.
We propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene.
arXiv Detail & Related papers (2022-06-06T04:37:52Z) - Improving Unsupervised Question Answering via Summarization-Informed
Question Generation [47.96911338198302]
Question Generation (QG) is the task of generating a plausible question for a <passage, answer> pair.
We make use of freely available news summary data, transforming declarative sentences into appropriate questions using dependency parsing, named entity recognition and semantic role labeling.
The resulting questions are then combined with the original news articles to train an end-to-end neural QG model.
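A toy version of the declarative-to-question step, using spaCy for dependency parsing and NER; the actual pipeline also uses semantic role labeling and far richer rules than this single subject rule:

```python
# Toy declarative-to-question transformation via dependency parsing and NER.
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_to_question(sentence: str):
    doc = nlp(sentence)
    for tok in doc:
        if tok.dep_ == "nsubj":  # the syntactic subject becomes the answer
            subj = " ".join(t.text for t in tok.subtree)
            # NER decides the wh-word: PERSON -> "Who", otherwise "What".
            wh = "Who" if tok.ent_type_ == "PERSON" else "What"
            rest = sentence.replace(subj, "", 1).strip().rstrip(".")
            return f"{wh} {rest}?", subj
    return None

print(sentence_to_question("Einstein developed the theory of relativity."))
# -> ('Who developed the theory of relativity?', 'Einstein')
```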
arXiv Detail & Related papers (2021-09-16T13:08:43Z) - NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)