Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach
- URL: http://arxiv.org/abs/2511.17618v1
- Date: Tue, 18 Nov 2025 13:45:50 GMT
- Title: Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach
- Authors: Ju-Young Oh,
- Abstract summary: We introduce Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach (FIQ), a framework designed to enhance the reasoning capability of VQA models by improving their foundational comprehension of video content. Experimental results on the SUTD-TrafficQA dataset demonstrate that FIQ achieves state-of-the-art performance, surpassing existing baseline approaches.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional video question answering (VQA) approaches primarily rely on question-answer (Q&A) pairs to learn the spatio-temporal dynamics of video content. However, most existing annotations are event-centric, which restricts the model's ability to capture the comprehensive context of a scene. The lack of fundamental information such as object categories, spatial configurations, and descriptive visual attributes prevents the model from forming a complete understanding of the environment, ultimately limiting its generalization and reasoning capability. In this paper, we introduce Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach (FIQ), a framework designed to enhance the reasoning capability of VQA models by improving their foundational comprehension of video content. FIQ generates Q&A pairs from descriptive information extracted directly from videos, thereby enriching the dataset with core scene-level attributes. These generated pairs help the model develop a more holistic understanding of the video, leading to improved generalizability and reasoning performance. In addition, we propose a VQ-CAlign module that aligns task-specific question embeddings with corresponding visual features, preserving essential contextual cues and enhancing adaptability to downstream tasks. Experimental results on the SUTD-TrafficQA dataset demonstrate that FIQ achieves state-of-the-art performance, surpassing existing baseline approaches.
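The abstract does not spell out the internals of VQ-CAlign, but a cross-attention layer in which question embeddings attend to visual features is a natural reading of "aligns task-specific question embeddings with corresponding visual features." Below is a minimal sketch under that assumption; the dimensions, residual design, and the class name are illustrative, not the authors' published implementation.

```python
# A minimal sketch of a VQ-CAlign-style alignment module, assuming it is
# implemented as cross-attention from question embeddings to visual features.
import torch
import torch.nn as nn

class VQCAlignSketch(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Question tokens act as queries; video features act as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # question: (B, Lq, D) task-specific question embeddings
        # video:    (B, Lv, D) per-frame (or per-region) visual features
        aligned, _ = self.cross_attn(question, video, video)
        # Residual connection preserves the original question context.
        return self.norm(question + aligned)

# Usage: align 20-token questions with 32 frames of 512-d features.
q = torch.randn(2, 20, 512)
v = torch.randn(2, 32, 512)
print(VQCAlignSketch()(q, v).shape)  # torch.Size([2, 20, 512])
```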
Related papers
- FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering [26.585985828583304]
Video question answering (VQA) is a task that requires the interpretation of a video to answer a given question. We propose a novel approach designed to strengthen the reasoning ability of the model by enhancing the fundamental understanding of videos.
arXiv Detail & Related papers (2025-07-17T06:19:38Z)
- ImplicitQA: Going beyond frames towards Implicit Video Reasoning [39.63171940350552]
ImplicitQA is a novel benchmark designed to test VideoQA models on human-like implicit reasoning. ImplicitQA comprises 1K meticulously annotated QA pairs drawn from 1K high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z)
- REVEAL: Relation-based Video Representation Learning for Video-Question-Answering [14.867263291053968]
We propose RElation-based rEpresentAtion Learning (REVEAL) to capture visual relation information. Inspired by spatio-temporal scene graphs, we encode video sequences as sets of relation triplets in the form of (subject-predicate-object) over time via their language embeddings. We evaluate the proposed framework on five challenging benchmarks: NExT-QA, Intent-QA, STAR, VLEP, and TVQA.
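A minimal sketch of the triplet-encoding idea described above, using an off-the-shelf sentence encoder as a stand-in for the paper's language embeddings; the encoder choice, the verbalization template, and the example triplets are assumptions for illustration.

```python
# Encode (subject-predicate-object) triplets per timestep as language
# embeddings, in the spirit of REVEAL's relation-based representation.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# One set of relation triplets per timestep of the video (hypothetical data).
triplets_over_time = [
    [("car", "approaches", "intersection"), ("pedestrian", "waits at", "curb")],
    [("car", "stops at", "crosswalk"), ("pedestrian", "crosses", "road")],
]

for t, triplets in enumerate(triplets_over_time):
    # Verbalize each triplet so the text encoder can embed it.
    sentences = [f"{s} {p} {o}" for s, p, o in triplets]
    embeddings = encoder.encode(sentences)  # (num_triplets, 384)
    print(f"t={t}: {embeddings.shape}")
```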
arXiv Detail & Related papers (2025-04-07T19:54:04Z)
- Towards Fine-Grained Video Question Answering [17.582244704442747]
This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. We present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding.
arXiv Detail & Related papers (2025-03-10T01:02:01Z) - Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering [13.294004180200496]
We introduce Local-Global Question Aware Video Embedding (LGQAVE), which incorporates three major innovations to better integrate multi-modal knowledge. LGQAVE moves beyond traditional ad-hoc frame sampling by using a cross-attention mechanism that precisely identifies the frames most relevant to the question. An additional cross-attention module integrates these local and global embeddings to generate the final video embeddings, which a language model uses to generate answers.
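The summary suggests that cross-attention scores between the question and the frames drive frame selection. A minimal sketch of that idea follows; the top-k rule, the scaling, and the dimensions are illustrative assumptions rather than LGQAVE's exact mechanism.

```python
# Question-aware frame selection via scaled dot-product attention scores.
import torch
import torch.nn.functional as F

def select_relevant_frames(frames: torch.Tensor, question: torch.Tensor, k: int = 8):
    # frames: (T, D) per-frame features; question: (D,) pooled question embedding
    scores = frames @ question / frames.shape[-1] ** 0.5  # scaled dot-product
    weights = F.softmax(scores, dim=0)                    # attention over frames
    topk = torch.topk(weights, k).indices.sort().values   # keep temporal order
    return frames[topk], topk

frames = torch.randn(64, 512)
question = torch.randn(512)
selected, idx = select_relevant_frames(frames, question)
print(selected.shape, idx.tolist())  # torch.Size([8, 512]) + kept frame indices
```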
arXiv Detail & Related papers (2024-12-12T12:39:07Z) - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
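A minimal sketch of semantic-aware redundancy reduction over frames, in the spirit of the construction pipeline above: drop a frame when its embedding is nearly identical to the last kept frame. The greedy rule and the 0.95 threshold are illustrative assumptions, not the paper's pipeline.

```python
# Greedy semantic deduplication of frame embeddings by cosine similarity.
import torch
import torch.nn.functional as F

def reduce_redundancy(frame_embs: torch.Tensor, threshold: float = 0.95):
    kept = [0]  # always keep the first frame
    for i in range(1, frame_embs.shape[0]):
        sim = F.cosine_similarity(frame_embs[i], frame_embs[kept[-1]], dim=0)
        if sim < threshold:  # semantically new content: keep this frame
            kept.append(i)
    return kept

embs = torch.randn(100, 512)
print(f"kept {len(reduce_redundancy(embs))} of 100 frames")
```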
arXiv Detail & Related papers (2024-11-22T08:33:36Z) - Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
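A minimal sketch of a domain-specific entity-action prompter in the spirit of HeurVidQA: verbalize entity-action pairs into prompts that steer a video-language model toward the relevant cues. The template and the traffic vocabulary are illustrative assumptions.

```python
# Build entity-action probe prompts for a question (hypothetical vocabulary).
ENTITY_ACTIONS = {
    "car": ["accelerates", "brakes", "changes lane"],
    "pedestrian": ["crosses", "waits", "runs"],
}

def build_prompts(question: str) -> list[str]:
    prompts = []
    for entity, actions in ENTITY_ACTIONS.items():
        if entity in question.lower():
            prompts += [f"Does the {entity} {a} in the video?" for a in actions]
    return prompts

for p in build_prompts("Why did the car stop?"):
    print(p)
```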
arXiv Detail & Related papers (2024-10-12T06:22:23Z) - Video as Conditional Graph Hierarchy for Multi-Granular Question
Answering [80.94367625007352]
We argue that while video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
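A minimal sketch of level-wise, query-conditioned aggregation, illustrating the conditional-hierarchy idea above: visual facts at one granularity are pooled into the next level under the guidance of the question. The gating form here is an illustrative assumption, not the paper's exact operator.

```python
# Query-conditioned pooling of one hierarchy level into the next.
import torch
import torch.nn as nn

class ConditionalPool(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores each node given the query

    def forward(self, nodes: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # nodes: (N, D) facts at the current level; query: (D,)
        q = query.expand(nodes.shape[0], -1)
        w = torch.softmax(self.score(torch.cat([nodes, q], -1)).squeeze(-1), 0)
        return (w.unsqueeze(-1) * nodes).sum(0)  # one node at the next level

# Objects -> frame: pool 10 object features into one frame-level feature.
pool = ConditionalPool()
objs, query = torch.randn(10, 256), torch.randn(256)
print(pool(objs, query).shape)  # torch.Size([256])
```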
arXiv Detail & Related papers (2021-12-12T10:35:19Z) - NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z) - Hierarchical Conditional Relation Networks for Multimodal Video Question
Answering [67.85579756590478]
Video QA adds at least two more layers of complexity, one of which is selecting relevant content for each channel in the context of a linguistic query.
The Conditional Relation Network (CRN) takes as input a set of tensorial objects and transforms it into a new set of objects that encode relations among the inputs.
CRN is then applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content, and long-form, where associated information such as subtitles is also presented.
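A minimal sketch of a CRN-style unit as described above: take a set of input objects plus a conditioning feature, and emit new objects that encode relations among the inputs. The real CRN aggregates over k-tuple subsets; this pairwise version is a simplified illustrative assumption.

```python
# Pairwise, condition-aware relation encoding over a set of objects.
import itertools
import torch
import torch.nn as nn

class PairwiseCRNSketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.relate = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())

    def forward(self, objects: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # objects: (N, D) input set; condition: (D,) e.g. a question feature
        outs = []
        for i, j in itertools.combinations(range(objects.shape[0]), 2):
            pair = torch.cat([objects[i], objects[j], condition])
            outs.append(self.relate(pair))
        return torch.stack(outs)  # (N choose 2, D) relation-encoded objects

crn = PairwiseCRNSketch()
clips, question = torch.randn(4, 256), torch.randn(256)
print(crn(clips, question).shape)  # torch.Size([6, 256])
```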
arXiv Detail & Related papers (2020-10-18T02:31:06Z)
- Hierarchical Conditional Relation Networks for Video Question Answering [62.1146543269993]
We introduce a general-purpose reusable neural unit called the Conditional Relation Network (CRN).
CRN serves as a building block to construct more sophisticated structures for representation and reasoning over video.
Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
arXiv Detail & Related papers (2020-02-25T07:00:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.