Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach
- URL: http://arxiv.org/abs/2511.17618v1
- Date: Tue, 18 Nov 2025 13:45:50 GMT
- Title: Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach
- Authors: Ju-Young Oh,
- Abstract summary: We introduce Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach (FIQ), a framework designed to enhance the reasoning capability of VQA models by improving their foundational comprehension of video content. Experimental results on the SUTD-TrafficQA dataset demonstrate that FIQ achieves state-of-the-art performance, surpassing existing baseline approaches.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional video question answering (VQA) approaches primarily rely on question-answer (Q&A) pairs to learn the spatio-temporal dynamics of video content. However, most existing annotations are event-centric, which restricts the model's ability to capture the comprehensive context of a scene. The lack of fundamental information such as object categories, spatial configurations, and descriptive visual attributes prevents the model from forming a complete understanding of the environment, ultimately limiting its generalization and reasoning capability. In this paper, we introduce Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach (FIQ), a framework designed to enhance the reasoning capability of VQA models by improving their foundational comprehension of video content. FIQ generates Q&A pairs from descriptive information extracted directly from videos, thereby enriching the dataset with core scene-level attributes. These generated pairs help the model develop a more holistic understanding of the video, leading to improved generalizability and reasoning performance. In addition, we propose a VQ-CAlign module that aligns task-specific question embeddings with corresponding visual features, preserving essential contextual cues and enhancing adaptability to downstream tasks. Experimental results on the SUTD-TrafficQA dataset demonstrate that FIQ achieves state-of-the-art performance, surpassing existing baseline approaches.
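The abstract does not spell out the internals of VQ-CAlign, but a cross-attention layer in which question embeddings attend to visual features is a natural reading of "aligns task-specific question embeddings with corresponding visual features." Below is a minimal sketch under that assumption; the dimensions, residual design, and the class name are illustrative, not the authors' published implementation.

```python
# A minimal sketch of a VQ-CAlign-style alignment module, assuming it is
# implemented as cross-attention from question embeddings to visual features.
import torch
import torch.nn as nn

class VQCAlignSketch(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Question tokens act as queries; video features act as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # question: (B, Lq, D) task-specific question embeddings
        # video:    (B, Lv, D) per-frame (or per-region) visual features
        aligned, _ = self.cross_attn(question, video, video)
        # Residual connection preserves the original question context.
        return self.norm(question + aligned)

# Usage: align 20-token questions with 32 frames of 512-d features.
q = torch.randn(2, 20, 512)
v = torch.randn(2, 32, 512)
print(VQCAlignSketch()(q, v).shape)  # torch.Size([2, 20, 512])
```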
Related papers
- FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering [26.585985828583304]
Video question answering (VQA) is a task that requires the interpretation of a video to answer a given question. We propose a novel approach designed to strengthen the reasoning ability of the model by enhancing the fundamental understanding of videos.
arXiv Detail & Related papers (2025-07-17T06:19:38Z)
- ImplicitQA: Going beyond frames towards Implicit Video Reasoning [39.63171940350552]
ImplicitQA is a novel benchmark designed to test VideoQA models on human-like implicit reasoning. ImplicitQA comprises 1K meticulously annotated QA pairs drawn from 1K high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z)
- REVEAL: Relation-based Video Representation Learning for Video-Question-Answering [14.867263291053968]
We propose RElation-based rEpresentAtion Learning (REVEAL) to capture visual relation information. Inspired by spatio-temporal scene graphs, we encode video sequences as sets of relation triplets in the form of (subject-predicate-object) over time via their language embeddings. We evaluate the proposed framework on five challenging benchmarks: NExT-QA, Intent-QA, STAR, VLEP, and TVQA.
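A minimal sketch of the triplet-encoding idea described above, using an off-the-shelf sentence encoder as a stand-in for the paper's language embeddings; the encoder choice, the verbalization template, and the example triplets are assumptions for illustration.

```python
# Encode (subject-predicate-object) triplets per timestep as language
# embeddings, in the spirit of REVEAL's relation-based representation.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# One set of relation triplets per timestep of the video (hypothetical data).
triplets_over_time = [
    [("car", "approaches", "intersection"), ("pedestrian", "waits at", "curb")],
    [("car", "stops at", "crosswalk"), ("pedestrian", "crosses", "road")],
]

for t, triplets in enumerate(triplets_over_time):
    # Verbalize each triplet so the text encoder can embed it.
    sentences = [f"{s} {p} {o}" for s, p, o in triplets]
    embeddings = encoder.encode(sentences)  # (num_triplets, 384)
    print(f"t={t}: {embeddings.shape}")
```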
arXiv Detail & Related papers (2025-04-07T19:54:04Z)
- Towards Fine-Grained Video Question Answering [17.582244704442747]
This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. We present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding.
arXiv Detail & Related papers (2025-03-10T01:02:01Z) - Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering [13.294004180200496]
We introduce Local-Global Question Aware Video Embedding (LGQAVE), which incorporates three major innovations to better integrate multi-modal knowledge. LGQAVE moves beyond traditional ad-hoc frame sampling by using a cross-attention mechanism that precisely identifies the frames most relevant to the question. An additional cross-attention module integrates these local and global embeddings to generate the final video embeddings, which a language model uses to generate answers.
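The summary suggests that cross-attention scores between the question and the frames drive frame selection. A minimal sketch of that idea follows; the top-k rule, the scaling, and the dimensions are illustrative assumptions rather than LGQAVE's exact mechanism.

```python
# Question-aware frame selection via scaled dot-product attention scores.
import torch
import torch.nn.functional as F

def select_relevant_frames(frames: torch.Tensor, question: torch.Tensor, k: int = 8):
    # frames: (T, D) per-frame features; question: (D,) pooled question embedding
    scores = frames @ question / frames.shape[-1] ** 0.5  # scaled dot-product
    weights = F.softmax(scores, dim=0)                    # attention over frames
    topk = torch.topk(weights, k).indices.sort().values   # keep temporal order
    return frames[topk], topk

frames = torch.randn(64, 512)
question = torch.randn(512)
selected, idx = select_relevant_frames(frames, question)
print(selected.shape, idx.tolist())  # torch.Size([8, 512]) + kept frame indices
```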
arXiv Detail & Related papers (2024-12-12T12:39:07Z) - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
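A minimal sketch of semantic-aware redundancy reduction over frames, in the spirit of the construction pipeline above: drop a frame when its embedding is nearly identical to the last kept frame. The greedy rule and the 0.95 threshold are illustrative assumptions, not the paper's pipeline.

```python
# Greedy semantic deduplication of frame embeddings by cosine similarity.
import torch
import torch.nn.functional as F

def reduce_redundancy(frame_embs: torch.Tensor, threshold: float = 0.95):
    kept = [0]  # always keep the first frame
    for i in range(1, frame_embs.shape[0]):
        sim = F.cosine_similarity(frame_embs[i], frame_embs[kept[-1]], dim=0)
        if sim < threshold:  # semantically new content: keep this frame
            kept.append(i)
    return kept

embs = torch.randn(100, 512)
print(f"kept {len(reduce_redundancy(embs))} of 100 frames")
```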
arXiv Detail & Related papers (2024-11-22T08:33:36Z) - Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
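A minimal sketch of a domain-specific entity-action prompter in the spirit of HeurVidQA: verbalize entity-action pairs into prompts that steer a video-language model toward the relevant cues. The template and the traffic vocabulary are illustrative assumptions.

```python
# Build entity-action probe prompts for a question (hypothetical vocabulary).
ENTITY_ACTIONS = {
    "car": ["accelerates", "brakes", "changes lane"],
    "pedestrian": ["crosses", "waits", "runs"],
}

def build_prompts(question: str) -> list[str]:
    prompts = []
    for entity, actions in ENTITY_ACTIONS.items():
        if entity in question.lower():
            prompts += [f"Does the {entity} {a} in the video?" for a in actions]
    return prompts

for p in build_prompts("Why did the car stop?"):
    print(p)
```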
arXiv Detail & Related papers (2024-10-12T06:22:23Z) - Video as Conditional Graph Hierarchy for Multi-Granular Question
Answering [80.94367625007352]
We argue that while video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
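A minimal sketch of level-wise, query-conditioned aggregation, illustrating the conditional-hierarchy idea above: visual facts at one granularity are pooled into the next level under the guidance of the question. The gating form here is an illustrative assumption, not the paper's exact operator.

```python
# Query-conditioned pooling of one hierarchy level into the next.
import torch
import torch.nn as nn

class ConditionalPool(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores each node given the query

    def forward(self, nodes: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # nodes: (N, D) facts at the current level; query: (D,)
        q = query.expand(nodes.shape[0], -1)
        w = torch.softmax(self.score(torch.cat([nodes, q], -1)).squeeze(-1), 0)
        return (w.unsqueeze(-1) * nodes).sum(0)  # one node at the next level

# Objects -> frame: pool 10 object features into one frame-level feature.
pool = ConditionalPool()
objs, query = torch.randn(10, 256), torch.randn(256)
print(pool(objs, query).shape)  # torch.Size([256])
```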
arXiv Detail & Related papers (2021-12-12T10:35:19Z) - NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z) - Hierarchical Conditional Relation Networks for Multimodal Video Question
Answering [67.85579756590478]
Video QA adds at least two more layers of complexity, one of which is selecting relevant content for each channel in the context of a linguistic query.
The Conditional Relation Network (CRN) takes as input a set of tensorial objects and transforms it into a new set of objects that encode relations among the inputs.
CRN is then applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content, and long-form, where associated information such as subtitles is also presented.
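A minimal sketch of a CRN-style unit as described above: take a set of input objects plus a conditioning feature, and emit new objects that encode relations among the inputs. The real CRN aggregates over k-tuple subsets; this pairwise version is a simplified illustrative assumption.

```python
# Pairwise, condition-aware relation encoding over a set of objects.
import itertools
import torch
import torch.nn as nn

class PairwiseCRNSketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.relate = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())

    def forward(self, objects: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # objects: (N, D) input set; condition: (D,) e.g. a question feature
        outs = []
        for i, j in itertools.combinations(range(objects.shape[0]), 2):
            pair = torch.cat([objects[i], objects[j], condition])
            outs.append(self.relate(pair))
        return torch.stack(outs)  # (N choose 2, D) relation-encoded objects

crn = PairwiseCRNSketch()
clips, question = torch.randn(4, 256), torch.randn(256)
print(crn(clips, question).shape)  # torch.Size([6, 256])
```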
arXiv Detail & Related papers (2020-10-18T02:31:06Z)
- Hierarchical Conditional Relation Networks for Video Question Answering [62.1146543269993]
We introduce a general-purpose reusable neural unit called the Conditional Relation Network (CRN).
CRN serves as a building block to construct more sophisticated structures for representation and reasoning over video.
Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
arXiv Detail & Related papers (2020-02-25T07:00:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.