Related papers: Listen Then See: Video Alignment with Speaker Attention

Listen Then See: Video Alignment with Speaker Attention

URL: http://arxiv.org/abs/2404.13530v1
Date: Sun, 21 Apr 2024 04:55:13 GMT
Title: Listen Then See: Video Alignment with Speaker Attention
Authors: Aviral Agrawal, Carlos Mateo Samudio Lezcano, Iqui Balam Heredia-Marin, Prabhdeep Singh Sethi,
Abstract summary: Socially Intelligent Question Answering (SIQA) requires context understanding, temporal reasoning, and the integration of multimodal information. We introduce a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge with the language modality.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video-based Question Answering (Video QA) is a challenging task and becomes even more intricate when addressing Socially Intelligent Question Answering (SIQA). SIQA requires context understanding, temporal reasoning, and the integration of multimodal information, but in addition, it requires processing nuanced human behavior. Furthermore, the complexities involved are exacerbated by the dominance of the primary modality (text) over the others. Thus, there is a need to help the task's secondary modalities to work in tandem with the primary modality. In this work, we introduce a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results (82.06\% accuracy) on the Social IQ 2.0 dataset for SIQA. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge with the language modality. This leads to enhanced performance by reducing the prevalent issue of language overfitting and resultant video modality bypassing encountered by current existing techniques. Our code and models are publicly available at https://github.com/sts-vlcc/sts-vlcc

Related papers

Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering [13.757806950813995]
Audio--Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions.<n>We propose a novel Query-guided Spatial--Temporal--Frequency interaction method to enhance audio--visual understanding.<n>Our proposed method achieves significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches.
arXiv Detail & Related papers (2026-01-27T17:24:32Z)
LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering [10.060267989615813]
We introduce LeAdQA, an innovative approach that bridges these gaps through synergizing causal-aware query refinement with fine-grained visual grounding.<n> Experiments on NExT-QA, IntentQA, and NExT-GQA demonstrate that our method's precise visual grounding substantially enhances the understanding of video-question relationships.
arXiv Detail & Related papers (2025-07-20T01:57:00Z)
FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering [26.585985828583304]
Video question of answering (VQA) is a task that requires the interpretation of a video to answer a given question.<n>We propose a novel approach designed to strengthen the reasoning ability of model by enhancing the fundamental understanding of videos.
arXiv Detail & Related papers (2025-07-17T06:19:38Z)
Spoken question answering for visual queries [14.834200714168546]
This work aims to create a system that enables user interaction through both speech and images.<n>The resulting multi-modal model has textual, visual, and spoken inputs and can answer spoken questions on images.
arXiv Detail & Related papers (2025-05-29T10:06:48Z)
Cross-modal Causal Relation Alignment for Video Question Grounding [44.97933293141372]
Video question grounding (VideoQG) requires models to answer the questions and simultaneously infer the relevant video segments to support the answers. Existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. We propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA), to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding.
arXiv Detail & Related papers (2025-03-05T01:36:32Z)
ReasVQA: Advancing VideoQA with Imperfect Reasoning Process [38.4638171723351]
textbfReasVQA (Reasoning-enhanced Video Question Answering) is a novel approach that leverages reasoning processes generated by Multimodal Large Language Models (MLLMs) to improve the performance of VideoQA models. We evaluate ReasVQA on three popular benchmarks, and our results establish new state-of-the-art performance with significant improvements of +2.9 on NExT-QA, +7.3 on STAR, and +5.9 on IntentQA.
arXiv Detail & Related papers (2025-01-23T10:35:22Z)
Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries [50.47265863322891]
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-26T17:53:14Z)
VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired. We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute. This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval. We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
First Place Solution to the CVPR'2023 AQTC Challenge: A Function-Interaction Centric Approach with Spatiotemporal Visual-Language Alignment [15.99008977852437]
Affordance-Centric Question-driven Task Completion (AQTC) has been proposed to acquire from videos to users with comprehensive and systematic instructions. Existing methods have neglected the necessity of aligning visual and linguistic signals, as well as the crucial interactional information between humans objects. We propose to combine largescale pre-trained vision- and video-language models, which serve to contribute stable and reliable multimodal data.
arXiv Detail & Related papers (2023-06-23T09:02:25Z)
Structured Two-stream Attention Network for Video Question Answering [168.95603875458113]
We propose a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instance, reduces the influence of background video and focuses on the relevant text.
arXiv Detail & Related papers (2022-06-02T12:25:52Z)
NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark. We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension. We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
Relation-aware Hierarchical Attention Framework for Video Question Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos. In particular, videos and questions are embedded by pre-trained models firstly to obtain the visual and textual features. We consider the temporal, spatial, and semantic relations, and fuse the multimodal features by hierarchical attention mechanism to predict the answer.
arXiv Detail & Related papers (2021-05-13T09:35:42Z)
HySTER: A Hybrid Spatio-Temporal Event Reasoner [75.41988728376081]
We present the HySTER: a Hybrid Spatio-Temporal Event Reasoner to reason over physical events in videos. We define a method based on general temporal, causal and physics rules which can be transferred across tasks. This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.
arXiv Detail & Related papers (2021-01-17T11:07:17Z)
Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context. By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations. Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
Multimodal Dialogue State Tracking By QA Approach with Data Augmentation [16.436557991074068]
This paper interprets the Audio-Video Scene-Aware Dialogue (AVSD) task from an open-domain Question Answering (QA) point of view. The proposed QA system uses common encoder-decoder framework with multimodal fusion and attention. Our experiments show that our model and techniques bring significant improvements over the baseline model on the DSTC7-AVSD dataset.
arXiv Detail & Related papers (2020-07-20T06:23:18Z)
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions. Our model is also comprised of dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and which pass more relevant information to gates. We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.