LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video
Question Answering
- URL: http://arxiv.org/abs/2111.14547v2
- Date: Tue, 30 Nov 2021 02:18:26 GMT
- Title: LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video
Question Answering
- Authors: Jingjing Jiang, Ziyi Liu, and Nanning Zheng
- Abstract summary: We propose a Lightweight Visual-Linguistic Reasoning framework named LiVLR.
LiVLR first utilizes the graph-based Visual and Linguistic Encoders to obtain multi-grained visual and linguistic representations.
The proposed LiVLR is lightweight and shows its performance advantage on two VideoQA benchmarks.
- Score: 50.11756459499762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Question Answering (VideoQA), aiming to correctly answer the given
question based on understanding multi-modal video content, is challenging due
to the rich video content. From the perspective of video understanding, a good
VideoQA framework needs to understand the video content at different semantic
levels and flexibly integrate the diverse video content to distill
question-related content. To this end, we propose a Lightweight
Visual-Linguistic Reasoning framework named LiVLR. Specifically, LiVLR first
utilizes the graph-based Visual and Linguistic Encoders to obtain multi-grained
visual and linguistic representations. Subsequently, the obtained
representations are integrated with the devised Diversity-aware
Visual-Linguistic Reasoning module (DaVL). The DaVL considers the difference
between the different types of representations and can flexibly adjust the
importance of different types of representations when generating the
question-related joint representation, which is an effective and general
representation integration method. The proposed LiVLR is lightweight and shows
its performance advantage on two VideoQA benchmarks, MSRVTT-QA and KnowIT VQA.
Extensive ablation studies demonstrate the effectiveness of LiVLR's key
components.
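The abstract describes DaVL only in prose, so the sketch below illustrates one plausible form of its diversity-aware integration step: each representation type is projected, scored against the question embedding, and the softmax over those scores decides how much each type contributes to the joint representation. This is a minimal, hypothetical illustration; the class and variable names are assumptions, and the real DaVL operates on graph-based multi-grained representations rather than pre-pooled vectors.

```python
# Hypothetical illustration only: names and shapes are assumptions, not the
# authors' DaVL implementation, which works on graph-based multi-grained inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiversityAwareIntegration(nn.Module):
    """Fuses K types of representations into one question-related vector."""

    def __init__(self, dim: int, num_types: int):
        super().__init__()
        # Type-specific projections account for differences between representation types.
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_types)])
        # Scores how relevant each representation type is to the question.
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, reps: list[torch.Tensor], question: torch.Tensor) -> torch.Tensor:
        # reps: K tensors of shape (batch, dim); question: (batch, dim)
        projected = torch.stack([p(r) for p, r in zip(self.proj, reps)], dim=1)  # (B, K, D)
        q = question.unsqueeze(1).expand_as(projected)                           # (B, K, D)
        logits = self.score(torch.cat([projected, q], dim=-1)).squeeze(-1)       # (B, K)
        weights = F.softmax(logits, dim=-1)                   # importance of each type
        return (weights.unsqueeze(-1) * projected).sum(dim=1)  # (B, D) joint representation

# Example with four representation types (e.g. object-/frame-level visual,
# word-/sentence-level linguistic), batch size 2, feature dimension 256.
reps = [torch.randn(2, 256) for _ in range(4)]
question = torch.randn(2, 256)
fusion = DiversityAwareIntegration(dim=256, num_types=4)
joint = fusion(reps, question)  # question-related joint representation, shape (2, 256)
```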
Related papers
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z) - CLIPVQA: Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem (CLIPVQA).
The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs [22.696090318037925]
Long video understanding is a significant and ongoing challenge in the intersection of multimedia and artificial intelligence.
We present an Interactive Visual Adapter (IVA) within large language models (LLMs) to enhance interaction with fine-grained visual elements.
arXiv Detail & Related papers (2024-02-21T05:56:52Z) - Question Aware Vision Transformer for Multimodal Reasoning [14.188369270753347]
We introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning.
It embeds question awareness directly within the vision encoder.
This integration results in dynamic visual features that focus on the image aspects relevant to the posed question.
arXiv Detail & Related papers (2024-02-08T08:03:39Z) - Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their capacity for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
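As a side note on the VaQuitA entry above, its CLIP-score-guided frame sampling can be sketched in a few lines: rank candidate frames by CLIP similarity and keep the top-k instead of sampling uniformly. The sketch below is a minimal, hypothetical version that assumes the ranking target is the question text; the actual VaQuitA criterion and implementation may differ, and the function name select_frames is an assumption.

```python
# Hypothetical sketch, not VaQuitA's code: rank frames by CLIP similarity to the
# question text and keep the top-k in temporal order. The choice of the question
# text as the ranking target is an assumption for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_frames(frames: list[Image.Image], question: str, k: int = 8) -> list[Image.Image]:
    """Return the k frames most similar to the question under CLIP."""
    inputs = processor(text=[question], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(-1)             # one similarity score per frame
    top = torch.topk(scores, k=min(k, len(frames))).indices.tolist()
    return [frames[i] for i in sorted(top)]               # preserve temporal order
```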