IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes
- URL: http://arxiv.org/abs/2506.21116v2
- Date: Tue, 08 Jul 2025 02:46:17 GMT
- Title: IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes
- Authors: Yujia Liang, Jile Jiao, Xuetao Feng, Zixuan Ye, Yuan Wang, Zhicheng Wang,
- Abstract summary: We introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We then contribute a new model, IPFormer-VideoLLM, which injects instance-level features as instance prompts through an efficient attention-based connector.
- Score: 20.662082715151886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Large Language Models (VideoLLMs) have demonstrated remarkable understanding capabilities, but struggle to tackle multi-shot scenarios, e.g., video clips with varying camera angles or scene changes. This challenge can cause failures such as instance identity forgetting and key-frame neglect. In this work, we first attribute the challenge to the lack of multi-shot annotations among existing datasets, and therefore introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We empirically find that the training set significantly boosts multi-shot performance, while the testing benchmark provides a reliable measure of model capability in multi-shot scenarios. By further analyzing and discovering that current models only encode instance features in a discrete or lossy manner, at the risk of missing identity information, we then contribute a new model, IPFormer-VideoLLM. Its key idea is the injection of instance-level features as instance prompts through an efficient attention-based connector. This allows for the aggregation of instance-specific information across scenes. Experiments demonstrate that our proposed dataset and model not only significantly enhance multi-scene video understanding, but also offer distinct advantages across various video benchmarks.
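The abstract describes the connector only at a high level: learnable instance prompts act as queries in an attention operation over frame tokens, aggregating instance-specific information across shots. The paper's actual architecture is not reproduced here; the following is a minimal single-head cross-attention sketch of that idea, with all names (`instance_prompt_connector`, the toy shapes) being illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def instance_prompt_connector(frame_tokens, instance_prompts):
    """Single-head cross-attention sketch: instance prompts (queries)
    aggregate instance-specific information from frame tokens
    (keys/values) pooled across all shots of the video."""
    d = frame_tokens.shape[-1]
    scores = instance_prompts @ frame_tokens.T / np.sqrt(d)  # (P, T)
    attn = softmax(scores, axis=-1)                          # rows sum to 1
    return attn @ frame_tokens                               # (P, d)

# Toy example: 8 frame tokens (e.g., 2 shots x 4 tokens),
# 3 instance prompts, feature dim 8.
rng = np.random.default_rng(0)
frame_tokens = rng.normal(size=(8, 8))
instance_prompts = rng.normal(size=(3, 8))
out = instance_prompt_connector(frame_tokens, instance_prompts)
print(out.shape)  # (3, 8)
```

Because the prompts attend over tokens from every shot at once, each prompt can gather evidence for one instance even when that instance reappears after a scene change, which is the failure mode (identity forgetting) the paper targets.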
Related papers
- ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts [64.93416171745693]
Reasoning video object segmentation is a challenging task that generates a mask sequence from an input video and an implicit, complex text query. Existing works probe the problem by finetuning Multimodal Large Language Models (MLLMs) for segmentation-based output, while still falling short on difficult videos given temporally sensitive queries. We propose ThinkVideo, a novel framework which leverages the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges.
arXiv Detail & Related papers (2025-05-24T07:01:31Z) - ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models [37.70850513700251]
Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot. We propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach enables generation of multi-shot videos as a single video with full attention across all frames of all shots.
arXiv Detail & Related papers (2025-05-12T15:22:28Z) - Multimodal Contextualized Support for Enhancing Video Retrieval System [0.0]
We propose a system that integrates a novel retrieval pipeline extracting multimodal data and incorporating information from multiple frames within a video. The pipeline captures latent meanings, focusing on what can be inferred from the video clip rather than on object detection in a single image.
arXiv Detail & Related papers (2024-12-10T15:20:23Z) - EVC-MF: End-to-end Video Captioning Network with Multi-scale Features [13.85795110061781]
We propose an end-to-end encoder-decoder-based network (EVC-MF) for video captioning.
It efficiently utilizes multi-scale visual and textual features to generate video descriptions.
The results demonstrate that EVC-MF yields competitive performance compared with state-of-the-art methods.
arXiv Detail & Related papers (2024-10-22T02:16:02Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal module.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Less than Few: Self-Shot Video Instance Segmentation [50.637278655763616]
We propose to automatically learn to find appropriate support videos given a query.
We tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting.
We provide strong baseline performances that utilize a novel transformer-based model.
arXiv Detail & Related papers (2022-04-19T13:14:43Z) - Reliable Shot Identification for Complex Event Detection via
Visual-Semantic Embedding [72.9370352430965]
We propose a visual-semantic guided loss method for event detection in videos.
Motivated by curriculum learning, we introduce a negative elastic regularization term to start training the classifier with instances of high reliability.
An alternating optimization algorithm is developed to solve the proposed challenging nonconvex regularization problem.
arXiv Detail & Related papers (2021-10-12T11:46:56Z) - Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition which is termed as AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z) - Frame Aggregation and Multi-Modal Fusion Framework for Video-Based
Person Recognition [13.875674649636874]
We propose a Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition.
FAMF aggregates face features and incorporates them with multi-modal information to identify persons in videos.
We show that introducing an attention mechanism to NetVLAD can effectively decrease the impact of low-quality frames.
arXiv Detail & Related papers (2020-10-19T08:06:40Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model is also comprised of dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.