Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments
- URL: http://arxiv.org/abs/2601.16333v1
- Date: Thu, 22 Jan 2026 21:40:08 GMT
- Title: Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments
- Authors: Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle
- Abstract summary: We study the ability of models to identify the most important sub-events in a video. We evaluate models on their ability to distinguish between important and non-important sub-events in a game.
- Score: 11.490236862362801
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Foundation models are used for many real-world applications involving language generation from temporally-ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, which is a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and non-important sub-events in a game. To this end, we construct a new dataset by leveraging human preferences for importance implicit in football game highlight reels, without any additional annotation costs. Using our dataset, which we will publicly release to the community, we compare several state-of-the-art multimodal models and show that they are not far from chance level performance. Analyses of models beyond standard evaluation metrics reveal their tendency to rely on a single dominant modality and their ineffectiveness in synthesizing necessary information from multiple sources. Our findings underline the importance of modular architectures that can handle sample-level heterogeneity in multimodal data and the need for complementary training procedures that can maximize cross-modal synergy.
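The abstract describes deriving importance labels from the human preferences implicit in highlight reels, without giving implementation details. A minimal sketch of one way such labels could be derived is shown below; the segment representation, the temporal-overlap threshold, and all names are illustrative assumptions rather than the authors' actual pipeline.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start: float  # seconds from kickoff
    end: float

def overlap(a: Segment, b: Segment) -> float:
    """Length (in seconds) of the temporal intersection of two segments."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def label_importance(sub_events: List[Segment],
                     highlight_clips: List[Segment],
                     min_coverage: float = 0.5) -> List[int]:
    """Label a sub-event as important (1) if some highlight clip covers at least
    `min_coverage` of its duration, otherwise non-important (0).
    The threshold is a hypothetical choice, not taken from the paper."""
    labels = []
    for ev in sub_events:
        duration = max(ev.end - ev.start, 1e-6)
        covered = max((overlap(ev, clip) for clip in highlight_clips), default=0.0)
        labels.append(int(covered / duration >= min_coverage))
    return labels

# Toy usage: two sub-events, only the first of which appears in the highlight reel.
events = [Segment(120.0, 150.0), Segment(300.0, 330.0)]
highlights = [Segment(118.0, 152.0)]
print(label_importance(events, highlights))  # [1, 0]
```

Under these assumptions, any sub-event whose duration is sufficiently covered by a highlight clip inherits an "important" label for free, consistent with the abstract's claim that no additional annotation costs are incurred.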
Related papers
- Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations [14.972702558607557]
We introduce the first standardized benchmark for evaluating multi-factor sequential disentanglement across six diverse datasets. We propose a post-hoc Latent Exploration Stage to automatically align latent dimensions with semantic factors, and introduce a Koopman-inspired model that achieves state-of-the-art results. Our code is available on GitHub, and the datasets and trained models are available on Hugging Face.
arXiv Detail & Related papers (2025-10-20T08:58:23Z)
- Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models [63.032359320629105]
We introduce Unpaired Multimodal, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. We show that using unpaired data from auxiliary modalities consistently improves downstream performance across diverse unimodal targets such as image and audio.
arXiv Detail & Related papers (2025-10-09T17:32:23Z)
- Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline [56.790045049514326]
Two major forms of deception dominate: human-crafted misinformation and AI-generated content. We propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines.
arXiv Detail & Related papers (2025-09-30T09:26:32Z)
- Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional [40.11148315577635]
We present a large-scale empirical study to quantify dependencies across 23 visual question-answering benchmarks using multi-modal large language models (MLLMs). Our findings show that the reliance on vision, question (text), and their interaction varies significantly, both across and within benchmarks. We discover that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies. This characterization persists across model sizes, as larger models often use these intra-modality dependencies to achieve high performance that masks an underlying lack of multi-modal reasoning.
arXiv Detail & Related papers (2025-09-27T21:13:29Z)
- Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy [2.294223504228228]
Multimodal learning, a rapidly evolving field in artificial intelligence, seeks to construct more versatile and robust systems. Inspired by the human ability to assimilate information through many senses, this approach enables applications such as text-to-video conversion, visual question answering, and image captioning. Recent developments in datasets that support multimodal large language models (MLLMs) are highlighted in this overview.
arXiv Detail & Related papers (2024-12-23T18:15:19Z)
- Promoting cross-modal representations to improve multimodal foundation models for physiological signals [3.630706646160043]
We use a masked autoencoding objective to pretrain a multimodal model.
We show that the model learns representations that can be linearly probed for a diverse set of downstream tasks.
We argue that explicit methods for inducing cross-modality may enhance multimodal pretraining strategies.
arXiv Detail & Related papers (2024-10-21T18:47:36Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z)
- Perceptual Score: What Data Modalities Does Your Model Perceive? [73.75255606437808]
We introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features.
We find that recent, more accurate multi-modal models for visual question-answering tend to perceive the visual data less than their predecessors.
Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions; a minimal sketch of this permutation-style analysis appears after this list.
arXiv Detail & Related papers (2021-10-27T12:19:56Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not only on the commonality between modalities, but also on the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately distinguish related samples from unrelated ones, making it possible to exploit the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
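Several entries above, as well as the dominant-modality finding in the main abstract, revolve around measuring how much a trained model actually relies on each modality. A minimal sketch of a permutation-style reliance check, in the spirit of the Perceptual Score entry, is given below; the predictor interface, the data layout, and the shuffling protocol are assumptions for illustration, not the published metric's implementation.

```python
import random
from typing import Callable, Sequence, Tuple

def accuracy(predict: Callable[[object, object], int],
             data: Sequence[Tuple[object, object, int]]) -> float:
    """Fraction of (video_feat, text_feat, label) examples predicted correctly."""
    correct = sum(predict(v, t) == y for v, t, y in data)
    return correct / len(data)

def modality_reliance(predict: Callable[[object, object], int],
                      data: Sequence[Tuple[object, object, int]],
                      permute: str,
                      seed: int = 0) -> float:
    """Accuracy drop when one modality is shuffled across the dataset.

    Shuffling breaks that modality's association with the labels while keeping
    its marginal distribution; a large drop suggests the model genuinely relies
    on the modality, while a near-zero drop suggests it is mostly ignored.
    """
    rng = random.Random(seed)
    videos = [v for v, _, _ in data]
    texts = [t for _, t, _ in data]
    if permute == "video":
        rng.shuffle(videos)
    elif permute == "text":
        rng.shuffle(texts)
    else:
        raise ValueError("permute must be 'video' or 'text'")
    permuted = [(v, t, y) for v, t, (_, _, y) in zip(videos, texts, data)]
    return accuracy(predict, data) - accuracy(predict, permuted)

# Toy usage with a text-only "model": permuting the video changes nothing.
data = [(i, i % 2, i % 2) for i in range(100)]       # label equals the text feature
text_only = lambda video_feat, text_feat: text_feat  # ignores the video entirely
print(modality_reliance(text_only, data, permute="video"))  # 0.0
print(modality_reliance(text_only, data, permute="text"))   # typically around 0.5
```

In this setup, a large drop for one modality and a near-zero drop for the other would mirror the single-dominant-modality behaviour that the main abstract reports for current multimodal models.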